Terrier Core

Term Pipeline only supports token events

Details

  • Type: Improvement
  • Status: Resolved
  • Priority: Major
  • Resolution: Duplicate
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Description:
    The TermPipeline introduced for Terrier 1 allows tokens to be transformed during indexing, e.g. by stemming, stopword removal, and more. The design has proved useful; however, it has some limitations. In particular, the TermPipeline objects may require access to state information.

    For instance, consider the following examples which require state:
     * POS tagger: needs to know when a sentence boundary occurs, and when the document ends. It also needs to decorate the tokens with POS somehow
     * Language-specific stemming: needs to know when the language of a document (or query) stream has changed

    To this end, there are in fact two problems:
    1. Access to events other than tokens
    2. Access to state associated with events: e.g. a document boundary has a document name, a token may have a position, and/or fields
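    To make the limitation concrete, here is a minimal sketch of a token-only chain. TokenStage, LowercaseStage and CollectingStage are hypothetical stand-ins, not Terrier's actual TermPipeline classes; the only point is that processTerm(String) is the sole channel, so a stage has no way to learn that a document or sentence has ended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a token-only pipeline stage (not Terrier's real interface).
interface TokenStage {
    void processTerm(String term);
}

// Lowercases each token and pushes it to the next stage.
class LowercaseStage implements TokenStage {
    private final TokenStage next;
    LowercaseStage(TokenStage next) { this.next = next; }
    public void processTerm(String term) {
        next.processTerm(term.toLowerCase(Locale.ROOT));
    }
}

// Terminal stage: records what it receives.
class CollectingStage implements TokenStage {
    final List<String> seen = new ArrayList<>();
    public void processTerm(String term) { seen.add(term); }
}

class TokenOnlyDemo {
    // Feeds two documents through the chain; the boundary between them is invisible downstream.
    static List<String> run() {
        CollectingStage sink = new CollectingStage();
        TokenStage first = new LowercaseStage(sink);
        String[][] docs = { {"Hello", "World"}, {"Second", "Doc"} };
        for (String[] doc : docs) {
            for (String token : doc) {
                first.processTerm(token);
            }
            // No call exists to tell the stages that a document just ended.
        }
        return sink.seen;
    }
}
```

    The sink ends up with one flat list of four tokens, with nothing to mark where the first document stopped.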

  1. TR-10.v0.patch
    (5 kB)
    Craig Macdonald
    26/Mar/09 11:01 PM

Issue Links

Activity

Craig Macdonald added a comment - 13/Feb/09 5:24 PM

There are two designs in which this scheme could be carried out. In this comment, I enumerate both.

[1] A DOM style method for every event-type:
e.g.

interface EventPipeline
{
 public void eventFieldChange(Set<String> fields);
 public void eventDocument(Map<String,String> documentInfo);
 public void eventToken(String token, int position, long byteOffset);
 public void eventSentenceBoundary(String boundaryMarker);
}

Advantages:

  • Events themselves are lightweight (no extra object creation for every token)

Disadvantages:

  • Recall that most implementations will only use eventToken(). However, every implementation of EventPipeline would still have to forward all the other events on to the next object.
  • Difficult to add more event types.
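
The forwarding burden of design [1] can be sketched as follows. The EventPipeline interface is the one proposed above; LowercaseStage and RecordingSink are hypothetical classes added only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Design [1] as proposed above: one callback per event type.
interface EventPipeline {
    void eventFieldChange(Set<String> fields);
    void eventDocument(Map<String, String> documentInfo);
    void eventToken(String token, int position, long byteOffset);
    void eventSentenceBoundary(String boundaryMarker);
}

// Hypothetical stage that only cares about tokens, yet must forward everything else.
class LowercaseStage implements EventPipeline {
    private final EventPipeline next;
    LowercaseStage(EventPipeline next) { this.next = next; }

    public void eventToken(String token, int position, long byteOffset) {
        next.eventToken(token.toLowerCase(Locale.ROOT), position, byteOffset);
    }

    // Pure forwarding boilerplate, repeated in every stage:
    public void eventFieldChange(Set<String> fields) { next.eventFieldChange(fields); }
    public void eventDocument(Map<String, String> documentInfo) { next.eventDocument(documentInfo); }
    public void eventSentenceBoundary(String boundaryMarker) { next.eventSentenceBoundary(boundaryMarker); }
}

// Hypothetical terminal stage that logs the calls it receives.
class RecordingSink implements EventPipeline {
    final List<String> log = new ArrayList<>();
    public void eventFieldChange(Set<String> fields) { log.add("fields"); }
    public void eventDocument(Map<String, String> documentInfo) { log.add("doc"); }
    public void eventToken(String token, int position, long byteOffset) { log.add("token:" + token); }
    public void eventSentenceBoundary(String boundaryMarker) { log.add("sentence"); }
}
```

An adapter base class whose methods all forward by default would hide most of this boilerplate, but adding a fifth event type would still change the interface and break every class that implements it directly.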

[2] Use an abstract class to represent an event. Document implementations can choose the types of events they wish to produce, and each pipeline object can choose the events it wishes to process. Other events should be passed on to the next pipeline object unchanged.

interface EventPipeline
{
 public void processEvent(Event e);
}

/** Event is base class for all events */
abstract class Event{}

class TokenEvent extends Event
{
 public String getToken();
 public void setToken(String t);
 //token number - aka blocks
 public int getPosition();
 //byte offset in file/document
 public int getOffset();
 public Set<String> getFields();
}

class SentenceEvent extends Event {}

class DocumentEvent extends Event
{
 //key could be URL, docno, filename, etc.
 public String getDocumentProperty(String key);
}

Advantages:

  • Event can be subclassed for more types of events
  • Not every event causes a whole slew of method calls.

Disadvantages:

  • Event objects have to be created for every event. This may mean a new Set<String> and a TokenEvent object for every token, as the state of an Event is mutable. We need to consider carefully whether these objects can be made (a) immutable, and (b) lightweight, in that they can be pooled and re-used. Re-use is complicated because an EventPipeline object may not release an event immediately after processEvent() returns: a pipeline object may buffer tokens (e.g. up to a sentence or document boundary).
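
A minimal sketch of how a stage might look under design [2]. The immutable TokenEvent is an assumption (the proposal leaves mutability open), and StopwordStage and ListSink are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Design [2]: a single entry point taking an event object.
interface EventPipeline {
    void processEvent(Event e);
}

abstract class Event {}

// Immutable token event (an assumption, not part of the proposal).
final class TokenEvent extends Event {
    private final String token;
    private final int position;
    TokenEvent(String token, int position) {
        this.token = token;
        this.position = position;
    }
    String getToken() { return token; }
    int getPosition() { return position; }
}

final class SentenceEvent extends Event {}

// Hypothetical stage: drops stopword tokens, forwards every other event unchanged.
class StopwordStage implements EventPipeline {
    private final EventPipeline next;
    private final Set<String> stopwords;
    StopwordStage(EventPipeline next, Set<String> stopwords) {
        this.next = next;
        this.stopwords = stopwords;
    }
    public void processEvent(Event e) {
        if (e instanceof TokenEvent && stopwords.contains(((TokenEvent) e).getToken())) {
            return; // swallow the stopword
        }
        next.processEvent(e); // event types this stage does not know pass straight through
    }
}

// Hypothetical terminal stage that keeps the events it receives.
class ListSink implements EventPipeline {
    final List<Event> received = new ArrayList<>();
    public void processEvent(Event e) { received.add(e); }
}
```

The stage needs only one method, and a new Event subclass can be introduced later without touching it. The cost is the per-token allocation described above.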

Can we have a discussion about which proposal is preferred? And any merits or disadvantages of either that I have missed. Which do people prefer, and does it cover all of their use cases?

Giovanni Stilo added a comment - 13/Feb/09 8:42 PM

Craig, I have some trouble following you in the second case.
Could you explain more clearly what will generate the events, and so on?
Thanks
G.

Craig Macdonald added a comment - 13/Feb/09 11:05 PM

I hadn't thought about what would generate the event. Currently, the Indexer takes terms from the Documents, and passes these down the Term Pipeline. The rough algorithm in the current Indexer looks like:

for(Document document : collection)
{
 String token = null;
 while((token = document.getNextTerm()) != null)
 {
  termpipeline.processTerm(token);
 }
}

In the first instance, the Indexer could also generate the events. In this case it would look like:

for(Document document : collection)
{
 eventpipeline.processEvent(new StartDocumentEvent(document.getProperties()));
 String token = null;
 int counter = 0;
 while((token = document.getNextTerm()) != null)
 {
   eventpipeline.processEvent(new TokenEvent(token, counter++, document.getFields()));
 }
 eventpipeline.processEvent(new EndDocumentEvent(document.getProperties()));
}

This would be a suitable approach for the initial phase of implementation.

However, I would like the Document object to generate its own events, giving it full control over the events that it generates.

for(Document document : collection)
{
  eventpipeline.processEvent(new StartDocumentEvent(document.getProperties()));
  document.tokenise(eventpipeline);
  eventpipeline.processEvent(new EndDocumentEvent(document.getProperties()));
}
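
A minimal sketch of a document that pushes its own events, assuming a tokenise(EventPipeline) method as in the loop above. PlainTextDocument, TraceSink and the naive "a token ending in a full stop closes a sentence" rule are hypothetical and are there only to show that the document decides which events to emit.

```java
import java.util.ArrayList;
import java.util.List;

interface EventPipeline {
    void processEvent(Event e);
}

abstract class Event {}

final class TokenEvent extends Event {
    final String token;
    final int position;
    TokenEvent(String token, int position) {
        this.token = token;
        this.position = position;
    }
}

final class SentenceEvent extends Event {}

// Hypothetical document that generates its own token and sentence events.
class PlainTextDocument {
    private final String text;
    PlainTextDocument(String text) { this.text = text; }

    void tokenise(EventPipeline pipeline) {
        int position = 0;
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) continue;
            // Naive rule (an assumption): a trailing full stop ends the sentence.
            boolean endsSentence = raw.endsWith(".");
            String token = endsSentence ? raw.substring(0, raw.length() - 1) : raw;
            if (!token.isEmpty()) {
                pipeline.processEvent(new TokenEvent(token, position++));
            }
            if (endsSentence) {
                pipeline.processEvent(new SentenceEvent());
            }
        }
    }
}

// Hypothetical sink that records a readable trace of the events.
class TraceSink implements EventPipeline {
    final List<String> trace = new ArrayList<>();
    public void processEvent(Event e) {
        if (e instanceof TokenEvent) {
            TokenEvent t = (TokenEvent) e;
            trace.add(t.token + "@" + t.position);
        } else if (e instanceof SentenceEvent) {
            trace.add("<s>");
        }
    }
}
```

With this shape the Indexer never needs to know about sentences; a different Document implementation could emit field or language events in the same way.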
Giovanni Stilo added a comment - 15/Feb/09 8:52 PM

Have you considered a design like SAX?
In a SAX-like model, the "Document" itself generates the events.
I mean that it generates only generic Events, which pass through a generic Pipeline made up of
event processors. Each processor can choose whether or not to process an event.
But to reach a more flexible design, I think it is important to introduce the concept of a context, where each processor can write and read all of the information.
Then, at some point, there is a context processor that would probably perform the writing operation, or could perform more complex operations that consider the whole Context at once.
In this view, you could think of the context as a transposition of a document, but richer, because it stores all of the processed information.
The context processor would then also be a generic component of the pipeline.
What remains to be investigated is how to manage the pipeline; at the moment it is a push pipeline.
You could consider having a manager for the pipeline that controls the flow.

Craig Macdonald added a comment - 16/Feb/09 12:49 PM

SAX has methods like:

public void startElement (String uri, String name, String qName, Attributes atts);
public void endElement (String uri, String name, String qName);
public void characters (char ch[], int start, int length);
//etc

I don't think we want to get into the situation where we have a method for every type of event. That isn't extensible.

I like the push pipeline that we have atm, as this means that there is not any external process managing the control. Each pipeline component can control what the next one is. This allows conditional branches, etc. If another process handled it, I guess that this would be more difficult.
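
A minimal sketch of the kind of conditional branch a push pipeline allows, where a component picks its own successor. LanguageEvent, LanguageRouter and TokenSink are hypothetical names chosen for illustration, echoing the language-specific stemming use case from the issue description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

interface EventPipeline {
    void processEvent(Event e);
}

abstract class Event {}

final class TokenEvent extends Event {
    final String token;
    TokenEvent(String token) { this.token = token; }
}

// Hypothetical event announcing the language of the tokens that follow.
final class LanguageEvent extends Event {
    final String language;
    LanguageEvent(String language) { this.language = language; }
}

// Hypothetical stage that chooses its successor itself; no external manager is involved.
class LanguageRouter implements EventPipeline {
    private final Map<String, EventPipeline> branches;
    private final EventPipeline fallback;
    private EventPipeline current;

    LanguageRouter(Map<String, EventPipeline> branches, EventPipeline fallback) {
        this.branches = branches;
        this.fallback = fallback;
        this.current = fallback;
    }

    public void processEvent(Event e) {
        if (e instanceof LanguageEvent) {
            // Switch branch; unknown languages go to the fallback branch.
            current = branches.getOrDefault(((LanguageEvent) e).language, fallback);
        }
        current.processEvent(e);
    }
}

// Hypothetical terminal stage standing in for a language-specific sub-pipeline.
class TokenSink implements EventPipeline {
    final List<String> tokens = new ArrayList<>();
    public void processEvent(Event e) {
        if (e instanceof TokenEvent) tokens.add(((TokenEvent) e).token);
    }
}
```

Each branch would normally start with the stemmer for that language; the router is just another pipeline object, so the control flow stays inside the chain.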

Stilo, can you give some code examples of what you are proposing? It seems quite similar to option [2] above, which I prefer, with the exception of having a method for every event type.

Giovanni Stilo added a comment - 16/Feb/09 2:10 PM

Craig, you are referring to the handler functions, which in our model are the pipeline objects.
I am only talking about the "parser" that will generate generic Events.
The pipeline will then process each Event, and every pipeline object will
choose whether or not to act on it.

Craig Macdonald added a comment - 16/Feb/09 2:45 PM

Ok, so we're agreed on how to handle the events, just not on what produces them: the Document object itself, or something reading the Document object?
One option is to allow both. The Indexer wraps the Document object in a class which produces the events, if the Document does not itself support producing them.

interface EventProducer 
{
 public void produce(EventPipeline first);
}

In Indexer:
{
 if (d instanceof EventProducer)
 {
  ((EventProducer)d).produce(firstPipeline);
 }
 else
 {
  defaultEventProducer.produce(d, firstPipeline);
 }
 }
}
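
A minimal, self-contained sketch of the "allow both" option. EventProducer follows the interface above; the one-method Document interface, DefaultEventProducer, ListDocument, SelfProducingDocument and TokenSink are simplified, hypothetical stand-ins rather than Terrier's real classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

interface EventPipeline { void processEvent(Event e); }

abstract class Event {}

final class TokenEvent extends Event {
    final String token;
    final int position;
    TokenEvent(String token, int position) { this.token = token; this.position = position; }
}

// Simplified, hypothetical view of a pull-style document.
interface Document { String getNextTerm(); }

interface EventProducer { void produce(EventPipeline first); }

// Wraps a pull-style Document and emits one TokenEvent per term.
class DefaultEventProducer {
    void produce(Document d, EventPipeline first) {
        String term;
        int position = 0;
        while ((term = d.getNextTerm()) != null) {
            first.processEvent(new TokenEvent(term, position++));
        }
    }
}

class Indexer {
    private final DefaultEventProducer defaultEventProducer = new DefaultEventProducer();

    void index(Document d, EventPipeline first) {
        if (d instanceof EventProducer) {
            ((EventProducer) d).produce(first);     // document drives its own events
        } else {
            defaultEventProducer.produce(d, first); // legacy document gets wrapped
        }
    }
}

// Hypothetical legacy document backed by a fixed term list.
class ListDocument implements Document {
    private final Iterator<String> terms;
    ListDocument(String... terms) { this.terms = Arrays.asList(terms).iterator(); }
    public String getNextTerm() { return terms.hasNext() ? terms.next() : null; }
}

// Hypothetical document that knows how to produce its own events.
class SelfProducingDocument implements Document, EventProducer {
    public String getNextTerm() { return null; }
    public void produce(EventPipeline first) { first.processEvent(new TokenEvent("own", 0)); }
}

// Hypothetical terminal stage that records token text.
class TokenSink implements EventPipeline {
    final List<String> tokens = new ArrayList<>();
    public void processEvent(Event e) {
        if (e instanceof TokenEvent) tokens.add(((TokenEvent) e).token);
    }
}
```

Existing Document implementations keep working unchanged through the wrapper, while new ones can opt in by implementing EventProducer.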
Giovanni Stilo added a comment - 16/Feb/09 3:40 PM

I think it should be something more flexible, such as:

interface EventProducer
{
 public Event produceNextEvent();
}

In Indexer:
{
 Pipeline p = new Pipeline();
 Event e;
 Context c;

 for (Document d : collection) {
  if (d instanceof EventProducer)
  {
   EventProducer producer = (EventProducer) d;
   c = new Context();
   c.addDocument(d);
   while ((e = producer.produceNextEvent()) != null) {
     c.addEvent(e);
   }
   p.process(c);
  }
 }
}

You can then choose how to manage the pipeline, but that is another problem.
The code above is just an example to give an idea; it isn't meant to be the definitive design.
At this point I haven't got THE SOLUTION; I'm just brainstorming, for my own benefit as well.

Craig Macdonald added a comment - 17/Feb/09 11:49 AM

Advantage: So the context object adds the ability for a given pipeline phase to look forward and backward in the pipe?

I'm worried that this will increase memory requirements, as then all of a document has to be in memory (e.g. 3 objects for each of 100,000 tokens). This is a higher memory requirement than currently, where we are only incrementing counters for each term (cf DocumentPostingList).

Moreover, the event pipeline can already look forwards and backwards by buffering events. I have implementations which do this already.
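
A minimal sketch of look-ahead by buffering inside a push pipeline, without a document-wide Context. SentenceBufferStage and CountingSink are hypothetical; this is not the existing implementation mentioned above, only an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.List;

interface EventPipeline {
    void processEvent(Event e);
}

abstract class Event {}

final class TokenEvent extends Event {
    final String token;
    TokenEvent(String token) { this.token = token; }
}

final class SentenceEvent extends Event {}

// Hypothetical stage that holds back tokens until the sentence is complete.
class SentenceBufferStage implements EventPipeline {
    private final EventPipeline next;
    private final List<TokenEvent> buffer = new ArrayList<>();

    SentenceBufferStage(EventPipeline next) { this.next = next; }

    public void processEvent(Event e) {
        if (e instanceof TokenEvent) {
            buffer.add((TokenEvent) e); // retained after this call returns
            return;
        }
        if (e instanceof SentenceEvent) {
            // The whole sentence is visible here, so the stage could inspect
            // neighbouring tokens before releasing them downstream.
            for (TokenEvent t : buffer) {
                next.processEvent(t);
            }
            buffer.clear();
        }
        next.processEvent(e);
    }
}

// Hypothetical terminal stage that counts the events it receives.
class CountingSink implements EventPipeline {
    int received = 0;
    public void processEvent(Event e) { received++; }
}
```

Memory is bounded by the longest sentence rather than the whole document. The sketch also shows why pooling event objects is risky: the stage still holds references after processEvent() has returned. A real stage would additionally need to flush its buffer at the end of a document, which is omitted here.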

Giovanni Stilo added a comment - 17/Feb/09 12:02 PM

Yes.
You would probably have to reuse objects, and you don't need to have everything in memory,
especially if you consider one context for each document.
You can then think of the context as a kind of buffering strategy.
In the end, I think you could still use Terrier as it is; why do you need to change it?

Craig Macdonald added a comment - 17/Feb/09 6:52 PM

"Yes.
You would probably have to reuse objects, and you don't need to have everything in memory,
especially if you consider one context for each document."

I'm unclear here - are you suggesting that Context could swap events to disk for very large documents?

"In the end, I think you could still use Terrier as it is; why do you need to change it?"

I like the Terrier model at present, but it does need to evolve. I think that much is clear, from both Gianni's and my presentations in Rome, and the motivations in the original post for this issue. Any use of the current model to address the existing problem results in non-standard code, whereas with careful thought we could have an improved model and easy code reuse between applications.

I'm trying to pursue one of two evolutions to the current model, rather than a revolution. However, it's good to discuss such changes to make sure we are evolving in the correct manner.

Craig Macdonald added a comment - 01/Apr/11 3:18 PM

For the time being, TR-106 deals with the most salient point of this, the reset().
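
A minimal sketch of what a reset() hook buys a token-only stage. The ResettableStage interface and its boolean return value are assumptions made for illustration, not the signature introduced by TR-106; OncePerDocumentStage and CollectingStage are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical token-only stage with a lifecycle hook (signature is an assumption).
interface ResettableStage {
    void processTerm(String term);
    boolean reset(); // true if this stage and its successors were reset
}

// Passes each distinct term once per document; needs reset() to forget the last document.
class OncePerDocumentStage implements ResettableStage {
    private final ResettableStage next;
    private final Set<String> seen = new HashSet<>();

    OncePerDocumentStage(ResettableStage next) { this.next = next; }

    public void processTerm(String term) {
        if (seen.add(term)) { // add() returns false for a term already seen
            next.processTerm(term);
        }
    }

    public boolean reset() {
        seen.clear();
        return next.reset(); // propagate down the chain
    }
}

// Terminal stage that records terms and counts resets.
class CollectingStage implements ResettableStage {
    final List<String> terms = new ArrayList<>();
    int resets = 0;
    public void processTerm(String term) { terms.add(term); }
    public boolean reset() { resets++; return true; }
}
```

The caller would invoke reset() on the first stage at each document (or query) boundary. That gives stages a boundary signal, though still none of the associated state such as the document name.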


People

Dates

  • Created: 11/Feb/09 3:27 PM
  • Updated: 01/Apr/11 3:18 PM
  • Resolved: 01/Apr/11 3:18 PM