Terrier Core

Iterable postings

Details

  • Type: New Feature New Feature
  • Status: Resolved Resolved
  • Priority: Major Major
  • Resolution: Duplicate
  • Affects Version/s: None
  • Fix Version/s: 2.2.1
  • Component/s: None
  • Description:
    Hide
    Terrier currently loads all postings for a given term into memory. This is very space in-efficient, and the need to find large amounts of memory can slow matters down. We should move to an iterable posting design, where the posting list is not fully decompressed at once.
    Show
    Terrier currently loads all postings for a given term into memory. This is very space in-efficient, and the need to find large amounts of memory can slow matters down. We should move to an iterable posting design, where the posting list is not fully decompressed at once.
  1. TR-20.v1.patch
    (55 kB)
    Craig Macdonald
    18/Mar/09 6:36 PM

Issue Links

Activity

Hide
Craig Macdonald added a comment - 12/Mar/09 7:11 PM

Proposal:

A single posting in an inverted index will be respresented by the following interface:

public interface Posting 
{
    /** Return the document id of the current posting */
    public int getDocId();
    /** Return the frequency of the term in the current document */
    public int getFrequency();
    //usually uses the DocumentIndex, may do otherwise if doclength in posting list 
    public int getDocumentLength();
}

Note that getDocumentLength() is attached the posting. The intention is that using the current Terrier data structures, this will be accessed from the DocumentIndex. However, for very large collections, the document index is expensive to keep in memory, and hence it would be beneficial to put document statistics into the posting list.

An interface will be implemented when a Posting object is iterable, by calling the next() method. NB: java.util.Iterable will not be implemented, as we dont want to create a new Posing object for each posting in the posting list.

public interface IterablePosting extends Closable, Posting
{
    //stuff to move forward in a posting list or close it. Could separate this to a separate interface? */
    /** move to the next document.
      * return false iff end of posting list has been met */
    public boolean next() throws IOException;
}

In the case where posting lists are sorted by docid and skipping tables are supported, the following extended interface will be supported:

public interface SkippablePosting extends IterablePosting
{
    /** move as far as desiredDocid. Stops as soon as getDocid() >= desiredDocid.
      * Use getDocid() to determine what document was moved to. Only works on docid sorted
      * postings lists.
      * @return false iff end of posting list has been met */
    public boolean next(int desiredDocid) throws IOException;
}

Posting lists can also contain block (position) information. This will be represented as another extension of Posting:

public interface BlockPosting extends Posting
{
    public int[] getPositions();
}

In the future, for TR-13, field frequency information will be encapsulated in the inverted index. We will support this by a further extension on Posting with additional field statistics.

How will these Posting classes be used? Well, currently a weighting model is only passed a frequency and a document length. To support future extensions, the WeightingModel would have the following interface:

public abstract class WeightingModel
{
    /* from the lexicon */
    public void setEntryStatistics(EntryStatistics ts);
    /* from the index */
    public void setCollectionStatistics(CollectionStatistics cs);
    /* do the actual scoring */
    public abstract double score(Posting p, double currentScore?);
    /* access to parameter settings */
    public void setRequest(Request q); //Request will give access to parameters
}
Show
Craig Macdonald added a comment - 12/Mar/09 7:11 PM Proposal: A single posting in an inverted index will be respresented by the following interface:
public interface Posting 
{
    /** Return the document id of the current posting */
    public int getDocId();
    /** Return the frequency of the term in the current document */
    public int getFrequency();
    //usually uses the DocumentIndex, may do otherwise if doclength in posting list 
    public int getDocumentLength();
}
Note that getDocumentLength() is attached the posting. The intention is that using the current Terrier data structures, this will be accessed from the DocumentIndex. However, for very large collections, the document index is expensive to keep in memory, and hence it would be beneficial to put document statistics into the posting list. An interface will be implemented when a Posting object is iterable, by calling the next() method. NB: java.util.Iterable will not be implemented, as we dont want to create a new Posing object for each posting in the posting list.
public interface IterablePosting extends Closable, Posting
{
    //stuff to move forward in a posting list or close it. Could separate this to a separate interface? */
    /** move to the next document.
      * return false iff end of posting list has been met */
    public boolean next() throws IOException;
}
In the case where posting lists are sorted by docid and skipping tables are supported, the following extended interface will be supported:
public interface SkippablePosting extends IterablePosting
{
    /** move as far as desiredDocid. Stops as soon as getDocid() >= desiredDocid.
      * Use getDocid() to determine what document was moved to. Only works on docid sorted
      * postings lists.
      * @return false iff end of posting list has been met */
    public boolean next(int desiredDocid) throws IOException;
}
Posting lists can also contain block (position) information. This will be represented as another extension of Posting:
public interface BlockPosting extends Posting
{
    public int[] getPositions();
}
In the future, for TR-13, field frequency information will be encapsulated in the inverted index. We will support this by a further extension on Posting with additional field statistics. How will these Posting classes be used? Well, currently a weighting model is only passed a frequency and a document length. To support future extensions, the WeightingModel would have the following interface:
public abstract class WeightingModel
{
    /* from the lexicon */
    public void setEntryStatistics(EntryStatistics ts);
    /* from the index */
    public void setCollectionStatistics(CollectionStatistics cs);
    /* do the actual scoring */
    public abstract double score(Posting p, double currentScore?);
    /* access to parameter settings */
    public void setRequest(Request q); //Request will give access to parameters
}
Hide
Craig Macdonald added a comment - 13/Mar/09 4:55 PM

I'm thinking of adding a void prepare() method to WeightingModel, so that each model can do any pre-calculations on the background statistics that it is likely to do time and time again.

Matching would then look like:

wmodel.setCollectionStatistics(index.getCollectionStatistics());
for (String term : query)
{
  LexiconEntry le = lexicon.getLexiconEntry(term);
  if (le == null)
    continue;
  wmodel.setEntryStatistics(le);
  wmodel.prepare();
  IterablePosting p = invertedIndex.getPostings(le);
  do{
    score[p.getDocumentId()] += wmodel.score(p);
  } while (p.next());
}

Note:
1. wmodel.prepare() is called after the statistics are prepared prior to scoring a posting list
2. a do{}while loop is used as IterablePosting starts off with the first posting loaded.

Show
Craig Macdonald added a comment - 13/Mar/09 4:55 PM I'm thinking of adding a void prepare() method to WeightingModel, so that each model can do any pre-calculations on the background statistics that it is likely to do time and time again. Matching would then look like:
wmodel.setCollectionStatistics(index.getCollectionStatistics());
for (String term : query)
{
  LexiconEntry le = lexicon.getLexiconEntry(term);
  if (le == null)
    continue;
  wmodel.setEntryStatistics(le);
  wmodel.prepare();
  IterablePosting p = invertedIndex.getPostings(le);
  do{
    score[p.getDocumentId()] += wmodel.score(p);
  } while (p.next());
}
Note: 1. wmodel.prepare() is called after the statistics are prepared prior to scoring a posting list 2. a do{}while loop is used as IterablePosting starts off with the first posting loaded.
Hide
Craig Macdonald added a comment - 17/Mar/09 12:50 PM

Work on this patch is progressing. Work is currently in related changes to Matching, in particular in relation to the TermScoreModifiers. TSMs are permitted to alter the score of documents as they are scored for a given term. There are currently three TSMs:

  • RequiredTermModifier just checks that a term actually exists in a document
  • TermInFieldModifier checks that it exists in a given field
  • FieldScoreModifier is a primitive field weighting model

As fields are being rewritten as part of TR-13, then FieldScoreModifier will be removed. TermInFieldModifier will be dropped for the time being. This leaves RequiredTermModifier.

I am currently considering whether TSMs can be implemented as WeightingModels. Do they actually need the score to alter? If they do, then WeightingModel has to look like:

score = wmodel.score(p, score);
Show
Craig Macdonald added a comment - 17/Mar/09 12:50 PM Work on this patch is progressing. Work is currently in related changes to Matching, in particular in relation to the TermScoreModifiers. TSMs are permitted to alter the score of documents as they are scored for a given term. There are currently three TSMs:
  • RequiredTermModifier just checks that a term actually exists in a document
  • TermInFieldModifier checks that it exists in a given field
  • FieldScoreModifier is a primitive field weighting model
As fields are being rewritten as part of TR-13, then FieldScoreModifier will be removed. TermInFieldModifier will be dropped for the time being. This leaves RequiredTermModifier. I am currently considering whether TSMs can be implemented as WeightingModels. Do they actually need the score to alter? If they do, then WeightingModel has to look like:
score = wmodel.score(p, score);
Hide
Craig Macdonald added a comment - 18/Mar/09 6:36 PM

First version of a patch for this issue.

Notes:

  • TermScoreModifiers are not used (and will be removed shortly).
  • WeightingModels are added for specific query terms into MatchingQueryTerms. Matching then asks MatchingQueryTerms which WeightingModels it should apply for a given query term.
  • All end-to-end tests pass, except those involving fields. As noted above, the fields implementation in Terrier will be evolving in TR-13, which will be directly after this patch. This patch has some of the foot-work for TR-13 - e.g. the FieldPosting interface is in place.

Questions:

  1. Some more work requires to be done wrt to Manager, WeightingModels and MatchingQueryTerms. When are they instantiated etc, when should a term not use the default WeightingModel etc.
  2. We need to decide what we use: WeightingModel or Model. Either only one is required, or they have separate purposes, which is not clear from their specification or documentation. Also, how does WeightingModel and Model bear in QueryExpansion?
  3. If we move BasicIterablePosting and BlockIterablePosting out of InvertedIndex and BlockInvertedIndex, then InvertedIndex could be made to take a posting class implementation. However, if we are maintaining the existing getDocuments() methods for InvertedIndex, then this will not be possible. Discuss
  4. Relationships with other classes: DirectIndex, DirectIndexInputStream & InvertedIndexInputStreams need careful consideration. Should they all use the same posting interface?

Efficiency-based testing still required.

Show
Craig Macdonald added a comment - 18/Mar/09 6:36 PM First version of a patch for this issue. Notes:
  • TermScoreModifiers are not used (and will be removed shortly).
  • WeightingModels are added for specific query terms into MatchingQueryTerms. Matching then asks MatchingQueryTerms which WeightingModels it should apply for a given query term.
  • All end-to-end tests pass, except those involving fields. As noted above, the fields implementation in Terrier will be evolving in TR-13, which will be directly after this patch. This patch has some of the foot-work for TR-13 - e.g. the FieldPosting interface is in place.
Questions:
  1. Some more work requires to be done wrt to Manager, WeightingModels and MatchingQueryTerms. When are they instantiated etc, when should a term not use the default WeightingModel etc.
  2. We need to decide what we use: WeightingModel or Model. Either only one is required, or they have separate purposes, which is not clear from their specification or documentation. Also, how does WeightingModel and Model bear in QueryExpansion?
  3. If we move BasicIterablePosting and BlockIterablePosting out of InvertedIndex and BlockInvertedIndex, then InvertedIndex could be made to take a posting class implementation. However, if we are maintaining the existing getDocuments() methods for InvertedIndex, then this will not be possible. Discuss
  4. Relationships with other classes: DirectIndex, DirectIndexInputStream & InvertedIndexInputStreams need careful consideration. Should they all use the same posting interface?
Efficiency-based testing still required.
Hide
Craig Macdonald added a comment - 06/Apr/09 9:52 PM

Update on progress. Work on this patch integrated several other issues, such as TR-12 & TR-17. However, on efficiency testing, I have found that my working patch is noticeably slower than Terrier 2.x.

Using some profiling, I have identified that the problem is with my DocumentIndex. All statistical information about a document is wrapped in DocumentIndexEntry, which contains: document length (int), number of terms (int), direct index offset (long & byte).

Posting uses the DocumentIndex to satisfy the getDocumentLength() method as thus:

public int getDocumentLength() {
  return doi.getDocumentEntry(id).getDocumentLength();
}

The issue is that DocumentIndexEntry.readFields() has to read all information, even though all it needs is one int. Note that there is no disk IO, only decoding bytes to ints from a byte array in memory.

I'm not sure what to do here. (a) I liked DocumentIndex using an abstract DocumentIndex entry objects, however this appears too slow. It also allows generality for the fields setting (b) The direct index offsets could go to a different index structure, as they could also be kept on disk.

Show
Craig Macdonald added a comment - 06/Apr/09 9:52 PM Update on progress. Work on this patch integrated several other issues, such as TR-12 & TR-17. However, on efficiency testing, I have found that my working patch is noticeably slower than Terrier 2.x. Using some profiling, I have identified that the problem is with my DocumentIndex. All statistical information about a document is wrapped in DocumentIndexEntry, which contains: document length (int), number of terms (int), direct index offset (long & byte). Posting uses the DocumentIndex to satisfy the getDocumentLength() method as thus:
public int getDocumentLength() {
  return doi.getDocumentEntry(id).getDocumentLength();
}
The issue is that DocumentIndexEntry.readFields() has to read all information, even though all it needs is one int. Note that there is no disk IO, only decoding bytes to ints from a byte array in memory. I'm not sure what to do here. (a) I liked DocumentIndex using an abstract DocumentIndex entry objects, however this appears too slow. It also allows generality for the fields setting (b) The direct index offsets could go to a different index structure, as they could also be kept on disk.

People

Dates

  • Created:
    26/Feb/09 4:51 PM
    Updated:
    14/May/09 9:02 AM
    Resolved:
    29/Apr/09 3:39 PM