Proposal:
A single posting in an inverted index will be represented by the following interface:
public interface Posting
{
    /** Return the document id of the current posting */
    public int getDocId();
    /** Return the frequency of the term in the current document */
    public int getFrequency();
    /** Return the length of the current document */
    public int getDocumentLength();
}
Note that getDocumentLength() is attached to the posting. The intention is that, with the current Terrier data structures, the value will be read from the DocumentIndex. However, for very large collections the document index is expensive to keep in memory, and hence it would be beneficial to put document statistics into the posting list itself.
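The two options for getDocumentLength() could be sketched as follows. This is only an illustration: DocumentLengthLookup, LookupBackedPosting and InlinePosting are hypothetical names, and the lookup interface merely stands in for whatever the DocumentIndex exposes. The types are package-private so that the sketch fits in one file.

```java
// Posting as proposed above, repeated so the sketch compiles on its own.
interface Posting {
    int getDocId();
    int getFrequency();
    int getDocumentLength();
}

// Hypothetical stand-in for a docid -> document length lookup
// (the role the DocumentIndex plays in the proposal).
interface DocumentLengthLookup {
    int getDocumentLength(int docid);
}

// Option 1: the posting delegates to the in-memory lookup.
class LookupBackedPosting implements Posting {
    private final int docId;
    private final int frequency;
    private final DocumentLengthLookup lookup;

    LookupBackedPosting(int docId, int frequency, DocumentLengthLookup lookup) {
        this.docId = docId;
        this.frequency = frequency;
        this.lookup = lookup;
    }

    public int getDocId() { return docId; }
    public int getFrequency() { return frequency; }
    public int getDocumentLength() { return lookup.getDocumentLength(docId); }
}

// Option 2: the length is stored alongside the posting itself,
// so no separate document index has to be held in memory.
class InlinePosting implements Posting {
    private final int docId;
    private final int frequency;
    private final int documentLength;

    InlinePosting(int docId, int frequency, int documentLength) {
        this.docId = docId;
        this.frequency = frequency;
        this.documentLength = documentLength;
    }

    public int getDocId() { return docId; }
    public int getFrequency() { return frequency; }
    public int getDocumentLength() { return documentLength; }
}
```

Either way, callers only see Posting, so the storage decision can change without touching the code that consumes postings.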
A Posting object that can be iterated will implement the following interface, moving to the next posting each time next() is called. NB: java.util.Iterable will not be implemented, as we don't want to create a new Posting object for each posting in the posting list.
public interface IterablePosting extends Closeable, Posting
{
    /** Move to the next document.
     * @return false iff the end of the posting list has been reached */
    public boolean next() throws IOException;
}
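A minimal sketch of how a caller might consume an IterablePosting, reusing one object for the whole list. ArrayIterablePosting and PostingIterationExample are hypothetical names introduced for illustration; the array-backed implementation is an assumption, not how Terrier stores postings.

```java
import java.io.Closeable;
import java.io.IOException;

// Interfaces as proposed above, repeated so the sketch compiles on its own.
interface Posting {
    int getDocId();
    int getFrequency();
    int getDocumentLength();
}

interface IterablePosting extends Closeable, Posting {
    boolean next() throws IOException;
}

// Hypothetical in-memory implementation backed by parallel arrays.
class ArrayIterablePosting implements IterablePosting {
    private final int[] docIds;
    private final int[] freqs;
    private final int[] docLengths;
    private int current = -1; // positioned before the first posting

    ArrayIterablePosting(int[] docIds, int[] freqs, int[] docLengths) {
        this.docIds = docIds;
        this.freqs = freqs;
        this.docLengths = docLengths;
    }

    public boolean next() {
        if (current < docIds.length) current++;
        return current < docIds.length;
    }

    public int getDocId() { return docIds[current]; }
    public int getFrequency() { return freqs[current]; }
    public int getDocumentLength() { return docLengths[current]; }
    public void close() { /* nothing to release for arrays */ }
}

class PostingIterationExample {
    // Sums the term frequency over every posting; the same object is
    // reused for each posting, so nothing is allocated per document.
    static long totalFrequency(IterablePosting postings) throws IOException {
        long total = 0;
        try (IterablePosting p = postings) {
            while (p.next()) {
                total += p.getFrequency();
            }
        }
        return total;
    }
}
```

The getters are only meaningful after next() has returned true, which is the contract the loop above relies on.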
In the case where posting lists are sorted by docid and skipping tables are supported, the following extended interface will be supported:
public interface SkippablePosting extends IterablePosting
{
    /** Move forward to desiredDocid. Stops as soon as getDocId() >= desiredDocid.
     * Use getDocId() to determine which document was moved to. Only works on
     * docid-sorted posting lists.
     * @return false iff the end of the posting list has been reached */
    public boolean next(int desiredDocid) throws IOException;
}
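One use of next(int) is intersecting two docid-sorted posting lists, as in conjunctive retrieval. The sketch below is illustrative only: ArraySkippablePosting and SkipIntersectionExample are hypothetical, a linear scan stands in for a real skip table, and it assumes that next(int) leaves the cursor where it is when the current docid already satisfies the request.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Interfaces as proposed above, repeated so the sketch compiles on its own.
interface Posting {
    int getDocId();
    int getFrequency();
    int getDocumentLength();
}

interface IterablePosting extends Closeable, Posting {
    boolean next() throws IOException;
}

interface SkippablePosting extends IterablePosting {
    boolean next(int desiredDocid) throws IOException;
}

// Hypothetical array-backed list; docIds must be sorted ascending.
class ArraySkippablePosting implements SkippablePosting {
    private final int[] docIds;
    private int current = -1; // positioned before the first posting

    ArraySkippablePosting(int[] docIds) { this.docIds = docIds; }

    public boolean next() {
        if (current < docIds.length) current++;
        return current < docIds.length;
    }

    public boolean next(int desiredDocid) {
        if (current < 0) current = 0;
        // A linear scan stands in for a skip-table lookup here.
        while (current < docIds.length && docIds[current] < desiredDocid) current++;
        return current < docIds.length;
    }

    public int getDocId() { return docIds[current]; }
    public int getFrequency() { return 1; }      // placeholder value
    public int getDocumentLength() { return 0; } // placeholder value
    public void close() { }
}

class SkipIntersectionExample {
    // Returns the docids present in both lists, leapfrogging with next(int).
    static List<Integer> intersect(SkippablePosting a, SkippablePosting b) throws IOException {
        List<Integer> common = new ArrayList<>();
        boolean more = a.next() && b.next();
        while (more) {
            int da = a.getDocId();
            int db = b.getDocId();
            if (da == db) {
                common.add(da);
                more = a.next() && b.next();
            } else if (da < db) {
                more = a.next(db); // skip a forward to b's docid
            } else {
                more = b.next(da); // skip b forward to a's docid
            }
        }
        return common;
    }
}
```

With a real skip table, each next(int) call could jump over whole blocks of postings rather than visiting them one by one.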
Posting lists can also contain block (position) information. This will be represented as another extension of Posting:
public interface BlockPosting extends Posting
{
    /** Return the blocks (positions) at which the term occurs in the current document */
    public int[] getPositions();
}
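As an illustration of what position information enables, the sketch below counts how often one term is immediately followed by another in the same document, the basic step of phrase matching. SimpleBlockPosting and PhraseMatchExample are hypothetical names, and the code assumes that getPositions() returns distinct positions in ascending order, which the proposal does not state.

```java
// Interfaces as proposed above, repeated so the sketch compiles on its own.
interface Posting {
    int getDocId();
    int getFrequency();
    int getDocumentLength();
}

interface BlockPosting extends Posting {
    int[] getPositions();
}

// Hypothetical value-style implementation for a single posting.
class SimpleBlockPosting implements BlockPosting {
    private final int docId;
    private final int documentLength;
    private final int[] positions; // assumed sorted ascending, no duplicates

    SimpleBlockPosting(int docId, int documentLength, int[] positions) {
        this.docId = docId;
        this.documentLength = documentLength;
        this.positions = positions;
    }

    public int getDocId() { return docId; }
    // Assumption: one position is recorded per occurrence of the term.
    public int getFrequency() { return positions.length; }
    public int getDocumentLength() { return documentLength; }
    public int[] getPositions() { return positions; }
}

class PhraseMatchExample {
    // Counts positions p of 'first' such that 'second' occurs at p + 1.
    // Returns 0 when the two postings refer to different documents.
    static int countAdjacent(BlockPosting first, BlockPosting second) {
        if (first.getDocId() != second.getDocId()) return 0;
        int[] p = first.getPositions();
        int[] q = second.getPositions();
        int i = 0, j = 0, count = 0;
        while (i < p.length && j < q.length) {
            int wanted = p[i] + 1;
            if (q[j] == wanted) {
                count++;
                i++;
                j++;
            } else if (q[j] < wanted) {
                j++; // second term occurs too early; advance it
            } else {
                i++; // no match for this occurrence of the first term
            }
        }
        return count;
    }
}
```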
In the future, for TR-13, field frequency information will be stored in the inverted index. We will support this with a further extension of Posting that exposes the additional field statistics.
How will these Posting classes be used? Well, currently a weighting model is only passed a frequency and a document length. To support future extensions, the WeightingModel would have the following interface:
public abstract class WeightingModel
{
    /** statistics of the term, from the lexicon */
    public abstract void setEntryStatistics(EntryStatistics ts);
    /** statistics of the collection, from the index */
    public abstract void setCollectionStatistics(CollectionStatistics cs);
    /** do the actual scoring; whether currentScore should be passed in is still open */
    public abstract double score(Posting p, double currentScore);
    /** access to parameter settings */
    public abstract void setRequest(Request q);
}
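A sketch of how a concrete model might score a Posting under this interface. Everything below is an assumption made for illustration: EntryStatistics, CollectionStatistics and Request are reduced to hypothetical stand-ins with invented fields, the setters simply store their arguments, and SimpleTfIdf is a toy length-normalised TF-IDF formula rather than any weighting model shipped with Terrier.

```java
// Hypothetical stand-ins for the types named in the proposal.
class EntryStatistics {
    final int documentFrequency;
    EntryStatistics(int documentFrequency) { this.documentFrequency = documentFrequency; }
}

class CollectionStatistics {
    final int numberOfDocuments;
    final double averageDocumentLength;
    CollectionStatistics(int numberOfDocuments, double averageDocumentLength) {
        this.numberOfDocuments = numberOfDocuments;
        this.averageDocumentLength = averageDocumentLength;
    }
}

class Request { }

interface Posting {
    int getDocId();
    int getFrequency();
    int getDocumentLength();
}

abstract class WeightingModel {
    protected EntryStatistics entryStats;
    protected CollectionStatistics collStats;
    protected Request request;

    public void setEntryStatistics(EntryStatistics ts) { this.entryStats = ts; }
    public void setCollectionStatistics(CollectionStatistics cs) { this.collStats = cs; }
    public void setRequest(Request q) { this.request = q; }

    public abstract double score(Posting p, double currentScore);
}

// Toy model: adds a length-normalised tf * idf contribution to currentScore.
class SimpleTfIdf extends WeightingModel {
    public double score(Posting p, double currentScore) {
        double tf = p.getFrequency();
        double lengthRatio = p.getDocumentLength() / collStats.averageDocumentLength;
        double normalisedTf = tf / (tf + lengthRatio);
        double idf = Math.log((double) collStats.numberOfDocuments / entryStats.documentFrequency);
        return currentScore + normalisedTf * idf;
    }
}

// Minimal Posting used to exercise the model.
class FixedPosting implements Posting {
    private final int docId, frequency, documentLength;
    FixedPosting(int docId, int frequency, int documentLength) {
        this.docId = docId;
        this.frequency = frequency;
        this.documentLength = documentLength;
    }
    public int getDocId() { return docId; }
    public int getFrequency() { return frequency; }
    public int getDocumentLength() { return documentLength; }
}
```

Because score() receives the whole Posting, a model that needs positions or field statistics could check for the richer sub-interface without any change to the method signature.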