Class DocumentPostingList

  • All Implemented Interfaces:
    java.io.Serializable, org.apache.hadoop.io.Writable
    Direct Known Subclasses:
    BlockDocumentPostingList, FieldDocumentPostingList

    public class DocumentPostingList
    extends java.lang.Object
    implements org.apache.hadoop.io.Writable, java.io.Serializable
    Represents the postings of one document. Uses HashMaps internally.

    Properties:

    • indexing.avg.unique.terms.per.doc - average number of unique terms per document, used to tune the initial size of the hash maps used in this class.
    See Also:
    Serialized Form
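    The effect of the property can be pictured with a small sketch. It is not Terrier code: the class InitialSizeSketch, the fallback value and the use of java.util.Properties are assumptions made for illustration, and java.util.HashMap stands in for the Trove map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical illustration of sizing a per-document map from a property. */
final class InitialSizeSketch {
    /** Fallback used only by this sketch; not Terrier's documented default. */
    static final int HYPOTHETICAL_DEFAULT = 100;

    /** Reads the expected number of unique terms per document, or the fallback. */
    static int averageUniqueTerms(Properties props) {
        String raw = props.getProperty("indexing.avg.unique.terms.per.doc");
        return raw == null ? HYPOTHETICAL_DEFAULT : Integer.parseInt(raw.trim());
    }

    /** Pre-sizes the term -> tf map so that typical documents need fewer rehashes. */
    static Map<String, Integer> newOccurrencesMap(Properties props) {
        return new HashMap<>(averageUniqueTerms(props));
    }
}
```

    A value close to the real average keeps the map from resizing repeatedly on typical documents without over-allocating for short ones.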
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static int AVG_DOCUMENT_UNIQUE_TERMS
      Average number of unique terms per document, used to tune the initial size of the hash maps used in this class.
      protected int documentLength
      length of the document so far.
      protected gnu.trove.TObjectIntHashMap<java.lang.String> occurrences
      Mapping from term to term frequency (tf).
    • Constructor Summary

      Constructors 
      Constructor Description
      DocumentPostingList()
      Create a new DocumentPostingList object
    • Method Summary

      Modifier and Type Method Description
      void clear()
      Removes all postings from this document
      void forEachTerm​(gnu.trove.TObjectIntProcedure<java.lang.String> proc)
      Executes the specified procedure for each term.
      int getDocumentLength()
      Returns the total number of tokens in this document
      DocumentIndexEntry getDocumentStatistics()
      Return a DocumentIndexEntry for this document
      int getFrequency​(java.lang.String term)
      Return the frequency of the specified term in this document
      int getNumberOfPointers()
      Returns the number of unique terms in this document.
      int[][] getPostings​(TermCodes termCodes)
      Returns the postings suitable to be written into the direct index.
      IterablePosting getPostings2​(TermCodes termCodes)
      Returns a posting iterator suitable to be written into the direct index.
      void insert​(int tf, java.lang.String term)
      Insert a term into the posting list of this document
      void insert​(java.lang.String term)
      Insert a term into the posting list of this document
      protected IterablePosting makePostingIterator​(java.lang.String[] _terms, int[] termIds)  
      void readFields​(java.io.DataInput in)  
      java.lang.String[] termSet()
      Returns all terms in this posting list
      void write​(java.io.DataOutput out)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • AVG_DOCUMENT_UNIQUE_TERMS

        protected static final int AVG_DOCUMENT_UNIQUE_TERMS
        Average number of unique terms per document, used to tune the initial size of the hash maps used in this class.
      • documentLength

        protected int documentLength
        length of the document so far. Sum of the term frequencies inserted so far.
      • occurrences

        protected final gnu.trove.TObjectIntHashMap<java.lang.String> occurrences
        Mapping from term to term frequency (tf).
    • Constructor Detail

      • DocumentPostingList

        public DocumentPostingList()
        Create a new DocumentPostingList object
    • Method Detail

      • termSet

        public java.lang.String[] termSet()
        Returns all terms in this posting list
      • getFrequency

        public int getFrequency​(java.lang.String term)
        Return the frequency of the specified term in this document
      • clear

        public void clear()
        Removes all postings from this document
      • getDocumentLength

        public int getDocumentLength()
        Returns the total number of tokens in this document
      • getNumberOfPointers

        public int getNumberOfPointers()
        Returns the number of unique terms in this document.
      • insert

        public void insert​(java.lang.String term)
        Insert a term into the posting list of this document
        Parameters:
        term - the Term being inserted
      • insert

        public void insert​(int tf,
                           java.lang.String term)
        Insert a term into the posting list of this document
        Parameters:
        tf - frequency
        term - the Term being inserted
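        The bookkeeping described for insert, getFrequency, getDocumentLength and getNumberOfPointers can be illustrated with a minimal sketch. This is not Terrier's implementation: PostingListSketch is a hypothetical class, java.util.HashMap stands in for the Trove map, and returning 0 for an unseen term and resetting the length in clear() are assumptions of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for DocumentPostingList; not Terrier code. */
final class PostingListSketch {
    /** Term -> term frequency (plays the role of the occurrences field). */
    private final Map<String, Integer> occurrences = new HashMap<>();
    /** Sum of the term frequencies inserted so far. */
    private int documentLength = 0;

    /** Records one occurrence of the term. */
    void insert(String term) {
        insert(1, term);
    }

    /** Records tf occurrences of the term. */
    void insert(int tf, String term) {
        occurrences.merge(term, tf, Integer::sum);
        documentLength += tf;
    }

    /** Frequency of the term; 0 for an unseen term (assumption of this sketch). */
    int getFrequency(String term) {
        return occurrences.getOrDefault(term, 0);
    }

    /** Total number of tokens inserted. */
    int getDocumentLength() {
        return documentLength;
    }

    /** Number of unique terms inserted. */
    int getNumberOfPointers() {
        return occurrences.size();
    }

    /** Forgets all postings; resetting the length is an assumption of this sketch. */
    void clear() {
        occurrences.clear();
        documentLength = 0;
    }
}
```

        Inserting "a", then "b" with tf 2, then "a" again gives a document length of 4 spread over 2 unique terms, which is the distinction between getDocumentLength() and getNumberOfPointers().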
      • getDocumentStatistics

        public DocumentIndexEntry getDocumentStatistics()
        Return a DocumentIndexEntry for this document
      • forEachTerm

        public void forEachTerm​(gnu.trove.TObjectIntProcedure<java.lang.String> proc)
        Executes the specified procedure for each term.
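        The callback style used by forEachTerm can be sketched as follows. In Trove, an object/int procedure returns a boolean that tells the map whether to keep iterating; the TermProcedure interface and ForEachTermSketch class below are hypothetical stand-ins built on java.util, not the Trove or Terrier types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical analogue of a Trove-style (object, int) procedure. */
interface TermProcedure {
    /** Returns false to stop the iteration early. */
    boolean execute(String term, int tf);
}

final class ForEachTermSketch {
    /** Calls proc once per (term, tf) entry until it returns false. */
    static void forEachTerm(Map<String, Integer> occurrences, TermProcedure proc) {
        for (Map.Entry<String, Integer> e : occurrences.entrySet()) {
            if (!proc.execute(e.getKey(), e.getValue())) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> occurrences = new LinkedHashMap<>();
        occurrences.put("terrier", 2);
        occurrences.put("index", 1);
        // Print every term with its frequency; returning true keeps iterating.
        forEachTerm(occurrences, (term, tf) -> {
            System.out.println(term + " -> " + tf);
            return true;
        });
    }
}
```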
      • getPostings

        public int[][] getPostings​(TermCodes termCodes)
        Returns the postings suitable to be written into the direct index. Term ids are assigned during this call.
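        As a rough illustration of turning a term-to-frequency map into an int[][], the sketch below assigns ids on first sight and emits two parallel rows. Everything here is an assumption: the layout (row 0 holds term ids, row 1 holds frequencies), the ascending id order, and the CodeTable class, which is a hypothetical stand-in for TermCodes rather than its real API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class GetPostingsSketch {
    /** Hypothetical stand-in for TermCodes: unseen terms get the next free id. */
    static final class CodeTable {
        private final Map<String, Integer> codes = new HashMap<>();

        int getCode(String term) {
            Integer code = codes.get(term);
            if (code == null) {
                code = codes.size();
                codes.put(term, code);
            }
            return code;
        }
    }

    /** Returns {ids, tfs} as parallel arrays sorted by ascending id (assumed layout). */
    static int[][] getPostings(Map<String, Integer> occurrences, CodeTable table) {
        // Ids are assigned while the postings are built; TreeMap keeps them sorted.
        TreeMap<Integer, Integer> byId = new TreeMap<>();
        for (Map.Entry<String, Integer> e : occurrences.entrySet()) {
            byId.put(table.getCode(e.getKey()), e.getValue());
        }
        int[][] postings = new int[2][byId.size()];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : byId.entrySet()) {
            postings[0][i] = e.getKey();
            postings[1][i] = e.getValue();
            i++;
        }
        return postings;
    }
}
```

        Sorting by id is a common choice for direct-index postings because ascending ids can be gap-encoded compactly.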
      • getPostings2

        public IterablePosting getPostings2​(TermCodes termCodes)
        Returns a posting iterator suitable to be written into the direct index. Term ids are assigned during this call, using the getTermId() method.
      • makePostingIterator

        protected IterablePosting makePostingIterator​(java.lang.String[] _terms,
                                                      int[] termIds)
      • readFields

        public void readFields​(java.io.DataInput in)
                        throws java.io.IOException
        Specified by:
        readFields in interface org.apache.hadoop.io.Writable
        Throws:
        java.io.IOException
      • write

        public void write​(java.io.DataOutput out)
                   throws java.io.IOException
        Specified by:
        write in interface org.apache.hadoop.io.Writable
        Throws:
        java.io.IOException
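        The write/readFields pair follows the usual Writable contract: whatever write emits to a DataOutput, readFields must consume from a DataInput in the same order. The sketch below shows that round trip with java.io only. The wire format (an entry count followed by term and frequency pairs) and the WritableSketch class are assumptions for illustration, not the format Terrier actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class WritableSketch {
    /** Writes the entry count, then each (term, tf) pair (assumed format). */
    static void write(Map<String, Integer> occurrences, DataOutput out) throws IOException {
        out.writeInt(occurrences.size());
        for (Map.Entry<String, Integer> e : occurrences.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue());
        }
    }

    /** Reads back exactly what write produced, in the same order. */
    static Map<String, Integer> readFields(DataInput in) throws IOException {
        int size = in.readInt();
        Map<String, Integer> occurrences = new HashMap<>();
        for (int i = 0; i < size; i++) {
            String term = in.readUTF();
            int tf = in.readInt();
            occurrences.put(term, tf);
        }
        return occurrences;
    }

    /** Serialises to memory and deserialises again. */
    static Map<String, Integer> roundTrip(Map<String, Integer> occurrences) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                write(occurrences, out);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return readFields(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```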