Class InvertedIndexBuilder

  • Direct Known Subclasses:
    BlockInvertedIndexBuilder

    public class InvertedIndexBuilder
    extends java.lang.Object
    Builds an inverted index. It optionally saves term-field information as well.

    Algorithm:

    1. While there are terms left:
      1. Read M term ids from lexicon, in lexicographical order
      2. Read the occurrences of these M terms into memory from the direct file
      3. Write the occurrences of these M terms to the inverted file
    2. Rewrite the lexicon, removing block frequencies, and adding inverted file offsets
    3. Write the collection statistics

    Lexicon term selection: There are two strategies of selecting the number of terms to read from the lexicon. The trade-off here is to read a small enough number of terms into memory such that the occurrences of all those terms from the direct file can fit in memory. On the other hand, the less terms that are read implies more iterations, which is I/O expensive, as the entire direct file has to be read for every iteration.
    The three strategies are:

    • Read until an assumed amount of memory is consumed. The amount to consume is upto 80% of the free memory at startup.
    • Read a fixed number of terms on each iterations - this corresponds to the property invertedfile.processterms
    • Read a fixed number of occurrences (pointers) on each iteration. The number of pointers can be determined using the sum of frequencies of each term from the lexicon. This corresponds to the property invertedfile.processpointers.
    By default, the 2nd strategy is chosen, unless the invertedfile.processpointers has a zero value specified.

    Properties:

    • invertedfile.processterms- the number of terms to process in each iteration. Defaults to 75,000
    • invertedfile.processpointers - the number of pointers to process in each iteration. Defaults to 20,000,000
    Author:
    Craig Macdonald & Vassilis Plachouras
    • Field Detail

      • DEFAULT_LEX_SCANNER_PROP_VALUE

        protected static final java.lang.String DEFAULT_LEX_SCANNER_PROP_VALUE
        See Also:
        Constant Field Values
      • lexiconOutputStream

        protected java.lang.Class<?> lexiconOutputStream
        class to be used as a lexiconoutpustream. set by this and child classes
      • logger

        protected static final org.slf4j.Logger logger
        The logger used
      • fieldCount

        protected int fieldCount
      • useFieldInformation

        protected boolean useFieldInformation
        Indicates whether field information is used.
      • structureName

        protected java.lang.String structureName
      • numberOfPointersPerIteration

        protected long numberOfPointersPerIteration
        The number of pointers to be processed in an interation. This directly corresponds to the property invertedfile.processpointers. If this property is set and > 0, then each iteration of the inverted index creation will be done to a set number of pointers, not a set number of terms, overriding invertedfile.processterms. Default is 20000000.
      • heapusage

        protected float heapusage
      • externalParalllism

        protected int externalParalllism
      • lexScanClassName

        protected java.lang.String lexScanClassName
      • processTerms

        protected int processTerms
        The number of terms for which the inverted file is built each time. The corresponding property is invertedfile.processterms and the default value is 75000. The higher the value, the greater the requirements for memory are, but the less time it takes to invert the direct file.
    • Method Detail

      • getExternalParalllism

        public int getExternalParalllism()
      • setExternalParalllism

        public void setExternalParalllism​(int externalParalllism)
      • getLexScanner

        protected org.terrier.structures.indexing.classical.InvertedIndexBuilder.LexiconScanner getLexScanner​(java.util.Iterator<java.util.Map.Entry<java.lang.String,​LexiconEntry>> lexStream,
                                                                                                              CollectionStatistics stats)
                                                                                                       throws java.lang.Exception
        Throws:
        java.lang.Exception
      • close

        public void close()
                   throws java.io.IOException
        Closes the underlying bit file.
        Throws:
        java.io.IOException
      • createInvertedIndex

        public void createInvertedIndex()
        Creates the inverted index using the already created direct index, document index and lexicon.
      • createPointerForTerm

        protected gnu.trove.TIntArrayList[] createPointerForTerm​(LexiconEntry le)
      • traverseDirectFile

        protected void traverseDirectFile​(gnu.trove.TIntIntHashMap codesHashMap,
                                          gnu.trove.TIntArrayList[][] tmpStorage)
                                   throws java.io.IOException
        Traverses the direct index and creates the inverted index entries for the terms specified in the codesHashMap and tmpStorage.
        Parameters:
        tmpStorage - TIntArrayList[][] an array of the inverted index entries to store
        codesHashMap - a mapping from the term identifiers to the index in the tmpStorage matrix.
        Throws:
        java.io.IOException - if there is a problem while traversing the direct index.
      • writeInvertedFilePart

        protected long[] writeInvertedFilePart​(java.io.DataOutputStream dos,
                                               gnu.trove.TIntArrayList[][] tmpStorage,
                                               int _processTerms)
                                        throws java.io.IOException
        Writes the section of the inverted file
        Parameters:
        dos - a temporary data structure that contains the offsets in the inverted index for each term.
        tmpStorage - Occurrences information, as described in traverseDirectFile(). This data is consumed by this method - once this method has been called, all the data in tmpStorage will be destroyed.
        _processTerms - The number of terms being processed in this iteration.
        Returns:
        the number of tokens processed in this iteration and the number of temporary bytes of RAM required
        Throws:
        java.io.IOException
      • displayMemoryUsage

        public static void displayMemoryUsage​(java.lang.Runtime r)
        display memory usage
        Parameters:
        r -
      • getLexOutputStream

        protected LexiconOutputStream<java.lang.String> getLexOutputStream​(java.lang.String _structureName)
                                                                    throws java.io.IOException
        get LexiconOutputStream
        Parameters:
        _structureName -
        Throws:
        java.io.IOException
      • main

        public static void main​(java.lang.String[] args)
                         throws java.lang.Exception
        utility method that allows creation of an inverted index from a direct index
        Throws:
        java.lang.Exception