Class BasicIndexer

  • Direct Known Subclasses:
    BasicSinglePassIndexer

    public class BasicIndexer
    extends Indexer
    BasicIndexer is the default indexer for Terrier. It takes the terms from each Document object provided by the collection and adds them to temporary lexicons and to the DirectFile. The documentIndex is updated with pointers into the DirectFile. The temporary lexicons are then merged into the main lexicon. Inverted index construction takes place as a second step.
    Properties:
    Author:
    Craig Macdonald & Vassilis Plachouras
    See Also:
    Indexer, BlockIndexer
    • Field Detail

      • termFields

        protected java.util.Set<java.lang.String> termFields
        Stores the fields in which a term appears.
      • termsInDocument

        protected DocumentPostingList termsInDocument
        The structure that holds the terms found in a document.
      • termCodes

        protected TermCodes termCodes
        Mapping of terms to term ids.
      • numOfTokensInDocument

        protected int numOfTokensInDocument
        The number of tokens found in the current document so far.
    • Constructor Detail

      • BasicIndexer

        protected BasicIndexer​(long a,
                               long b,
                               long c)
        Protected do-nothing constructor for use by child classes. Classes which use this constructor must call init().
      • BasicIndexer

        public BasicIndexer​(java.lang.String path,
                            java.lang.String prefix)
        Constructs an instance of a BasicIndexer, using the given path name for storing the data structures.
        Parameters:
        path - String the path where the data structures will be created. This is assumed to be absolute.
        prefix - String the filename component of the data structures
    • Method Detail

      • getEndOfPipeline

        protected TermPipeline getEndOfPipeline()
        Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor, or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
        Specified by:
        getEndOfPipeline in class Indexer
        Returns:
        TermPipeline the end of the term pipeline.
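
        To illustrate what "the end of the term pipeline" means, the following is a minimal sketch of a chained pipeline in which each stage forwards terms to the next and the final stage collects whatever survives. All names here (ToyTermPipeline, LowercaseStage, StopwordStage, CollectingEnd, ToyPipelineDemo) are hypothetical stand-ins invented for this example; they are not Terrier classes, and the real BasicTermProcessor and FieldTermProcessor may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy pipeline stage: a hypothetical stand-in, not Terrier's TermPipeline. */
interface ToyTermPipeline {
    void processTerm(String term);
}

/** Lowercases each term and forwards it to the next stage. */
final class LowercaseStage implements ToyTermPipeline {
    private final ToyTermPipeline next;

    LowercaseStage(ToyTermPipeline next) {
        this.next = next;
    }

    @Override
    public void processTerm(String term) {
        next.processTerm(term.toLowerCase(Locale.ROOT));
    }
}

/** Drops stopwords and forwards every other term. */
final class StopwordStage implements ToyTermPipeline {
    private final Set<String> stopwords;
    private final ToyTermPipeline next;

    StopwordStage(Set<String> stopwords, ToyTermPipeline next) {
        this.stopwords = stopwords;
        this.next = next;
    }

    @Override
    public void processTerm(String term) {
        if (!stopwords.contains(term)) {
            next.processTerm(term);
        }
    }
}

/** End of the pipeline: records the terms that reached it. */
final class CollectingEnd implements ToyTermPipeline {
    final List<String> terms = new ArrayList<>();

    @Override
    public void processTerm(String term) {
        terms.add(term);
    }
}

final class ToyPipelineDemo {
    /** Pushes the tokens through lowercase -> stopword filter -> collector. */
    static List<String> run(String... tokens) {
        CollectingEnd end = new CollectingEnd();
        ToyTermPipeline head =
                new LowercaseStage(new StopwordStage(Set.of("the", "of"), end));
        for (String token : tokens) {
            head.processTerm(token);
        }
        return end.terms;
    }

    public static void main(String[] args) {
        System.out.println(run("The", "Index", "of", "Terms")); // prints [index, terms]
    }
}
```

        In this sketch the indexer would only ever see the terms recorded by the last stage, which is the role the documentation above assigns to the object returned by getEndOfPipeline().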
      • createDirectIndex

        public void createDirectIndex​(Collection[] collections)
        Creates the direct index, the document index and the lexicon. Loops through each document in each of the collections, extracting terms and pushing them through the term pipeline (e.g. stemming, stopword removal, lowercasing).
        Specified by:
        createDirectIndex in class Indexer
        Parameters:
        collections - Collection[] the collections to be indexed.
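
        As a rough illustration of the three structures this step produces, the sketch below builds them in memory for documents that have already been tokenised. It is a toy model under simplifying assumptions (no term pipeline, no files, no temporary lexicons), and ToyDirectIndex with its members is a hypothetical name chosen for this example, not part of Terrier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory model of a lexicon, a direct index and a document index. */
final class ToyDirectIndex {
    /** Lexicon stand-in: term -> term id, assigned in order of first appearance. */
    final Map<String, Integer> termCodes = new LinkedHashMap<>();
    /** Direct index stand-in: one map per document, term id -> frequency. */
    final List<Map<Integer, Integer>> direct = new ArrayList<>();
    /** Document index stand-in: one entry per document, length in tokens. */
    final List<Integer> docLengths = new ArrayList<>();

    /** Indexes one tokenised document and returns the document id assigned to it. */
    int indexDocument(List<String> tokens) {
        Map<Integer, Integer> postings = new TreeMap<>();
        for (String token : tokens) {
            Integer termId = termCodes.get(token);
            if (termId == null) {
                termId = termCodes.size();      // next unused term id
                termCodes.put(token, termId);
            }
            postings.merge(termId, 1, Integer::sum);
        }
        direct.add(postings);
        docLengths.add(tokens.size());
        return direct.size() - 1;
    }

    public static void main(String[] args) {
        ToyDirectIndex index = new ToyDirectIndex();
        index.indexDocument(List.of("a", "b", "a"));
        index.indexDocument(List.of("b", "c"));
        System.out.println(index.termCodes);  // prints {a=0, b=1, c=2}
        System.out.println(index.direct);     // prints [{0=2, 1=1}, {1=1, 2=1}]
        System.out.println(index.docLengths); // prints [3, 2]
    }
}
```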
      • indexDocument

        protected void indexDocument​(java.util.Map<java.lang.String,​java.lang.String> docProperties,
                                     DocumentPostingList _termsInDocument)
                              throws java.lang.Exception
        Adds a document to the direct and document indexes, and its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument.
        Parameters:
        docProperties - Map<String,String> properties of the document
        _termsInDocument - DocumentPostingList the terms in the document.
        Throws:
        java.lang.Exception
      • createInvertedIndex

        public void createInvertedIndex()
        Creates the inverted index after having created the direct index, document index and lexicon.
        Specified by:
        createInvertedIndex in class Indexer
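
        The idea of building the inverted index as a second step can be sketched as a simple inversion of per-document postings into per-term postings. The example assumes a direct index represented as a list of maps (position in the list = document id, each map = term id to frequency); this representation and the ToyInverter class are assumptions made for illustration and do not reflect how Terrier stores or inverts its index.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inversion of a direct index into an inverted index. */
final class ToyInverter {
    /**
     * Turns "document id -> (term id -> frequency)" into
     * "term id -> (document id -> frequency)".
     */
    static Map<Integer, Map<Integer, Integer>> invert(List<Map<Integer, Integer>> direct) {
        Map<Integer, Map<Integer, Integer>> inverted = new TreeMap<>();
        for (int docId = 0; docId < direct.size(); docId++) {
            for (Map.Entry<Integer, Integer> posting : direct.get(docId).entrySet()) {
                inverted
                        .computeIfAbsent(posting.getKey(), termId -> new TreeMap<>())
                        .put(docId, posting.getValue());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        // Document 0 contains term 0 twice and term 1 once;
        // document 1 contains terms 1 and 2 once each.
        List<Map<Integer, Integer>> direct =
                List.of(Map.of(0, 2, 1, 1), Map.of(1, 1, 2, 1));
        System.out.println(invert(direct)); // prints {0={0=2}, 1={0=1, 1=1}, 2={1=1}}
    }
}
```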
      • createDocumentPostings

        protected void createDocumentPostings()
        Hook method that creates the right type of DocumentTree class.
      • finishedInvertedIndexBuild

        protected void finishedInvertedIndexBuild()
        Hook method, called when the inverted index is finished, i.e. when the lexicon is finished.
        Overrides:
        finishedInvertedIndexBuild in class Indexer