Class BlockIndexer


  • public class BlockIndexer
    extends Indexer
    An indexer that saves block information for the indexed terms. Block information is usually recorded in terms of relative term positions (position 1, position 2, etc.). However, since version 2.2, Terrier also supports "marker terms" during indexing, which are used to increment the block counter.

    Properties:

    • blocks.size - How many terms should be in one block. If you want to use phrasal search, this needs to be 1 (default).
    • blocks.max - Maximum number of blocks in a document. After this number of blocks, all subsequent terms will be in the same block. Defaults to 100,000.
    • block.indexing - This class should only be used if the block.indexing property is set.
    • indexing.max.encoded.documentindex.docs - The number of documents after which the DocumentIndexEncoded is dropped in favour of the DocumentIndex (on-disk implementation).
    • See Also: Properties in org.terrier.indexing.Indexer and org.terrier.indexing.BasicIndexer
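
    The interaction between blocks.size and blocks.max can be illustrated with a minimal, self-contained sketch. This is not Terrier's implementation: the class PositionBlockSketch and its method assignBlocks are hypothetical names, block ids are assumed to start at 0, and the cap is assumed to mean that at most maxBlocks distinct block ids are used.

```java
import java.util.Arrays;

/** Hypothetical sketch of position-based block assignment; not Terrier code. */
class PositionBlockSketch {

    /**
     * Assigns a block id to each token position of one document.
     * blockSize plays the role of blocks.size, maxBlocks that of blocks.max.
     * Assumption: ids are zero-based and at most maxBlocks distinct ids are used.
     */
    static int[] assignBlocks(int numTokens, int blockSize, int maxBlocks) {
        int[] blockIds = new int[numTokens];
        int blockId = 0;        // block number within the current document
        int tokensInBlock = 0;  // tokens seen so far in the current block
        for (int i = 0; i < numTokens; i++) {
            blockIds[i] = blockId;
            tokensInBlock++;
            // Open a new block once this one is full, unless the cap is reached;
            // after the cap, all remaining tokens stay in the final block.
            if (tokensInBlock >= blockSize && blockId < maxBlocks - 1) {
                blockId++;
                tokensInBlock = 0;
            }
        }
        return blockIds;
    }

    public static void main(String[] args) {
        // blocks.size = 1: every token gets its own block (needed for phrasal search).
        System.out.println(Arrays.toString(assignBlocks(5, 1, 100000)));
        // blocks.size = 2 with a cap of 3 blocks: the tail collapses into the last block.
        System.out.println(Arrays.toString(assignBlocks(7, 2, 3)));
    }
}
```

    With a block size of 1, block ids coincide with term positions, which is why phrasal search requires that setting.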

    Markered Blocks
    Markers are terms (artificially inserted into the term stream or otherwise) that denote when the block counter should be incremented. This functionality is enabled using the block.delimiters.enabled property, while the marker terms are specified as a comma-delimited list in the block.delimiters property. The following lists the properties:

    • block.delimiters.enabled - enables markered blocks. Defaults to false, set to true to enable.
    • block.delimiters - comma-delimited list of terms that are markers. Defaults to empty. Terms are lowercased if lowercase is set to true (default).
    • block.delimiters.index.terms - set to true if marker terms should actually be indexed. Defaults to false.
    • block.delimiters.index.doclength - set to true if marker terms should contribute to document length. Defaults to false, and only has an effect if block.delimiters.index.terms is set.
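
    The marker behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not Terrier's code: MarkerBlockSketch, parseDelimiters and assignBlocks are hypothetical names, block ids are assumed to start at 0, and an indexed marker is assumed to be recorded in the block it closes. Document length handling (block.delimiters.index.doclength) and the blocks.max cap are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of marker-delimited blocks; not Terrier code. */
class MarkerBlockSketch {

    /** Parses a comma-delimited marker list (as in block.delimiters), optionally lowercasing. */
    static Set<String> parseDelimiters(String commaDelimited, boolean lowercase) {
        Set<String> markers = new HashSet<>();
        for (String raw : commaDelimited.split(",")) {
            String term = raw.trim();
            if (term.isEmpty()) {
                continue; // an empty property value yields no markers
            }
            markers.add(lowercase ? term.toLowerCase(Locale.ROOT) : term);
        }
        return markers;
    }

    /**
     * Returns "term@blockId" for every term that would be indexed.
     * indexMarkers plays the role of block.delimiters.index.terms.
     */
    static List<String> assignBlocks(List<String> terms, Set<String> markers, boolean indexMarkers) {
        List<String> indexed = new ArrayList<>();
        int blockId = 0;
        for (String term : terms) {
            if (markers.contains(term)) {
                // Assumption: an indexed marker belongs to the block it terminates.
                if (indexMarkers) {
                    indexed.add(term + "@" + blockId);
                }
                blockId++; // a marker increments the block counter
            } else {
                indexed.add(term + "@" + blockId);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Set<String> markers = parseDelimiters("SENT, PARA", true);
        List<String> terms = Arrays.asList("the", "cat", "sent", "sat", "down", "para", "end");
        System.out.println(assignBlocks(terms, markers, false));
        System.out.println(assignBlocks(terms, markers, true));
    }
}
```

    Because the block counter advances only at markers, blocks in this mode can correspond to units such as sentences or paragraphs rather than to fixed-size windows of terms.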
    Author:
    Craig Macdonald, Vassilis Plachouras, Rodrygo Santos
    • Field Detail

      • numOfTokensInDocument

        protected int numOfTokensInDocument
        The number of tokens in the current document so far.
      • numOfTokensInBlock

        protected int numOfTokensInBlock
        The number of tokens in the current block of the current document.
      • blockId

        protected int blockId
        The block number of the current document.
      • termFields

        protected java.util.Set<java.lang.String> termFields
        The fields that are set for the current term.
      • termsInDocument

        protected DocumentPostingList termsInDocument
        The list of terms in this document, and for each, the block occurrences.
      • termCodes

        protected TermCodes termCodes
        Mapping of terms to term ids.
      • BLOCK_SIZE

        protected int BLOCK_SIZE
        The maximum number of terms allowed in a block. See Property blocks.size
      • MAX_BLOCKS

        protected int MAX_BLOCKS
        The maximum number of blocks allowed in a document. After this value, all the remaining terms are in the final block. See Property blocks.max.
    • Constructor Detail

      • BlockIndexer

        public BlockIndexer(java.lang.String pathname,
                            java.lang.String prefix)
        Constructs an instance of this class, where the created data structures are stored in the given path, with the given prefix on the filenames.
        Parameters:
        pathname - String the path in which the created data structures will be saved. This is assumed to be absolute.
        prefix - String the prefix on the filenames of the created data structures, usually "data"
    • Method Detail

      • getEndOfPipeline

        protected TermPipeline getEndOfPipeline()
        Returns the object that is to be the end of the TermPipeline. This method is used at construction time of the parent object.
        Specified by:
        getEndOfPipeline in class Indexer
        Returns:
        TermPipeline the last component of the term pipeline.
      • indexDocument

        protected void indexDocument(java.util.Map<java.lang.String,java.lang.String> docProperties,
                                     DocumentPostingList _termsInDocument)
                              throws java.lang.Exception
        Adds a document to the direct and document indexes, and its terms to the lexicon. Handled internally by the methods indexFieldDocument and indexNoFieldDocument.
        Parameters:
        docProperties - Map<String,String> properties of the document
        _termsInDocument - DocumentPostingList the terms in the document.
        Throws:
        java.lang.Exception
      • createInvertedIndex

        public void createInvertedIndex()
        Creates the inverted index from the already created direct index, document index and lexicon. It saves block information and possibly field information as well.
        Specified by:
        createInvertedIndex in class Indexer
        See Also:
        Indexer.createInvertedIndex()
      • finishedInvertedIndexBuild

        protected void finishedInvertedIndexBuild()
        Hook method, called when the inverted index is finished, i.e. the lexicon is finished.
        Overrides:
        finishedInvertedIndexBuild in class Indexer
      • createDocumentPostings

        protected void createDocumentPostings()