Class ExtensibleSinglePassIndexer
- java.lang.Object
- org.terrier.structures.indexing.Indexer
- org.terrier.structures.indexing.classical.BasicIndexer
- org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
- org.terrier.structures.indexing.singlepass.ExtensibleSinglePassIndexer
-
public abstract class ExtensibleSinglePassIndexer extends BasicSinglePassIndexer
Directly based on BasicSinglePassIndexer, with just a few modifications to enable some extra hooks.
- Author:
- Roi Blanco, Jonathon Hare [jsh2{a.}ecs.soton.ac.uk]
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.terrier.structures.indexing.classical.BasicIndexer
BasicIndexer.BasicTermProcessor, BasicIndexer.FieldTermProcessor
-
-
Field Summary
protected SinglePassIndexerFlushDelegate flushDelegate
Delegate for HadoopIndexerMapper to intercept flushes
-
Fields inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
basicInvertedIndexPostingIteratorClass, currentFile, currentId, docsPerCheck, fieldInvertedIndexPostingIteratorClass, fileNames, invertedIndexClass, invertedIndexInputStreamClass, maxDocsPerFlush, maxMemory, memoryAfterFlush, memoryCheck, merger, mp, numberOfDocsSinceCheck, numberOfDocsSinceFlush, numberOfDocuments, numberOfPointers, numberOfTokens, numberOfUniqueTerms, runtime
-
Fields inherited from class org.terrier.structures.indexing.classical.BasicIndexer
compressionDirectConfig, compressionInvertedConfig, numOfTokensInDocument, termCodes, termFields, termsInDocument
-
Fields inherited from class org.terrier.structures.indexing.Indexer
blocks, BUILDER_BOUNDARY_DOCUMENTS, currentIndex, directIndexBuilder, docIndexBuilder, emptyDocCount, emptyDocIndexEntry, externalParalllism, fieldNames, fileNameNoExtension, IndexEmptyDocuments, invertedIndexBuilder, lexiconBuilder, logger, MAX_DOCS_PER_BUILDER, MAX_TOKENS_IN_DOCUMENT, metaBuilder, numFields, path, pipeline_first, prefix, useFieldInformation
-
-
Constructor Summary
ExtensibleSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
Default constructor
-
Method Summary
protected abstract void createDocumentPostings()
Hook method that creates the right type of DocumentTree class.

void createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase, etc.).

protected abstract void createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.

protected void createRunMerger(java.lang.String[][] files)
Hook method that creates a RunsMerger instance

protected void forceFlush()
Force the indexer to flush everything and free memory.

Index getCurrentIndex()
Get the index currently being constructed by this indexer.

protected abstract TermPipeline getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.

protected SinglePassIndexerFlushDelegate getFlushDelegate()
Get the flushDelegate

protected abstract java.lang.Class<? extends org.terrier.structures.indexing.singlepass.PostingInRun> getPostingInRunClass()
Get the class for storing postings in runs.

protected abstract void preProcess(Document doc, java.lang.String term)
Perform an operation before the term pipeline is initiated.

protected void setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)
Set the flushDelegate
-
Methods inherited from class org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer
checkFlush, createDirectIndex, createFieldRunMerger, createInvertedIndex, finishMemoryPosting, getFileNames, indexDocument, load_indexer_properties, performMultiWayMerge
-
Methods inherited from class org.terrier.structures.indexing.classical.BasicIndexer
finishedInvertedIndexBuild
-
Methods inherited from class org.terrier.structures.indexing.Indexer
createMetaIndexBuilder, finishedDirectIndexBuild, getExternalParalllism, index, indexEmpty, init, load_builder_boundary_documents, load_field_ids, load_pipeline, main, merge, merge, mergeTwoIndices, parseInts, setExternalParalllism, useFieldInformation
-
-
-
-
Field Detail
-
flushDelegate
protected SinglePassIndexerFlushDelegate flushDelegate
Delegate for HadoopIndexerMapper to intercept flushes
-
-
Constructor Detail
-
ExtensibleSinglePassIndexer
public ExtensibleSinglePassIndexer(java.lang.String pathname, java.lang.String prefix)
Default constructor
- Parameters:
pathname - String the path where the data structures will be created. This is assumed to be absolute.
prefix - String the prefix of the index, usually "data".
-
-
Method Detail
-
getEndOfPipeline
protected abstract TermPipeline getEndOfPipeline()
Returns the end of the term pipeline, which corresponds to an instance of either BasicIndexer.BasicTermProcessor or BasicIndexer.FieldTermProcessor, depending on whether field information is stored.
- Overrides:
getEndOfPipeline in class BasicIndexer
- Returns:
- TermPipeline the end of the term pipeline.
-
getPostingInRunClass
protected abstract java.lang.Class<? extends org.terrier.structures.indexing.singlepass.PostingInRun> getPostingInRunClass()
Get the class for storing postings in runs.
- Returns:
- PostingInRun Subclass of PostingInRun for this indexer
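As a rough illustration of a hook that returns a class rather than an instance, the sketch below shows how calling code could create fresh objects from the returned `Class` token on demand. It does not use Terrier's types: `SketchPostingInRun`, `SketchSimplePosting`, `SketchRunIndexer`, `SketchSimpleIndexer`, and `newPosting()` are hypothetical stand-ins, and the reflective instantiation is an assumption about how such a class token might be used, not a description of Terrier's implementation.

```java
/** Hypothetical stand-in for a posting type; not Terrier's PostingInRun. */
abstract class SketchPostingInRun {
    abstract String describe();
}

/** One concrete posting type that a subclass might choose. */
class SketchSimplePosting extends SketchPostingInRun {
    @Override
    String describe() {
        return "simple";
    }
}

/** Hypothetical stand-in for an indexer exposing a class-returning hook. */
abstract class SketchRunIndexer {
    /** Hook: subclasses name the posting class to use. */
    protected abstract Class<? extends SketchPostingInRun> getPostingInRunClass();

    /** Assumed usage: create a new posting reflectively from the class token. */
    SketchPostingInRun newPosting() throws ReflectiveOperationException {
        return getPostingInRunClass().getDeclaredConstructor().newInstance();
    }
}

/** Example subclass that selects SketchSimplePosting. */
class SketchSimpleIndexer extends SketchRunIndexer {
    @Override
    protected Class<? extends SketchPostingInRun> getPostingInRunClass() {
        return SketchSimplePosting.class;
    }
}

class PostingClassDemo {
    public static void main(String[] args) throws ReflectiveOperationException {
        SketchPostingInRun posting = new SketchSimpleIndexer().newPosting();
        System.out.println(posting.describe());
    }
}
```

Returning a class token keeps the choice of posting type in the subclass while leaving the timing and number of instantiations to the caller.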
-
createRunMerger
protected void createRunMerger(java.lang.String[][] files) throws java.lang.Exception
Hook method that creates a RunsMerger instance
- Overrides:
createRunMerger in class BasicSinglePassIndexer
- Throws:
java.io.IOException - if an I/O error occurs.
java.lang.Exception
-
createMemoryPostings
protected abstract void createMemoryPostings()
Hook method that creates the right type of MemoryPostings class.
- Overrides:
createMemoryPostings in class BasicSinglePassIndexer
-
createDocumentPostings
protected abstract void createDocumentPostings()
Hook method that creates the right type of DocumentTree class.
- Overrides:
createDocumentPostings in class BasicIndexer
-
createInvertedIndex
public void createInvertedIndex(Collection[] collections)
Builds the inverted file and lexicon file for the given collections. Loops through each document in each of the collections, extracting terms and pushing these through the Term Pipeline (e.g. stemming, stopping, lowercase, etc.). The only change from BasicSinglePassIndexer is a pre-processing operation applied to each term before it is passed to the pipeline.
- Overrides:
createInvertedIndex in class BasicSinglePassIndexer
- Parameters:
collections - Collection[] the collections to be indexed.
-
preProcess
protected abstract void preProcess(Document doc, java.lang.String term)
Perform an operation before the term pipeline is initiated. This could, for example, extract data and store it in a field that the pipeline can access.
- Parameters:
doc - Current document
term - Current term
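The hook described above can be pictured with a small, self-contained sketch. It does not use Terrier's classes: `SketchIndexer`, `UppercaseCounter`, the string document id, and the list standing in for the term pipeline are all hypothetical. Only the idea of calling `preProcess` on each term before it enters the pipeline is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for an indexer with a pre-processing hook; not Terrier's API. */
abstract class SketchIndexer {
    /** Terms that reached the stand-in pipeline (here: simply lowercased). */
    final List<String> pipelineOutput = new ArrayList<>();

    /** Hook: called with each raw term before it is handed to the pipeline. */
    protected abstract void preProcess(String docId, String term);

    /** Walks the terms of one document, calling the hook before the pipeline step. */
    void indexDocument(String docId, String[] terms) {
        for (String term : terms) {
            preProcess(docId, term);
            pipelineOutput.add(term.toLowerCase(Locale.ROOT));
        }
    }
}

/** Example subclass: counts raw terms that begin with an upper-case letter. */
class UppercaseCounter extends SketchIndexer {
    int capitalised = 0;

    @Override
    protected void preProcess(String docId, String term) {
        // The hook sees the term before lowercasing, so case is still available.
        if (!term.isEmpty() && Character.isUpperCase(term.charAt(0))) {
            capitalised++;
        }
    }
}

class PreProcessDemo {
    public static void main(String[] args) {
        UppercaseCounter indexer = new UppercaseCounter();
        indexer.indexDocument("d1", new String[] {"Terrier", "indexes", "Documents"});
        System.out.println(indexer.capitalised + " capitalised; pipeline saw " + indexer.pipelineOutput);
    }
}
```

The point of the sketch is ordering: the hook observes information (here, capitalisation) that a later pipeline stage would discard.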
-
getCurrentIndex
public Index getCurrentIndex()
Get the index currently being constructed by this indexer. This might be null if indexing hasn't commenced yet. It is useful for adding extra properties, etc. to the index after indexing is finished.
- Returns:
- the current index
-
setFlushDelegate
protected void setFlushDelegate(SinglePassIndexerFlushDelegate _flushDelegate)
Set the flushDelegate
- Parameters:
_flushDelegate - the flushDelegate to use
-
getFlushDelegate
protected SinglePassIndexerFlushDelegate getFlushDelegate()
Get the flushDelegate
- Returns:
- the flushDelegate
-
forceFlush
protected void forceFlush() throws java.io.IOException
Force the indexer to flush everything and free memory. Either calls the super method, or passes to a delegate if the flushDelegate is set.
- Overrides:
forceFlush in class BasicSinglePassIndexer
- Throws:
java.io.IOException
- See Also:
BasicSinglePassIndexer.forceFlush()
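The "super method or delegate" behaviour described for forceFlush can be sketched as follows. This is a minimal illustration under stated assumptions, not Terrier's code: `SketchFlushDelegate`, `SketchBaseIndexer`, and `SketchDelegatingIndexer` are hypothetical stand-ins, the delegate's method name is assumed, and the counter replaces the real flushing work.

```java
import java.io.IOException;

/** Hypothetical stand-in for a flush delegate; the method name is an assumption. */
interface SketchFlushDelegate {
    void forceFlush() throws IOException;
}

/** Stand-in base class whose forceFlush represents the default flush. */
class SketchBaseIndexer {
    int defaultFlushes = 0;

    protected void forceFlush() throws IOException {
        // Placeholder for the default behaviour (flush everything and free memory).
        defaultFlushes++;
    }
}

/** Stand-in subclass that lets a delegate intercept flushes when one is set. */
class SketchDelegatingIndexer extends SketchBaseIndexer {
    /** Null means no delegate has been set. */
    protected SketchFlushDelegate flushDelegate;

    protected void setFlushDelegate(SketchFlushDelegate delegate) {
        this.flushDelegate = delegate;
    }

    @Override
    protected void forceFlush() throws IOException {
        if (flushDelegate != null) {
            flushDelegate.forceFlush(); // the delegate takes over the flush
        } else {
            super.forceFlush();         // no delegate: fall back to the default
        }
    }
}

class FlushDelegateDemo {
    public static void main(String[] args) throws IOException {
        SketchDelegatingIndexer indexer = new SketchDelegatingIndexer();
        indexer.forceFlush(); // handled by the base class
        indexer.setFlushDelegate(() -> System.out.println("delegate handled the flush"));
        indexer.forceFlush(); // handled by the delegate
        System.out.println("default flushes: " + indexer.defaultFlushes);
    }
}
```

With this arrangement, an external component can take over flushing without the indexer subclass needing to know what that component does.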
-
-