Package org.terrier.structures.indexing
Class LexiconBuilder
- java.lang.Object
-
- org.terrier.structures.indexing.LexiconBuilder
-
public class LexiconBuilder extends java.lang.Object
Builds temporary lexicons while a collection is being indexed, and merges them once the indexing of the collection has finished.
- Author:
- Craig Macdonald & Vassilis Plachouras
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
- static class LexiconBuilder.BasicLexiconCollectionStaticticsCounter: counts global statistics in the non-fields case
- static interface LexiconBuilder.CollectionStatisticsCounter: Counter of LexiconEntries
- protected static class LexiconBuilder.FieldLexiconCollectionStaticticsCounter: counts global statistics in the fields case
- protected static class LexiconBuilder.NullCollectionStatisticsCounter
-
Field Summary
Fields (Modifier and Type, Field, Description)
- protected java.lang.String defaultStructureName
- protected int DocCount: How many documents have been processed so far.
- protected static int DocumentsPerLexicon: The number of documents for which a temporary lexicon is created.
- protected IndexOnDisk index
- protected java.lang.String indexPath: The directory to write the final lexicons to.
- protected java.lang.String indexPrefix: The filename of the lexicons.
- protected java.lang.String lexiconEntryFactoryValueClass
- protected java.lang.Class<? extends LexiconOutputStream> lexiconOutputStream: Class to be used as a LexiconOutputStream.
- protected static org.slf4j.Logger logger: The logger used for this class.
- protected static int MAXLEXMERGE: Number of lexicons to merge at once.
- protected MemoryChecker memCheck
- protected static boolean MERGE2LEXATTIME: Should we only merge lexicons in pairs (Terrier 1.0.x scheme)? Set by property lexicon.builder.merge.2lex.attime
- protected LexiconMap TempLex: The lexicon tree to which the current term stream is written.
- protected int TempLexCount: How many temporary lexicons have been generated so far.
- protected java.util.LinkedList<java.lang.String> tempLexFiles: The list in which the temporary lexicon structure names are stored.
- protected TermCodes termCodes
- protected int TermCount: How many terms are in the final lexicon.
- protected FixedSizeWriteableFactory<LexiconEntry> valueFactory
-
Constructor Summary
Constructors (Constructor, Description)
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, java.lang.Class<? extends LexiconMap> _LexiconMapClass, java.lang.String _lexiconEntryClass, TermCodes termCodes): constructor
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, java.lang.String valueFactoryParamTypes, java.lang.String valueFactoryParamValues, TermCodes _termCodes): constructor
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, TermCodes termCodes): constructor
- LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, TermCodes tc): constructor
-
Method Summary
Methods (Modifier and Type, Method, Description)
- void addDocumentTerms(DocumentPostingList terms): Adds the terms of a document to the temporary lexicon in memory.
- void addTemporaryLexicon(java.lang.String structureName): Deprecated.
- void addTerm(java.lang.String term, int tf): Adds a single term to the lexicon being built.
- static void createLexiconHash(IndexOnDisk index): Deprecated. Use optimise instead.
- static void createLexiconIndex(IndexOnDisk index): Deprecated. Use optimise instead.
- void finishedDirectIndexBuild(): Processes the lexicon after the direct and document indexes have been created.
- void finishedInvertedIndexBuild(): Processes the lexicon after the inverted index has been created.
- void flush(): Forces a temporary lexicon to be flushed.
- int getFinalNumberOfTerms(): Returns the number of terms in the final lexicon.
- protected java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> getLexInputStream(java.lang.String structureName): Returns the lexicon input stream for the current index at the specified filename.
- protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String structureName): Returns the lexicon output stream for the current index at the specified filename.
- protected static LexiconMap instantiate(java.lang.Class<? extends LexiconMap> LexiconMapClass)
- void merge(java.util.LinkedList<java.lang.String> filesToMerge): Merges the intermediate lexicon files created during indexing.
- protected void mergeNLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>>[] lis, LexiconOutputStream<java.lang.String> los)
- protected void mergeTwoLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis1, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis2, LexiconOutputStream<java.lang.String> los): Merges the two LexiconInputStreams into the given LexiconOutputStream.
- protected LexiconEntry newLexiconEntry(int termid)
- static void optimise(IndexOnDisk index, java.lang.String structureName): Optimises the lexicon, e.g. the lexid file.
- void optimiseLexicon(): Optimises the lexicon.
- static void reAssignTermIds(IndexOnDisk index, java.lang.String structureName, int numEntries): Reassigns the termids within the named lexicon structure so that they ascend as term frequency descends.
- protected void writeTemporaryLexicon(): Writes the current contents of the TempLex temporary lexicon binary tree to a temporary lexicon on disk.
-
-
-
Field Detail
-
lexiconOutputStream
protected java.lang.Class<? extends LexiconOutputStream> lexiconOutputStream
Class to be used as a LexiconOutputStream. Set by this class and its child classes.
-
lexiconEntryFactoryValueClass
protected final java.lang.String lexiconEntryFactoryValueClass
-
logger
protected static final org.slf4j.Logger logger
The logger used for this class
-
DocCount
protected int DocCount
How many documents have been processed so far.
-
TermCount
protected int TermCount
How many terms are in the final lexicon
-
DocumentsPerLexicon
protected static final int DocumentsPerLexicon
The number of documents for which a temporary lexicon is created. Corresponds to property bundle.size, default value 2000.
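The relationship between bundle.size and temporary lexicons can be illustrated with a minimal sketch. It assumes that a temporary lexicon is flushed each time the document count reaches a multiple of DocumentsPerLexicon. The class BundleFlushSketch, its in-memory TreeMap and its list of flushed maps are hypothetical stand-ins for illustration, not Terrier's LexiconMap or its on-disk structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Terrier API.
class BundleFlushSketch {
    // Stand-in for DocumentsPerLexicon (property bundle.size).
    private final int documentsPerLexicon;
    // Stand-in for DocCount.
    private int docCount = 0;
    // Stand-in for TempLex: term -> accumulated frequency, sorted by term.
    private TreeMap<String, Integer> tempLex = new TreeMap<>();
    // Stand-in for temporary lexicons written to disk.
    private final List<Map<String, Integer>> flushed = new ArrayList<>();

    public BundleFlushSketch(int documentsPerLexicon) {
        if (documentsPerLexicon < 1) {
            throw new IllegalArgumentException("documentsPerLexicon must be positive");
        }
        this.documentsPerLexicon = documentsPerLexicon;
    }

    // Adds one document's term frequencies; flushes after every full bundle (assumed behaviour).
    public void addDocumentTerms(Map<String, Integer> docTerms) {
        for (Map.Entry<String, Integer> e : docTerms.entrySet()) {
            tempLex.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        docCount++;
        if (docCount % documentsPerLexicon == 0) {
            flush();
        }
    }

    // Moves the current in-memory lexicon to the flushed list and starts a new one.
    public void flush() {
        if (tempLex.isEmpty()) {
            return;
        }
        flushed.add(tempLex);
        tempLex = new TreeMap<>();
    }

    public List<Map<String, Integer>> getFlushed() {
        return flushed;
    }

    public static void main(String[] args) {
        BundleFlushSketch builder = new BundleFlushSketch(2);
        builder.addDocumentTerms(Map.of("cat", 1, "sat", 2));
        builder.addDocumentTerms(Map.of("cat", 3)); // second document completes a bundle
        builder.addDocumentTerms(Map.of("mat", 1));
        builder.flush(); // force out the partial bundle
        System.out.println(builder.getFlushed());
    }
}
```

A smaller bundle size in this sketch means more, smaller temporary lexicons and therefore more merging later; a larger one keeps more terms in memory between flushes.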
-
tempLexFiles
protected final java.util.LinkedList<java.lang.String> tempLexFiles
The list in which the temporary lexicon structure names are stored. These are merged into a single Lexicon by the merge() method. LinkedList is the best List implementation for this, as every operation either appends an element or removes the first element.
-
TempLex
protected LexiconMap TempLex
The lexicon tree to which the current term stream is written
-
termCodes
protected TermCodes termCodes
-
indexPath
protected java.lang.String indexPath
The directory to write the final lexicons to
-
indexPrefix
protected java.lang.String indexPrefix
The filename of the lexicons.
-
index
protected IndexOnDisk index
-
TempLexCount
protected int TempLexCount
How many temporary lexicons have been generated so far
-
MERGE2LEXATTIME
protected static final boolean MERGE2LEXATTIME
Should we only merge lexicons in pairs (Terrier 1.0.x scheme)? Set by property lexicon.builder.merge.2lex.attime
-
MAXLEXMERGE
protected static final int MAXLEXMERGE
Number of lexicons to merge at once. Set by property lexicon.builder.merge.lex.max, defaults to 16
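The interaction between tempLexFiles and MAXLEXMERGE can be sketched as a queue that is drained in batches. This is a minimal sketch under the assumption that merging removes up to MAXLEXMERGE names from the head of the list and appends the merged result to the tail until one lexicon remains. MergeQueueSketch and its string-based placeholder merge are hypothetical and do not read or write any lexicon files.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration only; not part of the Terrier API.
class MergeQueueSketch {
    /**
     * Repeatedly removes up to maxLexMerge names from the head of the queue,
     * replaces them with one merged name at the tail, and stops when a single
     * name remains. Returns the number of merge passes performed.
     */
    public static int mergeAll(LinkedList<String> queue, int maxLexMerge) {
        if (maxLexMerge < 2) {
            throw new IllegalArgumentException("maxLexMerge must be at least 2");
        }
        int passes = 0;
        while (queue.size() > 1) {
            int batchSize = Math.min(maxLexMerge, queue.size());
            List<String> batch = new ArrayList<>(batchSize);
            for (int i = 0; i < batchSize; i++) {
                batch.add(queue.removeFirst()); // remove first element
            }
            // Placeholder for a real merge of the batch into one lexicon.
            queue.addLast("merged(" + String.join("+", batch) + ")"); // append element
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        LinkedList<String> queue = new LinkedList<>(List.of("lex0", "lex1", "lex2", "lex3", "lex4"));
        int passes = mergeAll(queue, 3);
        System.out.println(passes + " passes, result: " + queue.getFirst());
    }
}
```

Passing 2 as maxLexMerge in this sketch corresponds to pairwise merging, which is what the MERGE2LEXATTIME description refers to as the Terrier 1.0.x scheme.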
-
defaultStructureName
protected java.lang.String defaultStructureName
-
valueFactory
protected FixedSizeWriteableFactory<LexiconEntry> valueFactory
-
memCheck
protected MemoryChecker memCheck
-
-
Constructor Detail
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, TermCodes tc)
Constructor.
- Parameters:
i -
_structureName -
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, java.lang.Class<? extends LexiconMap> _LexiconMapClass, java.lang.String _lexiconEntryClass, TermCodes termCodes)
Constructor.
- Parameters:
i -
_structureName -
_LexiconMapClass -
_lexiconEntryClass -
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, TermCodes termCodes)
Constructor.
- Parameters:
i -
_structureName -
lexiconMap -
_lexiconEntryClass -
-
LexiconBuilder
public LexiconBuilder(IndexOnDisk i, java.lang.String _structureName, LexiconMap lexiconMap, java.lang.String _lexiconEntryClass, java.lang.String valueFactoryParamTypes, java.lang.String valueFactoryParamValues, TermCodes _termCodes)
Constructor.
- Parameters:
i -
_structureName -
lexiconMap -
_lexiconEntryClass -
valueFactoryParamTypes -
valueFactoryParamValues -
-
-
Method Detail
-
instantiate
protected static LexiconMap instantiate(java.lang.Class<? extends LexiconMap> LexiconMapClass)
-
getFinalNumberOfTerms
public int getFinalNumberOfTerms()
Returns the number of terms in the final lexicon. Only updated once finishedDirectIndexBuild() has executed.
-
addTemporaryLexicon
public void addTemporaryLexicon(java.lang.String structureName)
Deprecated.
If the application code generated lexicons itself, use this method to add them to the merge list. Otherwise, do not call this method.
- Parameters:
structureName - Full path to a lexicon to merge
-
writeTemporaryLexicon
protected void writeTemporaryLexicon()
Writes the current contents of the TempLex temporary lexicon binary tree to a temporary lexicon on disk.
-
addTerm
public void addTerm(java.lang.String term, int tf)
Adds a single term to the lexicon being built.
- Parameters:
term - The String term
tf - The frequency of the term
-
addDocumentTerms
public void addDocumentTerms(DocumentPostingList terms)
Adds the terms of a document to the temporary lexicon in memory.
- Parameters:
terms - DocumentPostingList containing the terms of the document to add to the temporary lexicon
-
flush
public void flush()
Force a temporary lexicon to be flushed
-
finishedInvertedIndexBuild
public void finishedInvertedIndexBuild()
Processes the lexicon after the inverted index has been created.
-
finishedDirectIndexBuild
public void finishedDirectIndexBuild()
Processes the lexicon after the direct and document indexes have been created.
-
merge
public void merge(java.util.LinkedList<java.lang.String> filesToMerge) throws java.io.IOException
Merges the intermediate lexicon files created during indexing.
- Parameters:
filesToMerge - java.util.LinkedList containing the filenames of the temporary files.
- Throws:
java.io.IOException - thrown if an input/output problem is encountered.
-
newLexiconEntry
protected LexiconEntry newLexiconEntry(int termid)
-
mergeNLexicons
protected void mergeNLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>>[] lis, LexiconOutputStream<java.lang.String> los) throws java.io.IOException
- Throws:
java.io.IOException
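The page gives no description for mergeNLexicons, so the following is only a sketch of one common way to merge N term-sorted entry streams: a k-way merge driven by a priority queue. It assumes each input iterates in ascending term order and that entries for the same term are combined by summing a frequency. NWayLexiconMergeSketch, the use of Integer in place of LexiconEntry, and the List result in place of a LexiconOutputStream are all hypothetical simplifications.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Terrier API.
class NWayLexiconMergeSketch {
    // The current entry of one input stream, together with the stream it came from.
    private static final class Head {
        final Map.Entry<String, Integer> entry;
        final Iterator<Map.Entry<String, Integer>> source;

        Head(Map.Entry<String, Integer> entry, Iterator<Map.Entry<String, Integer>> source) {
            this.entry = entry;
            this.source = source;
        }
    }

    // Merges term-sorted inputs into one term-sorted list, summing frequencies of equal terms.
    public static List<Map.Entry<String, Integer>> merge(
            List<Iterator<Map.Entry<String, Integer>>> inputs) {
        PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparing((Head h) -> h.entry.getKey()));
        for (Iterator<Map.Entry<String, Integer>> it : inputs) {
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            String term = smallest.entry.getKey();
            int tf = smallest.entry.getValue();
            advance(smallest, heap);
            // Fold in every other stream whose current term is the same.
            while (!heap.isEmpty() && heap.peek().entry.getKey().equals(term)) {
                Head same = heap.poll();
                tf += same.entry.getValue();
                advance(same, heap);
            }
            out.add(new AbstractMap.SimpleEntry<>(term, tf));
        }
        return out;
    }

    // Re-inserts the stream's next entry into the heap, if it has one.
    private static void advance(Head consumed, PriorityQueue<Head> heap) {
        if (consumed.source.hasNext()) {
            heap.add(new Head(consumed.source.next(), consumed.source));
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> first = new TreeMap<>(Map.of("apple", 1, "cat", 2));
        TreeMap<String, Integer> second = new TreeMap<>(Map.of("bat", 5, "cat", 1));
        List<Iterator<Map.Entry<String, Integer>>> inputs = new ArrayList<>();
        inputs.add(first.entrySet().iterator());
        inputs.add(second.entrySet().iterator());
        System.out.println(merge(inputs));
    }
}
```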
-
mergeTwoLexicons
protected void mergeTwoLexicons(java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis1, java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> lis2, LexiconOutputStream<java.lang.String> los) throws java.io.IOException
Merges the two LexiconInputStreams into the given LexiconOutputStream.
- Parameters:
lis1 - First lexicon to be merged
lis2 - Second lexicon to be merged
los - Lexicon to be merged to
- Throws:
java.io.IOException
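A two-way merge of term-sorted streams can be sketched as follows. This is a minimal illustration, assuming both inputs iterate in ascending term order and that entries for a shared term are combined by summing a frequency. TwoWayLexiconMergeSketch is hypothetical: it uses Integer in place of LexiconEntry and returns a List rather than writing to a LexiconOutputStream.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Terrier API.
class TwoWayLexiconMergeSketch {
    // Merges two term-sorted inputs into one term-sorted list, summing frequencies of equal terms.
    public static List<Map.Entry<String, Integer>> merge(
            Iterator<Map.Entry<String, Integer>> lis1,
            Iterator<Map.Entry<String, Integer>> lis2) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Map.Entry<String, Integer> a = nextOrNull(lis1);
        Map.Entry<String, Integer> b = nextOrNull(lis2);
        while (a != null && b != null) {
            int cmp = a.getKey().compareTo(b.getKey());
            if (cmp < 0) {            // term only in the first stream so far
                out.add(a);
                a = nextOrNull(lis1);
            } else if (cmp > 0) {     // term only in the second stream so far
                out.add(b);
                b = nextOrNull(lis2);
            } else {                  // same term in both: combine the statistics
                out.add(new AbstractMap.SimpleEntry<>(a.getKey(), a.getValue() + b.getValue()));
                a = nextOrNull(lis1);
                b = nextOrNull(lis2);
            }
        }
        // Copy whatever remains of the stream that is not yet exhausted.
        while (a != null) {
            out.add(a);
            a = nextOrNull(lis1);
        }
        while (b != null) {
            out.add(b);
            b = nextOrNull(lis2);
        }
        return out;
    }

    private static <T> T nextOrNull(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> first = new TreeMap<>(Map.of("apple", 2, "cat", 1));
        TreeMap<String, Integer> second = new TreeMap<>(Map.of("bat", 4, "cat", 3));
        System.out.println(merge(first.entrySet().iterator(), second.entrySet().iterator()));
    }
}
```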
-
createLexiconIndex
public static void createLexiconIndex(IndexOnDisk index) throws java.io.IOException
Deprecated. Use optimise instead.
Creates a lexicon index for the specified index.
- Parameters:
index - IndexOnDisk to make the lexicon index for
- Throws:
java.io.IOException
-
createLexiconHash
public static void createLexiconHash(IndexOnDisk index) throws java.io.IOException
Deprecated. Use optimise instead.
Creates a lexicon hash for the specified index.
- Parameters:
index - IndexOnDisk to make the lexicon hash for
- Throws:
java.io.IOException
-
optimiseLexicon
public void optimiseLexicon()
Optimises the lexicon.
-
optimise
public static void optimise(IndexOnDisk index, java.lang.String structureName)
Optimises the lexicon, e.g. the lexid file.
-
reAssignTermIds
public static void reAssignTermIds(IndexOnDisk index, java.lang.String structureName, int numEntries) throws java.io.IOException
Reassigns the termids within the named lexicon structure so that they ascend as term frequency descends, i.e. the term with termid 0 will have the highest frequency.
- Parameters:
index -
structureName -
numEntries -
- Throws:
java.io.IOException
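The termid ordering described for reAssignTermIds can be illustrated with a small in-memory sketch. It assumes a plain map from term to frequency and breaks frequency ties alphabetically, which is a choice made for this example and not something the page specifies. TermIdReassignmentSketch is hypothetical and does not touch any index structure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Terrier API.
class TermIdReassignmentSketch {
    // Returns term -> new termid, where termid 0 goes to the most frequent term.
    public static Map<String, Integer> reassign(Map<String, Integer> termFrequencies) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(termFrequencies.entrySet());
        entries.sort((x, y) -> {
            // Descending frequency first.
            int byFrequency = Integer.compare(y.getValue(), x.getValue());
            // Tie-break by term; an assumption made for determinism in this sketch.
            return byFrequency != 0 ? byFrequency : x.getKey().compareTo(y.getKey());
        });
        Map<String, Integer> termIds = new LinkedHashMap<>();
        int nextId = 0;
        for (Map.Entry<String, Integer> e : entries) {
            termIds.put(e.getKey(), nextId++);
        }
        return termIds;
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = reassign(Map.of("the", 100, "cat", 7, "sat", 7, "zebra", 1));
        System.out.println(ids);
    }
}
```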
-
getLexInputStream
protected java.util.Iterator<java.util.Map.Entry<java.lang.String,LexiconEntry>> getLexInputStream(java.lang.String structureName) throws java.io.IOException
Returns the lexicon input stream for the current index at the specified filename.
- Throws:
java.io.IOException
-
getLexOutputStream
protected LexiconOutputStream<java.lang.String> getLexOutputStream(java.lang.String structureName) throws java.io.IOException
Returns the lexicon output stream for the current index at the specified filename.
- Throws:
java.io.IOException
-
-