Package org.terrier.indexing
Class TRECFullTokenizer
- java.lang.Object
-
- org.terrier.indexing.TRECFullTokenizer
-
- All Implemented Interfaces:
Tokenizer
public class TRECFullTokenizer extends java.lang.Object implements Tokenizer
This class is the tokenizer used for indexing TREC topic files. It can be used for tokenizing other topic file formats, provided that the tags to skip and to process are specified accordingly. NB: This class accepts only A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, use TRECFullUTFTokenizer instead.
- Author:
- Gianni Amati, Vassilis Plachouras
- See Also:
TagSet
-
-
Field Summary
Fields
Modifier and Type Field: Description
java.io.BufferedReader br: The input reader.
long counter: The number of bytes read from the input.
boolean EOD: The end of document.
boolean EOF: The end of file from the buffered reader.
boolean error: A flag which is set when errors are encountered.
protected TagSet exactTagSet: The set of exact tags.
protected boolean ignoreMissingClosingTags: An option to ignore missing closing tags.
boolean inDocnoTag: Is in docno tag?
boolean inTagToProcess: Is in tag to process?
boolean inTagToSkip: Is in tag to skip?
static int lastChar: The last character read.
protected static org.slf4j.Logger logger
protected static boolean lowercase: Transform to lowercase or not?
int number_of_terms: A counter for the number of terms.
protected static java.util.Stack<java.lang.String> stk: The stack where the tags are pushed and popped accordingly.
protected java.lang.StringBuilder sw
protected java.lang.StringBuilder tagNameSB
protected TagSet tagSet: The tag set to use.
protected static int tokenMaximumLength: The maximum length of a token in the check method.
-
Constructor Summary
Constructors
Constructor: Description
TRECFullTokenizer(): Constructs an instance of the TRECFullTokenizer.
TRECFullTokenizer(java.io.BufferedReader _br): Constructs an instance of the TRECFullTokenizer, given the buffered reader.
TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet): Constructs an instance of the TRECFullTokenizer with non-default tags.
TRECFullTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader _br): Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.
-
Method Summary
Methods
Modifier and Type Method: Description
protected java.lang.String check(java.lang.String s): A restricted check function for discarding uncommon, or 'strange' terms.
void close(): Closes the buffered reader associated with the tokenizer.
void closeBufferedReader(): Closes the buffered reader associated with the tokenizer.
java.lang.String currentTag(): Returns the name of the tag the tokenizer is currently in.
long getByteOffset(): Returns the number of bytes read from the current file.
boolean inDocnoTag(): Indicates whether the tokenizer is in the special document number tag.
boolean inTagToProcess(): Returns true if the current tag is to be processed.
boolean inTagToSkip(): Returns true if the current tag is to be skipped.
boolean isEndOfDocument(): Returns true if the end of document is encountered.
boolean isEndOfFile(): Returns true if the end of file is encountered.
void nextDocument(): Proceeds to the next document.
java.lang.String nextToken(): Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
protected void processEndOfTag(java.lang.String tag): The encountered tag, which must be a closing tag, is matched with the tag on the stack.
void setIgnoreMissingClosingTags(boolean toIgnore): Sets the value of ignoreMissingClosingTags.
void setInput(java.io.BufferedReader _br): Sets the input of the tokenizer.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
ignoreMissingClosingTags
protected boolean ignoreMissingClosingTags
An option to ignore missing closing tags. Used for the query files.
-
lastChar
public static int lastChar
The last character read.
-
number_of_terms
public int number_of_terms
A counter for the number of terms.
-
EOF
public boolean EOF
The end of file from the buffered reader.
-
EOD
public boolean EOD
The end of document.
-
error
public boolean error
A flag which is set when errors are encountered.
-
br
public java.io.BufferedReader br
The input reader.
-
counter
public long counter
The number of bytes read from the input.
-
stk
protected static java.util.Stack<java.lang.String> stk
The stack where the tags are pushed and popped accordingly.
-
tagSet
protected TagSet tagSet
The tag set to use.
-
exactTagSet
protected TagSet exactTagSet
The set of exact tags.
-
tokenMaximumLength
protected static final int tokenMaximumLength
The maximum length of a token in the check method.
-
lowercase
protected static final boolean lowercase
Transform to lowercase or not?
-
inTagToProcess
public boolean inTagToProcess
Is in tag to process?
-
inTagToSkip
public boolean inTagToSkip
Is in tag to skip?
-
inDocnoTag
public boolean inDocnoTag
Is in docno tag?
-
sw
protected final java.lang.StringBuilder sw
-
tagNameSB
protected final java.lang.StringBuilder tagNameSB
-
-
Constructor Detail
-
TRECFullTokenizer
public TRECFullTokenizer()
Constructs an instance of the TRECFullTokenizer. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
-
TRECFullTokenizer
public TRECFullTokenizer(java.io.BufferedReader _br)
Constructs an instance of the TRECFullTokenizer, given the buffered reader. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
- Parameters:
_br - java.io.BufferedReader the input stream to tokenize
-
TRECFullTokenizer
public TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)
Constructs an instance of the TRECFullTokenizer with non-default tags.
- Parameters:
_tagSet - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
-
TRECFullTokenizer
public TRECFullTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader _br)
Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.
- Parameters:
_ts - TagSet the document tags to process.
_exactSet - TagSet the document tags to process exactly, without applying strict checks.
_br - java.io.BufferedReader the input to tokenize.
-
-
Method Detail
-
check
protected java.lang.String check(java.lang.String s)
A restricted check function for discarding uncommon, or 'strange' terms.
- Parameters:
s - The term to check.
- Returns:
- the term if it passed the check, otherwise null.
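The contract described above (return the term when it passes, otherwise null) can be illustrated with a small standalone sketch. This is not Terrier's implementation: the class name `TermCheckSketch`, the limit of 20 characters, and the decision to reject over-long terms outright are all assumptions made for illustration. Only the A-Z, a-z, 0-9 restriction and the existence of a maximum length and a lowercase flag come from this page.

```java
import java.util.Locale;

// Hypothetical sketch of a restricted term check; not the actual TRECFullTokenizer.check().
final class TermCheckSketch {
    // Assumed value: the real tokenMaximumLength is not stated on this page.
    static final int MAX_TOKEN_LENGTH = 20;
    // Mirrors the idea of the 'lowercase' flag; the value here is an assumption.
    static final boolean LOWERCASE = true;

    /** Returns the (optionally lowercased) term if it passes the check, otherwise null. */
    static String check(String s) {
        if (s == null || s.isEmpty() || s.length() > MAX_TOKEN_LENGTH) {
            return null;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean alphanumeric = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (!alphanumeric) {
                return null; // any character outside A-Z, a-z, 0-9 rejects the term
            }
        }
        return LOWERCASE ? s.toLowerCase(Locale.ROOT) : s;
    }

    public static void main(String[] args) {
        System.out.println(check("Retrieval"));
        System.out.println(check("caf\u00e9"));
    }
}
```

A term containing an accented character is rejected by this sketch, which is the kind of case where the class description suggests TRECFullUTFTokenizer instead.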
-
close
public void close()
Closes the buffered reader associated with the tokenizer.
-
closeBufferedReader
public void closeBufferedReader()
Closes the buffered reader associated with the tokenizer.
-
currentTag
public java.lang.String currentTag()
Returns the name of the tag the tokenizer is currently in.
- Specified by:
currentTag in interface Tokenizer
- Returns:
- the name of the tag the tokenizer is currently in
-
inDocnoTag
public boolean inDocnoTag()
Indicates whether the tokenizer is in the special document number tag.
- Specified by:
inDocnoTag in interface Tokenizer
- Returns:
- true if the tokenizer is in the document number tag.
-
inTagToProcess
public boolean inTagToProcess()
Returns true if the current tag is to be processed.
- Specified by:
inTagToProcess in interface Tokenizer
- Returns:
- true if the tag is to be processed, otherwise false.
-
inTagToSkip
public boolean inTagToSkip()
Returns true if the current tag is to be skipped.
- Specified by:
inTagToSkip in interface Tokenizer
- Returns:
- true if the tag is to be skipped, otherwise false.
-
isEndOfDocument
public boolean isEndOfDocument()
Returns true if the end of document is encountered.
- Specified by:
isEndOfDocument in interface Tokenizer
- Returns:
- true if the end of document is encountered.
-
isEndOfFile
public boolean isEndOfFile()
Returns true if the end of file is encountered.
- Specified by:
isEndOfFile in interface Tokenizer
- Returns:
- true if the end of file is encountered.
-
nextDocument
public void nextDocument()
Proceeds to the next document.
- Specified by:
nextDocument in interface Tokenizer
-
nextToken
public java.lang.String nextToken()
Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
-
processEndOfTag
protected void processEndOfTag(java.lang.String tag)
The encountered tag, which must be a closing tag, is matched with the tag on top of the stack. If they are not the same, consistency is restored by popping tags from the stack, the observed tag included. If the stack becomes empty after that, the end of document flag EOD is set to true.
- Parameters:
tag - The closing tag to be tested against the content of the stack.
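The stack reconciliation described above can be sketched in a few lines. This is a minimal standalone illustration, not the Terrier source: `TagStackSketch`, `openTag`, `depth` and `endOfDocument` are hypothetical names, and ignoring a closing tag that is not on the stack at all is an assumption (the description does not say what happens in that case).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the closing-tag logic; not the actual TRECFullTokenizer code.
final class TagStackSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    // Plays the role of the EOD flag in the description above.
    boolean endOfDocument = false;

    /** Records an opening tag. */
    void openTag(String tag) {
        stack.push(tag);
        endOfDocument = false;
    }

    /** Matches a closing tag against the stack, popping until consistency is restored. */
    void processEndOfTag(String tag) {
        if (stack.isEmpty()) {
            return;
        }
        if (stack.peek().equals(tag)) {
            stack.pop(); // well-formed case: the closing tag matches the innermost open tag
        } else if (stack.contains(tag)) {
            // Missing closing tags in between: pop everything down to and including 'tag'.
            while (!stack.pop().equals(tag)) {
                // keep popping unclosed inner tags
            }
        }
        // Assumption: a closing tag that was never opened is silently ignored.
        if (stack.isEmpty()) {
            endOfDocument = true;
        }
    }

    int depth() {
        return stack.size();
    }
}
```

For example, after opening DOC, TEXT and P, closing TEXT pops both P and TEXT and leaves DOC on the stack; closing DOC then empties the stack and sets the end-of-document flag.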
-
setIgnoreMissingClosingTags
public void setIgnoreMissingClosingTags(boolean toIgnore)
Sets the value of ignoreMissingClosingTags.
- Parameters:
toIgnore - boolean whether to ignore missing closing tags
-
getByteOffset
public long getByteOffset()
Returns the number of bytes read from the current file.
- Specified by:
getByteOffset in interface Tokenizer
- Returns:
- long the byte offset
-
-