Class TRECFullTokenizer

  • All Implemented Interfaces:
    Tokenizer

    public class TRECFullTokenizer
    extends java.lang.Object
    implements Tokenizer
    This class is the tokenizer used for indexing TREC topic files. It can be used for tokenizing other topic file formats, provided that the tags to skip and to process are specified accordingly.

    NB: This class only accepts A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, please use TRECFullUTFTokenizer instead.

    Author:
    Gianni Amati, Vassilis Plachouras
    See Also:
    TagSet
    • Field Summary

      Fields 
      Modifier and Type Field Description
      java.io.BufferedReader br
      The input reader.
      long counter
      The number of bytes read from the input.
      boolean EOD
      The end of document.
      boolean EOF
      The end of file from the buffered reader.
      boolean error
      A flag which is set when errors are encountered.
      protected TagSet exactTagSet
      The set of exact tags.
      protected boolean ignoreMissingClosingTags
      An option to ignore missing closing tags.
      boolean inDocnoTag
      Is in docno tag?
      boolean inTagToProcess
      Is in tag to process?
      boolean inTagToSkip
      Is in tag to skip?
      static int lastChar
      The last character read.
      protected static org.slf4j.Logger logger  
      protected static boolean lowercase
      Whether to transform tokens to lowercase.
      int number_of_terms
      A counter for the number of terms.
      protected static java.util.Stack<java.lang.String> stk
      The stack where the tags are pushed and popped accordingly.
      protected java.lang.StringBuilder sw  
      protected java.lang.StringBuilder tagNameSB  
      protected TagSet tagSet
      The tag set to use.
      protected static int tokenMaximumLength
      The maximum length of a token in the check method.
    • Constructor Summary

      Constructors 
      Constructor Description
      TRECFullTokenizer()
      Constructs an instance of the TRECFullTokenizer.
      TRECFullTokenizer​(java.io.BufferedReader _br)
      Constructs an instance of the TRECFullTokenizer, given the buffered reader.
      TRECFullTokenizer​(TagSet _tagSet, TagSet _exactSet)
      Constructs an instance of the TRECFullTokenizer with non-default tags.
      TRECFullTokenizer​(TagSet _ts, TagSet _exactSet, java.io.BufferedReader _br)
      Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected java.lang.String check​(java.lang.String s)
      A restricted check function for discarding uncommon, or 'strange' terms.
      void close()
      Closes the buffered reader associated with the tokenizer.
      void closeBufferedReader()
      Closes the buffered reader associated with the tokenizer.
      java.lang.String currentTag()
      Returns the name of the tag the tokenizer is currently in.
      long getByteOffset()
      Returns the number of bytes read from the current file.
      boolean inDocnoTag()
      Indicates whether the tokenizer is in the special document number tag.
      boolean inTagToProcess()
      Returns true if the tokenizer is currently in a tag to be processed.
      boolean inTagToSkip()
      Returns true if the tokenizer is currently in a tag to be skipped.
      boolean isEndOfDocument()
      Returns true if the end of document is encountered.
      boolean isEndOfFile()
      Returns true if the end of file is encountered.
      void nextDocument()
      Proceed to the next document.
      java.lang.String nextToken()
      Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
      protected void processEndOfTag​(java.lang.String tag)
      The encountered tag, which must be a closing tag, is matched with the tag on the stack.
      void setIgnoreMissingClosingTags​(boolean toIgnore)
      Sets the value of ignoreMissingClosingTags.
      void setInput​(java.io.BufferedReader _br)
      Sets the input of the tokenizer.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
      • ignoreMissingClosingTags

        protected boolean ignoreMissingClosingTags
        An option to ignore missing closing tags. Used for the query files.
      • lastChar

        public static int lastChar
        The last character read.
      • number_of_terms

        public int number_of_terms
        A counter for the number of terms.
      • EOF

        public boolean EOF
        The end of file from the buffered reader.
      • EOD

        public boolean EOD
        The end of document.
      • error

        public boolean error
        A flag which is set when errors are encountered.
      • br

        public java.io.BufferedReader br
        The input reader.
      • counter

        public long counter
        The number of bytes read from the input.
      • stk

        protected static java.util.Stack<java.lang.String> stk
        The stack where the tags are pushed and popped accordingly.
      • tagSet

        protected TagSet tagSet
        The tag set to use.
      • exactTagSet

        protected TagSet exactTagSet
        The set of exact tags.
      • tokenMaximumLength

        protected static final int tokenMaximumLength
        The maximum length of a token in the check method.
      • lowercase

        protected static final boolean lowercase
        Whether to transform tokens to lowercase.
      • inTagToProcess

        public boolean inTagToProcess
        Is in tag to process?
      • inTagToSkip

        public boolean inTagToSkip
        Is in tag to skip?
      • inDocnoTag

        public boolean inDocnoTag
        Is in docno tag?
      • sw

        protected final java.lang.StringBuilder sw
      • tagNameSB

        protected final java.lang.StringBuilder tagNameSB
    • Constructor Detail

      • TRECFullTokenizer

        public TRECFullTokenizer()
        Constructs an instance of the TRECFullTokenizer. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
      • TRECFullTokenizer

        public TRECFullTokenizer​(java.io.BufferedReader _br)
        Constructs an instance of the TRECFullTokenizer, given the buffered reader. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
        Parameters:
        _br - java.io.BufferedReader the input stream to tokenize
      • TRECFullTokenizer

        public TRECFullTokenizer​(TagSet _tagSet,
                                 TagSet _exactSet)
        Constructs an instance of the TRECFullTokenizer with non-default tags.
        Parameters:
        _tagSet - TagSet the document tags to process.
        _exactSet - TagSet the document tags to process exactly, without applying strict checks.
      • TRECFullTokenizer

        public TRECFullTokenizer​(TagSet _ts,
                                 TagSet _exactSet,
                                 java.io.BufferedReader _br)
        Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.
        Parameters:
        _ts - TagSet the document tags to process.
        _exactSet - TagSet the document tags to process exactly, without applying strict checks.
        _br - java.io.BufferedReader the input to tokenize.
    • Method Detail

      • check

        protected java.lang.String check​(java.lang.String s)
        A restricted check function for discarding uncommon, or 'strange' terms.
        Parameters:
        s - The term to check.
        Returns:
        the term if it passed the check, otherwise null.
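
The following is a minimal sketch of what a restricted check of this kind could look like, based only on the constraints stated on this page (A-Z, a-z and 0-9 as the valid characters, a maximum token length, and optional lowercasing). It is not Terrier's actual implementation: the class name TermCheckSketch, the MAX_LENGTH value of 20 and the exact rejection rules are assumptions made for illustration.

```java
import java.util.Locale;

// Illustrative sketch only; not the real TRECFullTokenizer.check().
final class TermCheckSketch {
    // Hypothetical cap: the actual value of tokenMaximumLength is not stated on this page.
    static final int MAX_LENGTH = 20;
    // Mirrors the idea of the 'lowercase' flag; the value here is an assumption.
    static final boolean LOWERCASE = true;

    /** Returns the (optionally lowercased) term if it passes the check, otherwise null. */
    static String check(String s) {
        if (s == null || s.isEmpty() || s.length() > MAX_LENGTH) {
            return null;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean valid = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (!valid) {
                return null; // any character outside A-Z, a-z, 0-9 rejects the term
            }
        }
        return LOWERCASE ? s.toLowerCase(Locale.ROOT) : s;
    }

    public static void main(String[] args) {
        System.out.println(check("Terrier"));    // prints "terrier"
        System.out.println(check("caf\u00e9"));  // prints "null" (non-ASCII letter)
    }
}
```

Under these assumptions, a term containing accented or other non-ASCII characters is dropped entirely, which is the situation where the NB above suggests TRECFullUTFTokenizer instead.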
      • close

        public void close()
        Closes the buffered reader associated with the tokenizer.
      • closeBufferedReader

        public void closeBufferedReader()
        Closes the buffered reader associated with the tokenizer.
      • currentTag

        public java.lang.String currentTag()
        Returns the name of the tag the tokenizer is currently in.
        Specified by:
        currentTag in interface Tokenizer
        Returns:
        the name of the tag the tokenizer is currently in
      • inDocnoTag

        public boolean inDocnoTag()
        Indicates whether the tokenizer is in the special document number tag.
        Specified by:
        inDocnoTag in interface Tokenizer
        Returns:
        true if the tokenizer is in the document number tag.
      • inTagToProcess

        public boolean inTagToProcess()
        Returns true if the tokenizer is currently in a tag to be processed.
        Specified by:
        inTagToProcess in interface Tokenizer
        Returns:
        true if the current tag is to be processed, otherwise false.
      • inTagToSkip

        public boolean inTagToSkip()
        Returns true if the tokenizer is currently in a tag to be skipped.
        Specified by:
        inTagToSkip in interface Tokenizer
        Returns:
        true if the current tag is to be skipped, otherwise false.
      • isEndOfDocument

        public boolean isEndOfDocument()
        Returns true if the end of document is encountered.
        Specified by:
        isEndOfDocument in interface Tokenizer
        Returns:
        true if the end of document is encountered.
      • isEndOfFile

        public boolean isEndOfFile()
        Returns true if the end of file is encountered.
        Specified by:
        isEndOfFile in interface Tokenizer
        Returns:
        true if the end of file is encountered.
      • nextDocument

        public void nextDocument()
        Proceed to the next document.
        Specified by:
        nextDocument in interface Tokenizer
      • nextToken

        public java.lang.String nextToken()
        Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
        Specified by:
        nextToken in interface Tokenizer
        Returns:
        String the next token of the document, or null if the token was discarded during tokenisation.
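
Because nextToken() may return null for a discarded token, callers need to skip nulls while iterating. The sketch below shows one plausible reading loop built from the method names listed on this page. To keep it runnable without Terrier on the classpath, it uses a hypothetical stand-in interface (MiniTokenizer) and a toy in-memory implementation (ListTokenizer); neither is part of Terrier, and the real tokenizer's end-of-document behaviour may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; MiniTokenizer and ListTokenizer are hypothetical stand-ins.
final class TokenLoopSketch {
    /** Stand-in exposing four of the methods documented on this page. */
    interface MiniTokenizer {
        String nextToken();
        boolean isEndOfDocument();
        boolean isEndOfFile();
        void nextDocument();
    }

    /** Toy tokenizer over in-memory documents; a null entry models a discarded token. */
    static final class ListTokenizer implements MiniTokenizer {
        private final List<List<String>> docs;
        private int doc = 0;
        private int pos = 0;

        ListTokenizer(List<List<String>> docs) { this.docs = docs; }

        public String nextToken() { return docs.get(doc).get(pos++); }
        public boolean isEndOfDocument() { return isEndOfFile() || pos >= docs.get(doc).size(); }
        public boolean isEndOfFile() { return doc >= docs.size(); }
        public void nextDocument() { doc++; pos = 0; }
    }

    /** Collects every non-null token from every document. */
    static List<String> collect(MiniTokenizer t) {
        List<String> out = new ArrayList<>();
        while (!t.isEndOfFile()) {
            while (!t.isEndOfDocument()) {
                String token = t.nextToken();
                if (token != null) { // null means the token was discarded
                    out.add(token);
                }
            }
            t.nextDocument();
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("trec", null, "topic"),
                Arrays.asList("query"));
        System.out.println(collect(new ListTokenizer(docs))); // prints "[trec, topic, query]"
    }
}
```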
      • processEndOfTag

        protected void processEndOfTag​(java.lang.String tag)
        The encountered tag, which must be a closing tag, is matched against the tag on top of the stack. If they differ, consistency is restored by popping tags from the stack, up to and including the observed tag. If the stack is empty afterwards, the end-of-document flag EOD is set to true.
        Parameters:
        tag - The closing tag to be tested against the content of the stack.
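
The stack-based recovery described above can be sketched as follows. This is an illustration of the technique as worded on this page, not Terrier's source: the class TagStackSketch, the open() helper and the decision to ignore a closing tag that was never opened are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; not the real TRECFullTokenizer.processEndOfTag().
final class TagStackSketch {
    final Deque<String> stack = new ArrayDeque<>();
    boolean endOfDocument = false;

    /** Hypothetical helper: records an opening tag. */
    void open(String tag) {
        stack.push(tag);
    }

    /** Matches a closing tag against the stack, repairing mismatches by popping. */
    void processEndOfTag(String tag) {
        if (stack.isEmpty()) {
            return; // nothing is open, so there is nothing to close
        }
        if (stack.peek().equals(tag)) {
            stack.pop(); // well-formed case: closes the innermost open tag
        } else if (stack.contains(tag)) {
            // Mismatch: pop unclosed inner tags, then the observed tag itself.
            while (!stack.isEmpty()) {
                if (stack.pop().equals(tag)) {
                    break;
                }
            }
        }
        // Assumption: a closing tag that was never opened leaves the stack unchanged.
        if (stack.isEmpty()) {
            endOfDocument = true;
        }
    }

    public static void main(String[] args) {
        TagStackSketch s = new TagStackSketch();
        s.open("DOC");
        s.open("TEXT");
        s.open("P");               // <P> is never closed in this input
        s.processEndOfTag("TEXT"); // pops P and TEXT
        System.out.println(s.stack.size() + " " + s.endOfDocument); // prints "1 false"
        s.processEndOfTag("DOC");
        System.out.println(s.stack.size() + " " + s.endOfDocument); // prints "0 true"
    }
}
```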
      • setIgnoreMissingClosingTags

        public void setIgnoreMissingClosingTags​(boolean toIgnore)
        Sets the value of ignoreMissingClosingTags.
        Parameters:
        toIgnore - boolean whether to ignore missing closing tags
      • getByteOffset

        public long getByteOffset()
        Returns the number of bytes read from the current file.
        Specified by:
        getByteOffset in interface Tokenizer
        Returns:
        long the byte offset
      • setInput

        public void setInput​(java.io.BufferedReader _br)
        Sets the input of the tokenizer.
        Specified by:
        setInput in interface Tokenizer
        Parameters:
        _br - BufferedReader the input stream