Class UTFTokeniser

  • All Implemented Interfaces:
    java.io.Serializable

    public class UTFTokeniser
    extends Tokeniser
    Tokenises text obtained from a text stream. Tokenisation is more liberal than in EnglishTokeniser: a character is accepted as part of a token if it satisfies at least one of three rules:
    1. Character.isLetterOrDigit() returns true
    2. Character.getType() returns Character.NON_SPACING_MARK
    3. Character.getType() returns Character.COMBINING_SPACING_MARK
    Any other character ends the current token and starts a new one.
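
The three character rules can be sketched as a small predicate plus a splitting loop. This is an illustration only, not Terrier's actual implementation: the class name TokenCharSketch and both method names are hypothetical, and the sketch assumes the rules are applied per Unicode code point, which the description above does not specify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Terrier: illustrates the three character rules.
final class TokenCharSketch {

    // True if the code point may appear inside a token.
    static boolean isTokenChar(int codePoint) {
        if (Character.isLetterOrDigit(codePoint)) {
            return true;                                    // rule 1
        }
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK           // rule 2
            || type == Character.COMBINING_SPACING_MARK;    // rule 3
    }

    // Splits text into tokens; any character failing all three rules ends the current token.
    // Assumption: text is processed by code point rather than by UTF-16 char.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Rules 2 and 3 matter for scripts that attach marks to a base letter: under this sketch a Devanagari word whose vowel signs are combining marks stays a single token instead of being split at each mark.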

    In addition, terms are checked as follows to reduce index noise:

    1. Any term longer than max.term.length (20 characters by default) is discarded.
    2. Any term containing more than 4 digits is discarded.
    3. Any term containing more than 3 consecutive identical characters is discarded.
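
The three term checks can be sketched as a single pass over the term. This is a minimal illustration under stated assumptions, not the class's actual code: TermCheckSketch, isAcceptable and the three constants are hypothetical names, and the limits are hard-coded to the values given in the description.

```java
// Hypothetical helper, not part of Terrier: illustrates the three term checks.
final class TermCheckSketch {
    static final int MAX_TERM_LENGTH = 20;      // documented default of max.term.length
    static final int MAX_DIGITS = 4;            // more than 4 digits is rejected
    static final int MAX_SAME_CONSECUTIVE = 3;  // a run of more than 3 identical characters is rejected

    // Returns true if the term passes all three checks.
    static boolean isAcceptable(String term) {
        if (term.length() > MAX_TERM_LENGTH) {
            return false;                       // check 1: too long
        }
        int digits = 0;
        int run = 0;
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            // Length of the current run of identical characters.
            run = (i > 0 && c == previous) ? run + 1 : 1;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return false;                   // checks 2 and 3
            }
            previous = c;
        }
        return true;
    }
}
```

Under this sketch "2024" is kept, while "12345" (five digits) and "aaaab" (a run of four identical characters) are discarded. A term such as "1111" has only four digits but is still rejected by the consecutive-character check.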
    Properties:
    • lowercase - whether all terms are lowercased.
    • max.term.length - maximum acceptable term length in characters; the default is 20.
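
As an illustration of how these two properties might be read and applied, the sketch below uses a plain java.util.Properties object. It is an assumption-laden example: TokeniserSettingsSketch is a hypothetical class, the real tokeniser obtains its configuration through Terrier's own mechanism, and the default of true for lowercase is assumed here because the description does not state one.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical holder for the two documented properties; not part of Terrier.
final class TokeniserSettingsSketch {
    final boolean lowercase;
    final int maxTermLength;

    TokeniserSettingsSketch(Properties props) {
        // Assumed default "true": the documentation gives no default for "lowercase".
        lowercase = Boolean.parseBoolean(props.getProperty("lowercase", "true"));
        // Documented default: 20.
        maxTermLength = Integer.parseInt(props.getProperty("max.term.length", "20"));
    }

    // Lowercases the term only when the "lowercase" property is enabled.
    String normalise(String term) {
        return lowercase ? term.toLowerCase(Locale.ROOT) : term;
    }
}
```

With an empty Properties object the sketch falls back to a maximum term length of 20; setting max.term.length to another integer overrides it.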
    Author:
    Gianni Amati, Ben He, Vassilis Plachouras, Craig Macdonald
    See Also:
    EnglishTokeniser, Character, Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static boolean DROP_LONG_TOKENS
      Whether tokens longer than MAX_TERM_LENGTH should be dropped.
      protected static int maxNumOfDigitsPerTerm
      The maximum number of digits that are allowed in valid terms.
      protected static int maxNumOfSameConseqLettersPerTerm
      The maximum number of consecutive same letters or digits that are allowed in valid terms.
    • Constructor Summary

      Constructors 
      Constructor Description
      UTFTokeniser()  
    • Field Detail

      • maxNumOfDigitsPerTerm

        protected static final int maxNumOfDigitsPerTerm
        The maximum number of digits that are allowed in valid terms.
        See Also:
        Constant Field Values
      • maxNumOfSameConseqLettersPerTerm

        protected static final int maxNumOfSameConseqLettersPerTerm
        The maximum number of consecutive same letters or digits that are allowed in valid terms.
        See Also:
        Constant Field Values
      • DROP_LONG_TOKENS

        protected static final boolean DROP_LONG_TOKENS
        Whether tokens longer than MAX_TERM_LENGTH should be dropped.
        See Also:
        Constant Field Values
    • Constructor Detail

      • UTFTokeniser

        public UTFTokeniser()
    • Method Detail

      • tokenise

        public TokenStream tokenise(java.io.Reader reader)
        Description copied from class: Tokeniser
        Tokenises the text obtained from the specified reader.
        Specified by:
        tokenise in class Tokeniser
        Parameters:
        reader - Stream of text to be tokenised
        Returns:
        a TokenStream of the tokens found in the text.