Class EnglishTokeniser

  • All Implemented Interfaces:
    java.io.Serializable

    public class EnglishTokeniser
    extends Tokeniser
    Tokenises text read from a text stream, assuming the text is in English. Acceptable characters are A-Z, a-z and 0-9. Any other character ends the current token, so the next acceptable character starts a new one.
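    The splitting rule described above can be illustrated with a minimal, self-contained sketch. This is not the Terrier implementation: the class AlnumSplitSketch and its methods are hypothetical names introduced only for illustration, and the lowercase flag is assumed to behave like the lowercase property of this class.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of the Terrier API.
final class AlnumSplitSketch {

    // Acceptable characters are A-Z, a-z and 0-9, as stated in the class description.
    static boolean isAcceptable(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    // Reads the stream one character at a time; any unacceptable character
    // closes the token being built.
    static List<String> split(Reader reader, boolean lowercase) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (isAcceptable(c)) {
                current.append((char) c);
            } else if (current.length() > 0) {
                tokens.add(finish(current, lowercase));
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(finish(current, lowercase)); // flush the last token
        }
        return tokens;
    }

    // Convenience overload for in-memory text.
    static List<String> split(String text, boolean lowercase) {
        try {
            return split(new StringReader(text), lowercase);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String finish(StringBuilder sb, boolean lowercase) {
        String token = sb.toString();
        return lowercase ? token.toLowerCase(Locale.ROOT) : token;
    }
}
```

    In this sketch, calling AlnumSplitSketch.split("Hello, world-2024!", true) yields the tokens hello, world and 2024, because the comma, space, hyphen and exclamation mark each end the current token.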

    In addition, each term is checked as follows to reduce index noise:

    1. Any term which is longer than max.term.length (usually 20 characters) is discarded.
    2. Any term which has more than 4 digits is discarded.
    3. Any term which has more than 3 consecutive identical characters is discarded.
    Properties:
    • lowercase - whether all terms should be lowercased.
    • max.term.length - maximum acceptable term length, default is 20.
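    The three term checks listed above can be sketched as a single validity test. This is a minimal illustration under stated assumptions, not the Terrier source: the class TermCheckSketch and its constants are hypothetical, and the limits (20, 4 and 3) are taken directly from the prose above rather than read from the max.term.length property or from the fields of this class.

```java
// Hypothetical illustration only; not part of the Terrier API.
final class TermCheckSketch {

    // Assumed limits, copied from the class description.
    static final int MAX_TERM_LENGTH = 20;      // default of max.term.length
    static final int MAX_DIGITS = 4;            // "more than 4 digits" is rejected
    static final int MAX_SAME_CONSECUTIVE = 3;  // "more than 3 consecutive identical characters" is rejected

    // Returns true if the term passes all three checks.
    static boolean isValidTerm(String term) {
        if (term.length() > MAX_TERM_LENGTH) {
            return false; // check 1: term too long
        }
        int digits = 0;
        int run = 0;
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c >= '0' && c <= '9' && ++digits > MAX_DIGITS) {
                return false; // check 2: too many digits
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            if (run > MAX_SAME_CONSECUTIVE) {
                return false; // check 3: too long a run of one character
            }
            previous = c;
        }
        return true;
    }
}
```

    Under these assumptions, "1234" and "aaa" are kept, while "12345", "aaaa" and any term of 21 or more characters are discarded.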
    Author:
    Gianni Amati, Ben He, Vassilis Plachouras, Craig Macdonald
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static boolean DROP_LONG_TOKENS
      Whether tokens longer than MAX_TERM_LENGTH should be dropped.
      protected static int maxNumOfDigitsPerTerm
      The maximum number of digits that are allowed in valid terms.
      protected static int maxNumOfSameConseqLettersPerTerm
      The maximum number of consecutive same letters or digits that are allowed in valid terms.
    • Field Detail

      • maxNumOfDigitsPerTerm

        protected static final int maxNumOfDigitsPerTerm
        The maximum number of digits that are allowed in valid terms.
        See Also:
        Constant Field Values
      • maxNumOfSameConseqLettersPerTerm

        protected static final int maxNumOfSameConseqLettersPerTerm
        The maximum number of consecutive same letters or digits that are allowed in valid terms.
        See Also:
        Constant Field Values
      • DROP_LONG_TOKENS

        protected static final boolean DROP_LONG_TOKENS
        Whether tokens longer than MAX_TERM_LENGTH should be dropped.
        See Also:
        Constant Field Values
    • Constructor Detail

      • EnglishTokeniser

        public EnglishTokeniser()
    • Method Detail

      • tokenise

        public TokenStream tokenise​(java.io.Reader reader)
        Description copied from class: Tokeniser
        Tokenises the text obtained from the specified reader.
        Specified by:
        tokenise in class Tokeniser
        Parameters:
        reader - Stream of text to be tokenised
        Returns:
        a TokenStream of the tokens found in the text.