Class UTFTwitterTokeniser

  • All Implemented Interfaces:
    java.io.Serializable

    public class UTFTwitterTokeniser
    extends Tokeniser
    A tokeniser designed for use on tweets. It maintains UTF-8 encoding and keeps mentions.
    Since:
    4.0
    Author:
    Richard McCreadie
    See Also:
    Serialized Form
    • Field Summary

      Fields
      Modifier and Type         Field                             Description
      protected static boolean  DROP_LONG_TOKENS                  Whether tokens longer than MAX_TERM_LENGTH should be dropped.
      protected static int      maxNumOfDigitsPerTerm             The maximum number of digits that are allowed in valid terms.
      protected static int      maxNumOfSameConseqLettersPerTerm  The maximum number of consecutive same letters or digits that are allowed in valid terms.
    • Field Detail

      • maxNumOfDigitsPerTerm

        protected static final int maxNumOfDigitsPerTerm
        The maximum number of digits that are allowed in valid terms.
        See Also:
        Constant Field Values
      • maxNumOfSameConseqLettersPerTerm

        protected static final int maxNumOfSameConseqLettersPerTerm
        The maximum number of consecutive same letters or digits that are allowed in valid terms.
        See Also:
        Constant Field Values
      • DROP_LONG_TOKENS

        protected static final boolean DROP_LONG_TOKENS
        Whether tokens longer than MAX_TERM_LENGTH should be dropped.
        See Also:
        Constant Field Values
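
The three fields above describe limits that a term must satisfy to be considered valid. The following is a minimal sketch of how such checks could fit together, not the class's actual implementation. The class name TermFilterSketch, the method check, and all numeric values are hypothetical; the real values are the ones listed under Constant Field Values. Truncating an over-long term when DROP_LONG_TOKENS is false is also an assumption.

```java
/**
 * Hypothetical sketch of the term checks that the three fields describe.
 * Not taken from UTFTwitterTokeniser; all limits are assumed values.
 */
final class TermFilterSketch {
    // Assumed limits, for illustration only.
    static final int MAX_DIGITS = 4;
    static final int MAX_SAME_CONSECUTIVE = 3;
    static final int MAX_TERM_LENGTH = 20;
    static final boolean DROP_LONG_TOKENS = true;

    /** Returns the term to keep, or null if the term should be discarded. */
    static String check(String term) {
        if (term == null || term.isEmpty()) {
            return null;
        }
        if (term.length() > MAX_TERM_LENGTH) {
            if (DROP_LONG_TOKENS) {
                return null;
            }
            // Assumption: when long tokens are not dropped, they are truncated.
            term = term.substring(0, MAX_TERM_LENGTH);
        }
        int digits = 0;      // digits seen so far in the term
        int run = 0;         // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return null;
            }
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(check("@terrier"));  // kept
        System.out.println(check("2024"));      // kept: exactly MAX_DIGITS digits
        System.out.println(check("12345"));     // discarded: too many digits
        System.out.println(check("goooooal"));  // discarded: run of five 'o'
    }
}
```

Under these assumed limits, a term with a long run of one repeated character or with many digits is rejected, which is the kind of noise such limits are typically meant to filter out of tweets.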
    • Constructor Detail

      • UTFTwitterTokeniser

        public UTFTwitterTokeniser()
    • Method Detail

      • tokenise

        public TokenStream tokenise(java.io.Reader reader)
        Description copied from class: Tokeniser
        Tokenises the text obtained from the specified reader.
        Specified by:
        tokenise in class Tokeniser
        Parameters:
        reader - Stream of text to be tokenised
        Returns:
        a TokenStream of the tokens found in the text.
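
To illustrate the contract of tokenise (text comes in through a java.io.Reader, tokens come out as a stream), here is a self-contained stand-in that uses only the JDK. It is not Terrier code: ReaderTokenStreamSketch is a hypothetical class, it assumes the returned stream can be consumed like an Iterator<String>, and its splitting rule (letters, digits, and '@' form a token) is a simplification chosen only so that mentions survive.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in: reads tokens lazily from a Reader, one per next() call. */
final class ReaderTokenStreamSketch implements Iterator<String> {
    private final Reader reader;
    private String nextToken;

    ReaderTokenStreamSketch(Reader reader) {
        this.reader = reader;
        this.nextToken = readToken();
    }

    /** Reads characters until a token is complete; returns null at end of input. */
    private String readToken() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                char ch = (char) c;
                // Assumed rule: letters, digits and '@' belong to a token.
                if (Character.isLetterOrDigit(ch) || ch == '@') {
                    sb.append(ch);
                } else if (sb.length() > 0) {
                    break; // a separator ends the current token
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    @Override
    public boolean hasNext() {
        return nextToken != null;
    }

    @Override
    public String next() {
        if (nextToken == null) {
            throw new NoSuchElementException();
        }
        String token = nextToken;
        nextToken = readToken();
        return token;
    }

    /** Convenience helper: collects all tokens of a string into a list. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new ReaderTokenStreamSketch(new StringReader(text));
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00e9 is a non-ASCII letter, included to show it is preserved.
        System.out.println(tokens("Thanks @terrier, caf\u00e9 was great!"));
    }
}
```

With the real class, the equivalent call would pass a Reader over the tweet text to tokenise and consume the returned TokenStream; consult the TokenStream documentation for its exact iteration methods.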