Class Stopwords

  • All Implemented Interfaces:
    TermPipeline

    public class Stopwords
    extends java.lang.Object
    implements TermPipeline
    Implements stopword removal as a TermPipeline object. The stopword list to load can be passed to the constructor or loaded from the stopwords.filename property. Note that, by default, this TermPipeline uses the system default encoding for the stopword list.
    Properties:
    • stopwords.filename - the stopword list to load. More than one stopword list can be specified, by comma-separating the filenames. The default is resource:/stopword-list.txt which is included in the terrier-core jar file.
    • stopwords.intern.terms - a Java optimisation for indexing: stopword terms are likely to appear extremely frequently in a Collection, so interning them in Java will save on GC costs during indexing.
    • stopwords.encoding - the encoding of the file containing the stopwords. If no encoding is set, the default system encoding is used.
    Author:
    Craig Macdonald
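
    The property lookups described above can be sketched as follows. This is only an illustration, assuming a plain java.util.Properties object in place of Terrier's own configuration handling. The class name StopwordsConfigSketch, its method names, and the "false" default for stopwords.intern.terms are hypothetical and are not taken from the Terrier source.

```java
import java.nio.charset.Charset;
import java.util.Properties;

final class StopwordsConfigSketch {
    // Default documented above for stopwords.filename.
    static final String DEFAULT_LIST = "resource:/stopword-list.txt";

    // Raw (possibly comma-separated) value of stopwords.filename.
    static String filenameProperty(Properties props) {
        return props.getProperty("stopwords.filename", DEFAULT_LIST);
    }

    // Assumption: interning is off unless the property is set to "true".
    static boolean internTerms(Properties props) {
        return Boolean.parseBoolean(props.getProperty("stopwords.intern.terms", "false"));
    }

    // Falls back to the system default encoding when no encoding is configured.
    static Charset encoding(Properties props) {
        String name = props.getProperty("stopwords.encoding");
        return name == null ? Charset.defaultCharset() : Charset.forName(name);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("stopwords.encoding", "UTF-8");
        System.out.println(filenameProperty(props));
        System.out.println(internTerms(props));
        System.out.println(encoding(props));
    }
}
```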
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static boolean INTERN_STOPWORDS  
      protected TermPipeline next
      The next component in the term pipeline.
      protected gnu.trove.THashSet<java.lang.String> stopWords
      The hashset that contains all the stop words.
    • Constructor Summary

      Constructors 
      Constructor Description
      Stopwords​(TermPipeline _next)
      Makes a new stopword term pipeline object.
      Stopwords​(TermPipeline _next, java.lang.String StopwordsFile)
      Makes a new stopword term pipeline object.
      Stopwords​(TermPipeline _next, java.lang.String[] StopwordsFiles)
      Makes a new stopword term pipeline object.
    • Method Summary

      Methods 
      Modifier and Type Method Description
      void clear()
      Clear all stopwords from this stopword list object.
      boolean isStopword​(java.lang.String t)
      Returns true if term t is a stopword.
      void loadStopwordsList​(java.lang.String stopwordsFilename)
      Loads the specified stopwords file.
      void loadStopwordsList​(java.lang.String[] StopwordsFiles)
      Loads the specified stopwords files.
      void processTerm​(java.lang.String t)
      Checks to see if term t is a stopword.
      boolean reset()
      This method implements the specific reset option needed to implement a query- or document-oriented policy.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • INTERN_STOPWORDS

        protected static final boolean INTERN_STOPWORDS
      • next

        protected final TermPipeline next
        The next component in the term pipeline.
      • stopWords

        protected final gnu.trove.THashSet<java.lang.String> stopWords
        The hashset that contains all the stop words.
    • Constructor Detail

      • Stopwords

        public Stopwords​(TermPipeline _next)
        Makes a new stopword term pipeline object. The stopwords file is loaded from the application setup file, under the property stopwords.filename.
        Parameters:
        _next - TermPipeline the next component in the term pipeline.
      • Stopwords

        public Stopwords​(TermPipeline _next,
                         java.lang.String StopwordsFile)
        Makes a new stopword term pipeline object. The stopwords file(s) are loaded from the filename parameter. If a filename is not absolute, it is assumed to be in TERRIER_SHARE. StopwordsFile is split on \s*,\s* if it contains a comma.
        Parameters:
        _next - TermPipeline the next component in the term pipeline
        StopwordsFile - The filename(s) of the file to use as the stopwords list. Split on comma, and passed to the (TermPipeline,String[]) constructor.
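
        The comma handling described for this constructor can be illustrated with a small stand-alone sketch. It uses only String.split from the JDK. The class and method names, and the file names english.txt and french.txt, are hypothetical.

```java
import java.util.Arrays;

final class StopwordFileSplitSketch {
    // Splits on \s*,\s* only when a comma is present, as the description states.
    static String[] splitFilenames(String stopwordsFile) {
        if (stopwordsFile.indexOf(',') >= 0) {
            return stopwordsFile.split("\\s*,\\s*");
        }
        return new String[] { stopwordsFile };
    }

    public static void main(String[] args) {
        // prints [english.txt, french.txt]
        System.out.println(Arrays.toString(splitFilenames("english.txt , french.txt")));
    }
}
```

        Note that the pattern only removes whitespace around the commas; leading or trailing whitespace on the whole string is left in place by this sketch.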
      • Stopwords

        public Stopwords​(TermPipeline _next,
                         java.lang.String[] StopwordsFiles)
        Makes a new stopword term pipeline object. The stopwords file(s) are loaded from the filenames array parameter. The non-existence of any file is not enough to stop the system. If a filename is not absolute, it is assumed to be in TERRIER_SHARE.
        Parameters:
        _next - TermPipeline the next component in the term pipeline
        StopwordsFiles - Array of filenames of stopword lists.
        Since:
        1.1.0
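
        One way to read several stopword lists without letting a missing file stop the system is sketched below. This is not the Terrier implementation: it assumes one stopword per line, reads files as UTF-8 rather than using a configured encoding, and uses java.util.HashSet instead of gnu.trove.THashSet. The class name TolerantStopwordLoaderSketch is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

final class TolerantStopwordLoaderSketch {
    // Loads every readable file; unreadable or missing files are reported and skipped.
    static Set<String> loadAll(Path... files) {
        Set<String> words = new HashSet<>();
        for (Path file : files) {
            try {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String word = line.trim();
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            } catch (IOException e) {
                // A missing file raises NoSuchFileException, which is an IOException.
                System.err.println("Skipping unreadable stopword list: " + file);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Path[] files = new Path[args.length];
        for (int i = 0; i < args.length; i++) {
            files[i] = Paths.get(args[i]);
        }
        System.out.println(loadAll(files).size() + " stopwords loaded");
    }
}
```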
    • Method Detail

      • loadStopwordsList

        public void loadStopwordsList​(java.lang.String[] StopwordsFiles)
        Loads the specified stopwords files. Calls loadStopwordsList(String).
        Parameters:
        StopwordsFiles - Array of filenames of stopword lists.
        Since:
        1.1.0
      • loadStopwordsList

        public void loadStopwordsList​(java.lang.String stopwordsFilename)
        Loads the specified stopwords file. Used internally by Stopwords(TermPipeline, String[]). If a stopword list filename is not absolute, it is taken to be relative to ApplicationSetup.TERRIER_SHARE.
        Parameters:
        stopwordsFilename - The filename of the file to use as the stopwords list.
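
        The handling of non-absolute filenames can be sketched with java.nio.file. The shareDir argument stands in for ApplicationSetup.TERRIER_SHARE, and the class and method names are hypothetical. The sketch covers plain file paths only and does not attempt to handle resource: locations such as the default list.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class StopwordPathSketch {
    // Absolute filenames are used as given; others are placed under shareDir.
    static Path resolve(String shareDir, String filename) {
        Path p = Paths.get(filename);
        return p.isAbsolute() ? p : Paths.get(shareDir).resolve(p);
    }

    public static void main(String[] args) {
        // "share" is an illustrative directory name, not a Terrier default.
        System.out.println(resolve("share", "stopword-list.txt"));
    }
}
```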
      • clear

        public void clear()
        Clear all stopwords from this stopword list object.
        Since:
        1.1.0
      • isStopword

        public boolean isStopword​(java.lang.String t)
        Returns true if term t is a stopword.
      • processTerm

        public void processTerm​(java.lang.String t)
        Checks to see if term t is a stopword. If so, the term is dropped and the TermPipeline is exited. Otherwise, the term is passed on to the next TermPipeline object. This is the TermPipeline implementation part of this object.
        Specified by:
        processTerm in interface TermPipeline
        Parameters:
        t - The term to be checked.
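
        The behaviour described for processTerm can be sketched without the Terrier classes. TermStage below is a hypothetical stand-in for TermPipeline, reduced to the single method needed here, and StopwordStage mirrors only the documented behaviour: stopwords are dropped, all other terms go to the next stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopwordStageSketch {
    // Hypothetical stand-in for TermPipeline (reset() omitted for brevity).
    interface TermStage {
        void processTerm(String t);
    }

    static final class StopwordStage implements TermStage {
        private final TermStage next;
        private final Set<String> stopWords;

        StopwordStage(TermStage next, Set<String> stopWords) {
            this.next = next;
            this.stopWords = stopWords;
        }

        boolean isStopword(String t) {
            return stopWords.contains(t);
        }

        @Override
        public void processTerm(String t) {
            if (isStopword(t)) {
                return; // stopword: the pipeline is exited for this term
            }
            next.processTerm(t); // anything else moves to the next stage
        }
    }

    // Runs terms through a StopwordStage whose next stage collects survivors.
    static List<String> filter(List<String> terms, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        TermStage stage = new StopwordStage(kept::add, stopWords);
        for (String t : terms) {
            stage.processTerm(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "of"));
        // prints [quick, fox]
        System.out.println(filter(Arrays.asList("the", "quick", "fox"), stop));
    }
}
```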
      • reset

        public boolean reset()
        This method implements the specific reset option needed to implement a query- or document-oriented policy.
        Specified by:
        reset in interface TermPipeline
        Returns:
        results of the operation