Class TaggedDocument

  • All Implemented Interfaces:
    Document

    public class TaggedDocument
    extends java.lang.Object
    implements Document
    Models a tagged document (e.g., an HTML or TREC document). In particular, getNextTerm() returns the next token in the current chunk of text, according to the specified tokeniser. This class uses the following properties:
    • tokeniser, the tokeniser class to be used (defaults to EnglishTokeniser);
    • max.term.length, the maximum length in characters of a term (defaults to 20);
    • lowercase, whether characters are transformed to lowercase (defaults to true).
    • TaggedDocument.abstracts - names of the abstracts to be saved for query-biased summarisation. Defaults to empty. Example: TaggedDocument.abstracts=title,abstract
    • TaggedDocument.abstracts.tags - names of tags to save text from for the purposes of query-biased summarisation. Example: TaggedDocument.abstracts.tags=title,body. ELSE is a special tag name, which means any text not consumed by other tags.
    • TaggedDocument.abstracts.lengths - max lengths of the abstracts. Defaults to empty. Example: TaggedDocument.abstracts.lengths=100,2048
    • TaggedDocument.abstracts.tags.casesensitive - should the names of tags be case-sensitive? Defaults to false.
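    The properties above can be illustrated with a minimal sketch that collects them in a java.util.Properties object. The class name TaggedDocumentConfigSketch and its methods are hypothetical, the values are examples only, and the idea that the three abstract lists are parallel (entry i of each describes abstract i) is an assumption inferred from the field descriptions on this page. How the properties actually reach TaggedDocument is not shown here.

```java
import java.util.Properties;

// Hypothetical helper, not part of Terrier: gathers the documented
// property names with example values.
class TaggedDocumentConfigSketch {

    static Properties exampleConfig() {
        Properties p = new Properties();
        p.setProperty("max.term.length", "20");   // documented default
        p.setProperty("lowercase", "true");       // documented default
        // Two abstracts: "title" is filled from <title>; "abstract" is
        // filled from the special ELSE tag (text no other tag consumed).
        p.setProperty("TaggedDocument.abstracts", "title,abstract");
        p.setProperty("TaggedDocument.abstracts.tags", "title,ELSE");
        p.setProperty("TaggedDocument.abstracts.lengths", "100,2048");
        p.setProperty("TaggedDocument.abstracts.tags.casesensitive", "false");
        return p;
    }

    // Assumption: the three comma-separated lists must have equal length.
    static boolean listsAligned(Properties p) {
        int names = p.getProperty("TaggedDocument.abstracts", "").split(",").length;
        int tags = p.getProperty("TaggedDocument.abstracts.tags", "").split(",").length;
        int lengths = p.getProperty("TaggedDocument.abstracts.lengths", "").split(",").length;
        return names == tags && tags == lengths;
    }
}
```

    A check such as listsAligned can catch a mismatched configuration early, before any document is parsed.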
    Since:
    3.5
    Author:
    Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected TagSet _exact
      The tags to process exactly.
      protected TagSet _fields
      The tags to consider as fields.
      protected TagSet _tags
      The tags to process or skip.
      protected int abstractCount
      number of abstract types
      protected int[] abstractlengths
      The maximum length of each named abstract (comma separated list)
      protected gnu.trove.TObjectIntHashMap<java.lang.String> abstractName2Index
      A mapping for quick lookup of abstract tag names
      protected java.lang.String[] abstractnames
      The names of the abstracts to be saved (comma separated list)
      protected java.lang.StringBuilder[] abstracts
      builders for each abstract
      protected java.lang.String[] abstracttags
      The fields that the named abstracts come from (comma separated list)
      protected boolean abstractTagsCaseSensitive  
      protected java.io.Reader br
      The input reader.
      protected boolean considerAbstracts
      Flag that determines whether the abstract generation method can be short-cut
      protected long counter
      The number of bytes read from the input.
      protected TokenStream currentTokenStream  
      protected int elseAbstractSpecialTag
      else field index
      protected boolean EOD
      End of Document.
      protected boolean error
      Indicates whether an error has occurred.
      protected java.util.Set<java.lang.String> htmlStk
      The hash set where the tags, considered as fields, are inserted.
      protected boolean inHtmlTagToProcess
      Specifies whether the tokeniser is in a field tag to process.
      protected boolean inTagToProcess
      Indicates whether we are in a tag to process.
      protected boolean inTagToSkip
      Indicates whether we are in a tag to skip.
      protected int lastChar
      Saves the last read character between consecutive calls of getNextTerm().
      protected static org.slf4j.Logger logger  
      protected static boolean lowercase
      Change to lowercase?
      protected static int maxNumOfDigitsPerTerm
      The maximum number of digits that are allowed in valid terms.
      protected static int maxNumOfSameConseqLettersPerTerm
      The maximum number of consecutive same letters or digits that are allowed in valid terms.
      protected java.util.Map<java.lang.String,​java.lang.String> properties  
      protected java.util.Stack<java.lang.String> stk
      The stack where the tags are pushed and popped accordingly.
      protected java.lang.String[] stringArray
      A temporary String array
      protected java.lang.StringBuilder sw  
      protected java.lang.StringBuilder tagNameSB  
      protected Tokeniser tokeniser  
      protected static int tokenMaximumLength
      The maximum length of a token in the check method.
    • Constructor Summary

      Constructors 
      Constructor Description
      TaggedDocument​(java.io.InputStream docStream, java.util.Map<java.lang.String,​java.lang.String> docProperties, Tokeniser _tokeniser)
      Constructs an instance of the class from the given input stream.
      TaggedDocument​(java.io.InputStream docStream, java.util.Map<java.lang.String,​java.lang.String> docProperties, Tokeniser _tokeniser, java.lang.String doctags, java.lang.String exactdoctags, java.lang.String fieldtags)
      Constructs an instance of the class from the given input stream.
      TaggedDocument​(java.io.Reader docReader, java.util.Map<java.lang.String,​java.lang.String> docProperties, Tokeniser _tokeniser)
      Constructs an instance of the class from the given reader object.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static java.lang.String check​(java.lang.String s)
      Checks that a term is shorter than the maximum allowed length and does not contain too many digits or too many consecutive identical digits or letters.
      static void dumpDocument​(Document d)
      Dumps a document to stdout
      boolean endOfDocument()
      Indicates whether the tokenizer has reached the end of the current document.
      static Document generateDocumentFromFile​(java.lang.String filename)
      Instantiates a TREC document from a file
      java.util.Map<java.lang.String,​java.lang.String> getAllProperties()
      Returns the underlying map of all the properties defined by this Document.
      java.util.Set<java.lang.String> getFields()
      Returns the fields in which the current term appears.
      java.lang.String getNextTerm()
      Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
      java.lang.String getProperty​(java.lang.String name)
      Allows access to a named property of the Document.
      java.io.Reader getReader()
      Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
      static void main​(java.lang.String[] args)
      Static method which dumps a document to System.out
      protected void processEndOfDocument()  
      protected void processEndOfTag​(java.lang.String tag)
      The encountered tag, which must be a closing tag, is matched against the tag on top of the stack.
      protected void saveToAbstract​(java.lang.String text, java.lang.String tag)
      This method takes the text parsed from a tag and then saves it to the abstract(s).
      void setProperty​(java.lang.String name, java.lang.String value)
      Allows a named property to be added to the Document.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
      • tokenMaximumLength

        protected static final int tokenMaximumLength
        The maximum length of a token in the check method.
      • lowercase

        protected static final boolean lowercase
        Change to lowercase?
      • stringArray

        protected final java.lang.String[] stringArray
        A temporary String array
      • br

        protected java.io.Reader br
        The input reader.
      • EOD

        protected boolean EOD
        End of Document. Set by the last couple of lines in getNextTerm()
      • counter

        protected long counter
        The number of bytes read from the input.
      • lastChar

        protected int lastChar
        Saves the last read character between consecutive calls of getNextTerm().
      • error

        protected boolean error
        Indicates whether an error has occurred.
      • _tags

        protected TagSet _tags
        The tags to process or skip.
      • _exact

        protected TagSet _exact
        The tags to process exactly. For these tags, the check() method is not applied.
      • _fields

        protected TagSet _fields
        The tags to consider as fields.
      • stk

        protected java.util.Stack<java.lang.String> stk
        The stack where the tags are pushed and popped accordingly.
      • inTagToProcess

        protected boolean inTagToProcess
        Indicates whether we are in a tag to process.
      • inTagToSkip

        protected boolean inTagToSkip
        Indicates whether we are in a tag to skip.
      • htmlStk

        protected java.util.Set<java.lang.String> htmlStk
        The hash set where the tags, considered as fields, are inserted.
      • inHtmlTagToProcess

        protected boolean inHtmlTagToProcess
        Specifies whether the tokeniser is in a field tag to process.
      • properties

        protected java.util.Map<java.lang.String,​java.lang.String> properties
      • currentTokenStream

        protected TokenStream currentTokenStream
      • abstractnames

        protected final java.lang.String[] abstractnames
        The names of the abstracts to be saved (comma separated list)
      • abstracttags

        protected final java.lang.String[] abstracttags
        The fields that the named abstracts come from (comma separated list)
      • abstractlengths

        protected final int[] abstractlengths
        The maximum length of each named abstract (comma separated list)
      • abstractTagsCaseSensitive

        protected final boolean abstractTagsCaseSensitive
      • abstractCount

        protected final int abstractCount
        number of abstract types
      • abstracts

        protected final java.lang.StringBuilder[] abstracts
        builders for each abstract
      • abstractName2Index

        protected final gnu.trove.TObjectIntHashMap<java.lang.String> abstractName2Index
        A mapping for quick lookup of abstract tag names
      • considerAbstracts

        protected final boolean considerAbstracts
        Flag that determines whether the abstract generation method can be short-cut
      • elseAbstractSpecialTag

        protected int elseAbstractSpecialTag
        else field index
      • sw

        protected final java.lang.StringBuilder sw
      • tagNameSB

        protected final java.lang.StringBuilder tagNameSB
      • maxNumOfDigitsPerTerm

        protected static final int maxNumOfDigitsPerTerm
        The maximum number of digits that are allowed in valid terms.
        See Also:
        Constant Field Values
      • maxNumOfSameConseqLettersPerTerm

        protected static final int maxNumOfSameConseqLettersPerTerm
        The maximum number of consecutive same letters or digits that are allowed in valid terms.
        See Also:
        Constant Field Values
    • Constructor Detail

      • TaggedDocument

        public TaggedDocument​(java.io.InputStream docStream,
                              java.util.Map<java.lang.String,​java.lang.String> docProperties,
                              Tokeniser _tokeniser)
        Constructs an instance of the class from the given input stream.
        Parameters:
        docStream -
        docProperties -
        _tokeniser -
      • TaggedDocument

        public TaggedDocument​(java.io.InputStream docStream,
                              java.util.Map<java.lang.String,​java.lang.String> docProperties,
                              Tokeniser _tokeniser,
                              java.lang.String doctags,
                              java.lang.String exactdoctags,
                              java.lang.String fieldtags)
        Constructs an instance of the class from the given input stream.
        Parameters:
        docStream -
        docProperties -
        _tokeniser -
        doctags -
        exactdoctags -
        fieldtags -
      • TaggedDocument

        public TaggedDocument​(java.io.Reader docReader,
                              java.util.Map<java.lang.String,​java.lang.String> docProperties,
                              Tokeniser _tokeniser)
        Constructs an instance of the class from the given reader object.
        Parameters:
        docReader - Reader over the stream from the collection; the stream ends at the end of the current document.
    • Method Detail

      • getReader

        public java.io.Reader getReader()
        Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
        Specified by:
        getReader in interface Document
      • getNextTerm

        public java.lang.String getNextTerm()
        Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
        Specified by:
        getNextTerm in interface Document
        Returns:
        String the next token of the document, or null if the token was discarded during tokenisation.
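        Because getNextTerm() may return null for a discarded token, callers typically loop until endOfDocument() and skip nulls. The sketch below shows that loop against a hypothetical TermSource interface, which mirrors only the two method signatures documented here so that the example runs without Terrier on the classpath. The fromTokens helper is also an illustration, not a Terrier API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in mirroring two methods of the Document interface.
interface TermSource {
    String getNextTerm();
    boolean endOfDocument();
}

class TermLoopSketch {

    // Reads every term up to the end of the document, skipping the
    // nulls that signal a token discarded during tokenisation.
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term != null) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Test double: replays a fixed token sequence, nulls included.
    static TermSource fromTokens(String... tokens) {
        Iterator<String> it = Arrays.asList(tokens).iterator();
        return new TermSource() {
            public String getNextTerm() { return it.hasNext() ? it.next() : null; }
            public boolean endOfDocument() { return !it.hasNext(); }
        };
    }
}
```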
      • processEndOfDocument

        protected void processEndOfDocument()
      • saveToAbstract

        protected void saveToAbstract​(java.lang.String text,
                                      java.lang.String tag)
        This method takes the text parsed from a tag and then saves it to the abstract(s). It contains the logic that decides whether the text, or some subset of it, should be saved. The default behaviour checks each abstract named in TaggedDocument.abstracts: if the current tag is the correct field for that abstract (as specified in TaggedDocument.abstracts.tags), the text is saved, up to the maximum character length specified in TaggedDocument.abstracts.lengths. The 'ELSE' abstract tag is a special case that is filled with text from any tag that is not added to an existing abstract. To save abstracts in a different manner, e.g. saving only the first paragraph, subclass TaggedDocument and override this method.
        Parameters:
        text - the text to be saved
        tag - the tag that this text came from
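        The default behaviour described above can be approximated by the following standalone sketch. It is not Terrier's implementation: the class AbstractSaverSketch and its get method are hypothetical, tag matching is assumed to be case-insensitive (the documented default), and details such as separators between saved chunks are not modelled.

```java
// Hypothetical illustration of the documented saving rules.
class AbstractSaverSketch {
    private final String[] names;       // cf. TaggedDocument.abstracts
    private final String[] tags;        // cf. TaggedDocument.abstracts.tags
    private final int[] maxLengths;     // cf. TaggedDocument.abstracts.lengths
    private final StringBuilder[] abstracts;
    private int elseIndex = -1;         // index of the ELSE abstract, if any

    AbstractSaverSketch(String[] names, String[] tags, int[] maxLengths) {
        this.names = names;
        this.tags = tags;
        this.maxLengths = maxLengths;
        this.abstracts = new StringBuilder[names.length];
        for (int i = 0; i < names.length; i++) {
            abstracts[i] = new StringBuilder();
            if ("ELSE".equals(tags[i])) {
                elseIndex = i;
            }
        }
    }

    void saveToAbstract(String text, String tag) {
        boolean consumed = false;
        for (int i = 0; i < names.length; i++) {
            if (i != elseIndex && tags[i].equalsIgnoreCase(tag)) {
                append(i, text);
                consumed = true;
            }
        }
        // Text that no named tag consumed goes to the ELSE abstract.
        if (!consumed && elseIndex >= 0) {
            append(elseIndex, text);
        }
    }

    // Appends as much of the text as the abstract's length limit allows.
    private void append(int i, String text) {
        int room = maxLengths[i] - abstracts[i].length();
        if (room > 0) {
            abstracts[i].append(text, 0, Math.min(room, text.length()));
        }
    }

    // Returns the saved abstract for a name, or null if the name is unknown.
    String get(String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) {
                return abstracts[i].toString();
            }
        }
        return null;
    }
}
```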
      • getFields

        public java.util.Set<java.lang.String> getFields()
        Returns the fields in which the current term appears.
        Specified by:
        getFields in interface Document
        Returns:
        HashSet a hashset containing the fields that the current term appears in.
      • endOfDocument

        public boolean endOfDocument()
        Indicates whether the tokenizer has reached the end of the current document.
        Specified by:
        endOfDocument in interface Document
        Returns:
        boolean true if the end of the current document has been reached, otherwise returns false.
      • processEndOfTag

        protected void processEndOfTag​(java.lang.String tag)
        The encountered tag, which must be a closing tag, is matched against the tag on top of the stack. If they are not the same, consistency is restored by popping tags from the stack, up to and including the observed tag. If the stack becomes empty after that, the end-of-document flag EOD is set to true.
        Parameters:
        tag - The closing tag to be tested against the content of the stack.
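        The stack repair described above can be sketched as follows. This is an illustration under assumptions, not the class's actual code: TagStackSketch is a hypothetical name, java.util.ArrayDeque is used in place of the java.util.Stack field for brevity, and ignoring a closing tag that is not on the stack at all is an assumed policy.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of matching closing tags against a tag stack.
class TagStackSketch {
    final Deque<String> stack = new ArrayDeque<>();
    boolean endOfDocument = false;

    void open(String tag) {
        stack.push(tag);
    }

    void close(String tag) {
        if (stack.isEmpty()) {
            return; // nothing to match against
        }
        if (stack.peek().equals(tag)) {
            stack.pop(); // well-nested case
        } else if (stack.contains(tag)) {
            // Mismatch: pop unclosed inner tags, the observed tag included.
            while (!stack.isEmpty()) {
                if (stack.pop().equals(tag)) {
                    break;
                }
            }
        }
        // Assumption: a closing tag absent from the stack changes nothing.
        if (stack.isEmpty()) {
            endOfDocument = true;
        }
    }
}
```

        For example, after opening DOC, TITLE and B, closing TITLE pops both B and TITLE, and a later close of DOC empties the stack and sets the end-of-document flag.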
      • check

        public static java.lang.String check​(java.lang.String s)
        Checks that a term is shorter than the maximum allowed length and does not contain too many digits or too many consecutive identical digits or letters.
        Parameters:
        s - String the term to check if it is valid.
        Returns:
        String the term if it is valid, otherwise it returns null.
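        A rough sketch of this kind of validity check is shown below. The thresholds are assumptions chosen for illustration: 20 echoes the documented default of max.term.length, while the digit and repeated-character limits are placeholders rather than the actual values of maxNumOfDigitsPerTerm and maxNumOfSameConseqLettersPerTerm. Whether the length bound is inclusive is also assumed.

```java
// Hypothetical illustration; limits are assumed, not taken from Terrier.
class TermCheckSketch {
    static final int MAX_LENGTH = 20;          // cf. max.term.length default
    static final int MAX_DIGITS = 4;           // assumed limit
    static final int MAX_SAME_CONSECUTIVE = 3; // assumed limit

    // Returns the term if it passes all checks, otherwise null.
    static String check(String s) {
        if (s == null || s.length() > MAX_LENGTH) {
            return null;
        }
        int digits = 0;
        int run = 0;
        char prev = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            // Length of the current run of identical characters.
            run = (i > 0 && c == prev) ? run + 1 : 1;
            prev = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return null;
            }
        }
        return s;
    }
}
```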
      • getProperty

        public java.lang.String getProperty​(java.lang.String name)
        Allows access to a named property of the Document. Examples might be URL, filename etc.
        Specified by:
        getProperty in interface Document
        Parameters:
        name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
        Since:
        1.1.0
      • setProperty

        public void setProperty​(java.lang.String name,
                                java.lang.String value)
        Allows a named property to be added to the Document. Examples might be URL, filename etc.
        Parameters:
        name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
        value - The value of the property
        Since:
        1.1.0
      • getAllProperties

        public java.util.Map<java.lang.String,​java.lang.String> getAllProperties()
        Returns the underlying map of all the properties defined by this Document.
        Specified by:
        getAllProperties in interface Document
        Since:
        1.1.0
      • main

        public static void main​(java.lang.String[] args)
        Static method which dumps a document to System.out
        Parameters:
        args - A filename to parse
      • generateDocumentFromFile

        public static Document generateDocumentFromFile​(java.lang.String filename)
        Instantiates a TREC document from a file
      • dumpDocument

        public static void dumpDocument​(Document d)
        Dumps a document to stdout
        Parameters:
        d - a Document object