java.lang.Object
- org.terrier.indexing.TaggedDocument

All Implemented Interfaces:

Document
```
public class TaggedDocument
extends java.lang.Object
implements Document
```
Models a tagged document (e.g., an HTML or TREC document). In particular, getNextTerm() returns the next token in the current chunk of text, according to the specified tokeniser. This class uses the following properties:
- tokeniser, the tokeniser class to be used (defaults to EnglishTokeniser);
- max.term.length, the maximum length in characters of a term (defaults to 20);
- lowercase, whether characters are transformed to lowercase (defaults to true);
- TaggedDocument.abstracts, the names of the abstracts to be saved for query-biased summarisation. Defaults to empty. Example: TaggedDocument.abstracts=title,abstract
- TaggedDocument.abstracts.tags, the names of the tags to save text from for the purposes of query-biased summarisation. Example: TaggedDocument.abstracts.tags=title,body. ELSE is a special tag name that matches any text not consumed by other tags.
- TaggedDocument.abstracts.lengths, the maximum lengths of the abstracts. Defaults to empty. Example: TaggedDocument.abstracts.lengths=100,2048
- TaggedDocument.abstracts.tags.casesensitive, whether the names of tags are case-sensitive. Defaults to false.
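The three abstract-related properties are comma-separated lists that line up position by position (abstract name, source tag, maximum length). The following is a minimal sketch of how such lists could be read into parallel arrays using only the standard library. The class and method names (AbstractConfigSketch, splitList, parseLengths) are hypothetical and are not part of Terrier; the actual parsing inside TaggedDocument may differ.

```java
import java.util.Arrays;
import java.util.Properties;

/** Illustrative only: reads the abstract properties into parallel arrays. */
class AbstractConfigSketch {

    /** Splits a comma-separated value; a null or empty value yields an empty array. */
    static String[] splitList(String value) {
        if (value == null || value.trim().isEmpty()) {
            return new String[0];
        }
        String[] parts = value.split(",");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    /** Parses the lengths list into one int per abstract. */
    static int[] parseLengths(String value) {
        String[] parts = splitList(value);
        int[] lengths = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            lengths[i] = Integer.parseInt(parts[i]);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Property names are taken from the class description; values are examples.
        Properties p = new Properties();
        p.setProperty("TaggedDocument.abstracts", "title,abstract");
        p.setProperty("TaggedDocument.abstracts.tags", "title,ELSE");
        p.setProperty("TaggedDocument.abstracts.lengths", "100,2048");

        String[] names = splitList(p.getProperty("TaggedDocument.abstracts", ""));
        String[] tags = splitList(p.getProperty("TaggedDocument.abstracts.tags", ""));
        int[] lengths = parseLengths(p.getProperty("TaggedDocument.abstracts.lengths", ""));

        System.out.println(Arrays.toString(names));
        System.out.println(Arrays.toString(tags));
        System.out.println(Arrays.toString(lengths));
    }
}
```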
Since:

3.5

Author:

Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos

Field Summary

Fields
Modifier and Type	Field	Description
`protected TagSet`	`_exact`	The tags to process exactly.
`protected TagSet`	`_fields`	The tags to consider as fields.
`protected TagSet`	`_tags`	The tags to process or skip.
`protected int`	`abstractCount`	Number of abstract types.
`protected int[]`	`abstractlengths`	The maximum length of each named abstract (comma separated list)
`protected gnu.trove.TObjectIntHashMap<java.lang.String>`	`abstractName2Index`	A mapping for quick lookup of abstract tag names
`protected java.lang.String[]`	`abstractnames`	The names of the abstracts to be saved (comma separated list)
`protected java.lang.StringBuilder[]`	`abstracts`	Builders for each abstract.
`protected java.lang.String[]`	`abstracttags`	The fields that the named abstracts come from (comma separated list)
`protected boolean`	`abstractTagsCaseSensitive`
`protected java.io.Reader`	`br`	The input reader.
`protected boolean`	`considerAbstracts`	Flag that determines whether to short-cut the abstract generation method.
`protected long`	`counter`	The number of bytes read from the input.
`protected TokenStream`	`currentTokenStream`
`protected int`	`elseAbstractSpecialTag`	Index of the ELSE field.
`protected boolean`	`EOD`	End of Document.
`protected boolean`	`error`	Indicates whether an error has occurred.
`protected java.util.Set<java.lang.String>`	`htmlStk`	The hash set where the tags, considered as fields, are inserted.
`protected boolean`	`inHtmlTagToProcess`	Specifies whether the tokeniser is in a field tag to process.
`protected boolean`	`inTagToProcess`	Indicates whether we are in a tag to process.
`protected boolean`	`inTagToSkip`	Indicates whether we are in a tag to skip.
`protected int`	`lastChar`	Saves the last read character between consecutive calls of getNextTerm().
`protected static org.slf4j.Logger`	`logger`
`protected static boolean`	`lowercase`	Change to lowercase?
`protected static int`	`maxNumOfDigitsPerTerm`	The maximum number of digits that are allowed in valid terms.
`protected static int`	`maxNumOfSameConseqLettersPerTerm`	The maximum number of consecutive same letters or digits that are allowed in valid terms.
`protected java.util.Map<java.lang.String,java.lang.String>`	`properties`
`protected java.util.Stack<java.lang.String>`	`stk`	The stack where the tags are pushed and popped accordingly.
`protected java.lang.String[]`	`stringArray`	A temporary String array
`protected java.lang.StringBuilder`	`sw`
`protected java.lang.StringBuilder`	`tagNameSB`
`protected Tokeniser`	`tokeniser`
`protected static int`	`tokenMaximumLength`	The maximum length of a token in the check method.

Constructor Summary

Constructors
Constructor	Description
`TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)`	Constructs an instance of the class from the given input stream.
`TaggedDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser, java.lang.String doctags, java.lang.String exactdoctags, java.lang.String fieldtags)`	Constructs an instance of the class from the given input stream.
`TaggedDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser _tokeniser)`	Constructs an instance of the class from the given reader object.

Method Summary

Modifier and Type	Method	Description
`static java.lang.String`	`check(java.lang.String s)`	Checks whether a term is shorter than the maximum allowed length, and that it does not contain too many numerical digits or too many consecutive identical digits or letters.
`static void`	`dumpDocument(Document d)`	Dumps a document to stdout
`boolean`	`endOfDocument()`	Indicates whether the tokenizer has reached the end of the current document.
`static Document`	`generateDocumentFromFile(java.lang.String filename)`	Instantiates a TREC document from a file.
`java.util.Map<java.lang.String,java.lang.String>`	`getAllProperties()`	Returns the underlying map of all the properties defined by this Document.
`java.util.Set<java.lang.String>`	`getFields()`	Returns the fields in which the current term appears.
`java.lang.String`	`getNextTerm()`	Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
`java.lang.String`	`getProperty(java.lang.String name)`	Allows access to a named property of the Document.
`java.io.Reader`	`getReader()`	Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
`static void`	`main(java.lang.String[] args)`	Static method which dumps a document to System.out
`protected void`	`processEndOfDocument()`
`protected void`	`processEndOfTag(java.lang.String tag)`	The encountered tag, which must be a closing tag, is matched with the tag on the stack.
`protected void`	`saveToAbstract(java.lang.String text, java.lang.String tag)`	This method takes the text parsed from a tag and then saves it to the abstract(s).
`void`	`setProperty(java.lang.String name, java.lang.String value)`	Allows a named property to be added to the Document.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - logger
```
protected static final org.slf4j.Logger logger
```
  - tokenMaximumLength
```
protected static final int tokenMaximumLength
```
    The maximum length of a token in the check method.
  - lowercase
```
protected static final boolean lowercase
```
    Change to lowercase?
  - stringArray
```
protected final java.lang.String[] stringArray
```
    A temporary String array
  - br
```
protected java.io.Reader br
```
    The input reader.
  - EOD
```
protected boolean EOD
```
    End of Document. Set by the last couple of lines in getNextTerm()
  - counter
```
protected long counter
```
    The number of bytes read from the input.
  - lastChar
```
protected int lastChar
```
    Saves the last read character between consecutive calls of getNextTerm().
  - error
```
protected boolean error
```
    Indicates whether an error has occurred.
  - _tags
```
protected TagSet _tags
```
    The tags to process or skip.
  - _exact
```
protected TagSet _exact
```
    The tags to process exactly. For these tags, the check() method is not applied.
  - _fields
```
protected TagSet _fields
```
    The tags to consider as fields.
  - stk
```
protected java.util.Stack<java.lang.String> stk
```
    The stack where the tags are pushed and popped accordingly.
  - inTagToProcess
```
protected boolean inTagToProcess
```
    Indicates whether we are in a tag to process.
  - inTagToSkip
```
protected boolean inTagToSkip
```
    Indicates whether we are in a tag to skip.
  - htmlStk
```
protected java.util.Set<java.lang.String> htmlStk
```
    The hash set where the tags, considered as fields, are inserted.
  - inHtmlTagToProcess
```
protected boolean inHtmlTagToProcess
```
    Specifies whether the tokeniser is in a field tag to process.
  - properties
```
protected java.util.Map<java.lang.String,java.lang.String> properties
```
  - tokeniser
```
protected Tokeniser tokeniser
```
  - currentTokenStream
```
protected TokenStream currentTokenStream
```
  - abstractnames
```
protected final java.lang.String[] abstractnames
```
    The names of the abstracts to be saved (comma separated list)
  - abstracttags
```
protected final java.lang.String[] abstracttags
```
    The fields that the named abstracts come from (comma separated list)
  - abstractlengths
```
protected final int[] abstractlengths
```
    The maximum length of each named abstract (comma separated list)
  - abstractTagsCaseSensitive
```
protected final boolean abstractTagsCaseSensitive
```
  - abstractCount
```
protected final int abstractCount
```
    Number of abstract types.
  - abstracts
```
protected final java.lang.StringBuilder[] abstracts
```
    Builders for each abstract.
  - abstractName2Index
```
protected final gnu.trove.TObjectIntHashMap<java.lang.String> abstractName2Index
```
    A mapping for quick lookup of abstract tag names
  - considerAbstracts
```
protected final boolean considerAbstracts
```
    Flag that determines whether to short-cut the abstract generation method.
  - elseAbstractSpecialTag
```
protected int elseAbstractSpecialTag
```
    Index of the ELSE field.
  - sw
```
protected final java.lang.StringBuilder sw
```
  - tagNameSB
```
protected final java.lang.StringBuilder tagNameSB
```
  - maxNumOfDigitsPerTerm
```
protected static final int maxNumOfDigitsPerTerm
```
    The maximum number of digits that are allowed in valid terms.
    
    See Also:
    
    Constant Field Values
  - maxNumOfSameConseqLettersPerTerm
```
protected static final int maxNumOfSameConseqLettersPerTerm
```
    The maximum number of consecutive same letters or digits that are allowed in valid terms.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - TaggedDocument
```
public TaggedDocument(java.io.InputStream docStream,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser _tokeniser)
```
    Constructs an instance of the class from the given input stream.
    
    Parameters:
    
    docStream -
    
    docProperties -
    
    _tokeniser -
  - TaggedDocument
```
public TaggedDocument(java.io.InputStream docStream,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser _tokeniser,
                      java.lang.String doctags,
                      java.lang.String exactdoctags,
                      java.lang.String fieldtags)
```
    Constructs an instance of the class from the given input stream.
    
    Parameters:
    
    docStream -
    
    docProperties -
    
    _tokeniser -
    
    doctags -
    
    exactdoctags -
    
    fieldtags -
  - TaggedDocument
```
public TaggedDocument(java.io.Reader docReader,
                      java.util.Map<java.lang.String,java.lang.String> docProperties,
                      Tokeniser _tokeniser)
```
    Constructs an instance of the class from the given reader object.
    
    Parameters:
    
    docReader - Reader the stream from the collection that ends at the end of the current document.
- Method Detail
  - getReader
```
public java.io.Reader getReader()
```
    Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
    
    Specified by:
    
    getReader in interface Document
  - getNextTerm
```
public java.lang.String getNextTerm()
```
    Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
    
    Specified by:
    
    getNextTerm in interface Document
    
    Returns:
    
    String the next token of the document, or null if the token was discarded during tokenisation.
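    
    Because getNextTerm() may return null for a discarded token, a consuming loop typically checks endOfDocument() and skips null terms. The sketch below illustrates that pattern. TermSource is a hypothetical stand-in that mirrors only the two Document methods used here, and fromList is a hypothetical in-memory source for demonstration; neither is part of Terrier.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: the usual consume-until-end-of-document loop. */
class TermLoopSketch {

    /** Stand-in for the two Document methods used here; NOT the Terrier interface. */
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Collects every non-null term until the source reports end of document. */
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // the token was discarded during tokenisation
            }
            terms.add(term);
        }
        return terms;
    }

    /** Hypothetical in-memory source; null entries model discarded tokens. */
    static TermSource fromList(List<String> tokens) {
        final Iterator<String> it = tokens.iterator();
        return new TermSource() {
            @Override
            public String getNextTerm() {
                return it.hasNext() ? it.next() : null;
            }

            @Override
            public boolean endOfDocument() {
                return !it.hasNext();
            }
        };
    }
}
```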
  - processEndOfDocument
```
protected void processEndOfDocument()
```
  - saveToAbstract
```
protected void saveToAbstract(java.lang.String text,
                              java.lang.String tag)
```
    Takes the text parsed from a tag and saves it to the abstract(s). This method contains the logic that decides whether the text, or some subset of it, should be saved. The default behaviour checks each abstract named in TaggedDocument.abstracts: if the text comes from the tag configured for that abstract (specified in TaggedDocument.abstracts.tags), it saves up to the maximum character length specified in TaggedDocument.abstracts.lengths. The 'ELSE' abstract tag is a special case that is filled with text from any tag that is not added to another abstract. Subclass TaggedDocument and override this method if you want to save abstracts in a different manner, e.g. saving the first paragraph.
    
    Parameters:
    
    text - the text to be saved
    
    tag - the tag that this text came from
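    
    The default behaviour described above can be sketched with plain StringBuilders, as follows. This is an illustration under stated assumptions, not the Terrier implementation: the class AbstractSaverSketch and its getAbstract accessor are hypothetical, tag names are assumed to be compared case-insensitively (the documented default), and text is appended with no separator.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: routes tag text to length-limited abstracts. */
class AbstractSaverSketch {
    static final String ELSE_TAG = "ELSE";

    private final String[] names;
    private final int[] maxLengths;
    private final StringBuilder[] builders;
    private final Map<String, Integer> tagToIndex = new HashMap<>();
    private int elseIndex = -1; // index of the ELSE abstract, or -1 if none

    AbstractSaverSketch(String[] names, String[] tags, int[] maxLengths) {
        this.names = names;
        this.maxLengths = maxLengths;
        this.builders = new StringBuilder[names.length];
        for (int i = 0; i < names.length; i++) {
            builders[i] = new StringBuilder();
            if (ELSE_TAG.equals(tags[i])) {
                elseIndex = i;
            } else {
                // Assumption: case-insensitive tag matching.
                tagToIndex.put(tags[i].toLowerCase(Locale.ROOT), i);
            }
        }
    }

    /** Appends text to the abstract for this tag, or to ELSE if no abstract claims it. */
    void saveToAbstract(String text, String tag) {
        Integer index = tagToIndex.get(tag.toLowerCase(Locale.ROOT));
        int target = (index != null) ? index : elseIndex;
        if (target < 0) {
            return; // no abstract wants this text
        }
        StringBuilder sb = builders[target];
        int room = maxLengths[target] - sb.length();
        if (room <= 0) {
            return; // this abstract is already full
        }
        sb.append(text, 0, Math.min(room, text.length()));
    }

    /** Returns the text saved so far for the named abstract, or null if unknown. */
    String getAbstract(String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) {
                return builders[i].toString();
            }
        }
        return null;
    }
}
```

    A subclass that wanted, say, only the first paragraph would replace the body of saveToAbstract with its own selection rule while keeping the same signature.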
  - getFields
```
public java.util.Set<java.lang.String> getFields()
```
    Returns the fields in which the current term appears.
    
    Specified by:
    
    getFields in interface Document
    
    Returns:
    
    HashSet a hashset containing the fields that the current term appears in.
  - endOfDocument
```
public boolean endOfDocument()
```
    Indicates whether the tokenizer has reached the end of the current document.
    
    Specified by:
    
    endOfDocument in interface Document
    
    Returns:
    
    boolean true if the end of the current document has been reached, otherwise returns false.
  - processEndOfTag
```
protected void processEndOfTag(java.lang.String tag)
```
    The encountered tag, which must be a closing tag, is matched with the tag on the stack. If they are not the same, consistency is restored by popping tags from the stack, up to and including the observed tag. If the stack becomes empty after that, the end-of-document flag EOD is set to true.
    
    Parameters:
    
    tag - The closing tag to be tested against the content of the stack.
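    
    The stack repair described above can be sketched as follows. This is an illustration, not the Terrier source: the class TagStackSketch and its helper methods are hypothetical, and the handling of a closing tag that has no open counterpart (ignored here) is an assumption that the description does not cover.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: keeps a tag stack consistent when closing tags arrive. */
class TagStackSketch {
    private final Deque<String> stack = new ArrayDeque<>();
    private boolean endOfDocument = false;

    /** Records an opening tag. */
    void processStartOfTag(String tag) {
        stack.push(tag);
    }

    /** Pops tags up to and including the observed closing tag. */
    void processEndOfTag(String tag) {
        if (!stack.contains(tag)) {
            return; // assumption: a stray closing tag is ignored
        }
        String popped;
        do {
            popped = stack.pop(); // discards unclosed inner tags as well
        } while (!popped.equals(tag));
        if (stack.isEmpty()) {
            endOfDocument = true; // the outermost tag has been closed
        }
    }

    boolean endOfDocument() {
        return endOfDocument;
    }

    int depth() {
        return stack.size();
    }
}
```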
  - check
```
public static java.lang.String check(java.lang.String s)
```
    Checks whether a term is shorter than the maximum allowed length, and that it does not contain too many numerical digits or too many consecutive identical digits or letters.
    
    Parameters:
    
    s - String the term to check if it is valid.
    
    Returns:
    
    String the term if it is valid, otherwise it returns null.
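    
    A validity check of this kind can be sketched as below. The limits are assumptions chosen for illustration: 20 matches the documented default of max.term.length, while the digit limit (4) and the run limit (3) are hypothetical stand-ins for maxNumOfDigitsPerTerm and maxNumOfSameConseqLettersPerTerm, whose actual values are not stated here. Whether the length bound is inclusive is also an assumption.

```java
/** Illustrative only: rejects over-long, digit-heavy, or repetitive terms. */
class TermCheckSketch {
    static final int MAX_LENGTH = 20;          // documented default of max.term.length
    static final int MAX_DIGITS = 4;           // assumed value
    static final int MAX_SAME_CONSECUTIVE = 3; // assumed value

    /** Returns the term if it passes all three checks, otherwise null. */
    static String check(String s) {
        if (s == null || s.isEmpty() || s.length() > MAX_LENGTH) {
            return null;
        }
        int digits = 0;    // total digits seen so far
        int run = 0;       // length of the current run of identical characters
        char previous = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                digits++;
            }
            run = (i > 0 && c == previous) ? run + 1 : 1;
            previous = c;
            if (digits > MAX_DIGITS || run > MAX_SAME_CONSECUTIVE) {
                return null;
            }
        }
        return s;
    }
}
```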
  - getProperty
```
public java.lang.String getProperty(java.lang.String name)
```
    Allows access to a named property of the Document. Examples might be URL, filename etc.
    
    Specified by:
    
    getProperty in interface Document
    
    Parameters:
    
    name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
    
    Since:
    
    1.1.0
  - setProperty
```
public void setProperty(java.lang.String name,
                        java.lang.String value)
```
    Allows a named property to be added to the Document. Examples might be URL, filename etc.
    
    Parameters:
    
    name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
    
    value - The value of the property
    
    Since:
    
    1.1.0
  - getAllProperties
```
public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
```
    Returns the underlying map of all the properties defined by this Document.
    
    Specified by:
    
    getAllProperties in interface Document
    
    Since:
    
    1.1.0
  - main
```
public static void main(java.lang.String[] args)
```
    Static method which dumps a document to System.out
    
    Parameters:
    
    args - A filename to parse
  - generateDocumentFromFile
```
public static Document generateDocumentFromFile(java.lang.String filename)
```
    Instantiates a TREC document from a file.
  - dumpDocument
```
public static void dumpDocument(Document d)
```
    Dumps a document to stdout
    
    Parameters:
    
    d - a Document object
