Package org.terrier.indexing
Class FileDocument
- java.lang.Object
-
- org.terrier.indexing.FileDocument
-
- All Implemented Interfaces:
Document
- Direct Known Subclasses:
PDFDocument, POIDocument
public class FileDocument extends java.lang.Object implements Document
Models a document which corresponds to one file. The first FileDocument.abstract.length characters can be saved as an abstract.
- Author:
- Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos
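To illustrate the idea of saving the first characters of a file as an abstract, here is a minimal sketch that uses only the JDK. The class AbstractCapturingReader, its maxLength parameter and the abstractOf helper are hypothetical names introduced for this example; they are not part of Terrier, and the real FileDocument.ReaderWrapper operates on the token stream rather than on raw characters.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in, NOT Terrier's FileDocument.ReaderWrapper:
// passes characters through unchanged while keeping a copy of the first maxLength of them.
class AbstractCapturingReader extends FilterReader {
    private final int maxLength; // plays the role of the abstract length limit
    private final StringBuilder captured = new StringBuilder();

    AbstractCapturingReader(Reader in, int maxLength) {
        super(in);
        this.maxLength = maxLength;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1 && captured.length() < maxLength) {
            captured.append((char) c);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            int room = maxLength - captured.length();
            if (room > 0) {
                captured.append(buf, off, Math.min(n, room));
            }
        }
        return n;
    }

    // The characters captured so far (at most maxLength of them).
    String getAbstract() {
        return captured.toString();
    }

    // Convenience helper for the example: drain the text and return its abstract.
    static String abstractOf(String text, int maxLength) {
        try (AbstractCapturingReader r =
                 new AbstractCapturingReader(new StringReader(text), maxLength)) {
            char[] buf = new char[8];
            while (r.read(buf, 0, buf.length) != -1) {
                // keep reading until the end of the input
            }
            return r.getAbstract();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Capturing while reading means the abstract is collected in the same pass as tokenisation, so the file does not need to be read twice.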
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
class | FileDocument.ReaderWrapper | A wrapper around the token stream used to lift the terms from the stream for storage in the abstract
-
Field Summary
Fields
Modifier and Type | Field | Description
protected int | abstractlength | The maximum length of each named abstract (comma separated list)
protected java.lang.String | abstractname | The names of the abstracts to be saved (comma separated list)
protected int | abstractwritten | The number of characters currently written
protected java.io.Reader | br | The input reader.
protected boolean | EOD | End of Document.
protected java.lang.String | filename | The name of the file represented by this document.
protected java.util.Map<java.lang.String,java.lang.String> | fileProperties | The properties of this document.
protected static org.slf4j.Logger | logger |
protected TokenStream | tokenStream |
-
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | FileDocument() |
public | FileDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok) | Constructs an instance of the FileDocument from the given input stream.
public | FileDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok) | Create a document for a file.
public | FileDocument(java.lang.String _filename, java.io.InputStream docStream, Tokeniser tok) | Create a document for a file.
public | FileDocument(java.lang.String _filename, java.io.Reader docReader, Tokeniser tok) | Create a document for a file.
-
Method Summary
Modifier and Type | Method | Description
boolean | endOfDocument() | Indicates whether the end of a document has been reached.
java.util.Map<java.lang.String,java.lang.String> | getAllProperties() | Returns the underlying map of all the properties defined by this Document.
java.util.Set<java.lang.String> | getFields() | Returns null because there is no support for fields with file documents.
java.lang.String | getNextTerm() | Gets the next term from the Document.
java.lang.String | getProperty(java.lang.String name) | Get a document property.
java.io.Reader | getReader() | Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
protected java.io.Reader | getReader(java.io.InputStream docStream) | Returns a buffered reader that encapsulates the given input stream.
protected static java.util.Map<java.lang.String,java.lang.String> | makeFilenameProperties(java.lang.String filename) |
void | setProperty(java.lang.String name, java.lang.String value) | Set a document property.
-
-
-
Field Detail
-
logger
protected static final org.slf4j.Logger logger
-
br
protected java.io.Reader br
The input reader.
-
EOD
protected boolean EOD
End of Document. Set by the last couple of lines in getNextTerm()
-
fileProperties
protected java.util.Map<java.lang.String,java.lang.String> fileProperties
The properties of this document, as a map from property name to value.
-
filename
protected java.lang.String filename
The name of the file represented by this document.
-
tokenStream
protected TokenStream tokenStream
-
abstractname
protected final java.lang.String abstractname
The names of the abstracts to be saved (comma separated list)
-
abstractlength
protected final int abstractlength
The maximum length of each named abstract (comma separated list)
-
abstractwritten
protected int abstractwritten
The number of characters currently written
-
-
Constructor Detail
-
FileDocument
protected FileDocument()
-
FileDocument
public FileDocument(java.lang.String _filename, java.io.Reader docReader, Tokeniser tok)
Create a document for a file.
- Parameters:
_filename-
docReader-
tok-
-
FileDocument
public FileDocument(java.lang.String _filename, java.io.InputStream docStream, Tokeniser tok)
Create a document for a file.
- Parameters:
_filename-
docStream-
tok-
-
FileDocument
public FileDocument(java.io.Reader docReader, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Create a document for a file.
- Parameters:
docReader-
docProperties-
tok-
-
FileDocument
public FileDocument(java.io.InputStream docStream, java.util.Map<java.lang.String,java.lang.String> docProperties, Tokeniser tok)
Constructs an instance of the FileDocument from the given input stream.
- Parameters:
docStream- the input stream that reads the file.
-
-
Method Detail
-
makeFilenameProperties
protected static java.util.Map<java.lang.String,java.lang.String> makeFilenameProperties(java.lang.String filename)
-
getReader
public java.io.Reader getReader()
Returns the underlying buffered reader, so that client code can tokenise the document itself, and deal with it how it likes.
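As a rough illustration of what "tokenising the document itself" might look like once client code holds a java.io.Reader, the sketch below splits the text on runs of non-alphanumeric characters. The class ReaderTokeniseSketch and its tokenise method are hypothetical, and the lowercasing and splitting rules are assumptions made for the example; they are not the behaviour of Terrier's Tokeniser implementations.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side tokeniser working on any java.io.Reader,
// such as the one a document's getReader() would hand back.
class ReaderTokeniseSketch {
    static List<String> tokenise(Reader reader) {
        List<String> terms = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Split on anything that is not a letter or a digit (an assumed rule).
                for (String t : line.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                    if (!t.isEmpty()) {
                        terms.add(t);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }
}
```

Note that code which reads from the reader directly consumes the same underlying input, so it should not be mixed with calls to getNextTerm() on the same document.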
-
getReader
protected java.io.Reader getReader(java.io.InputStream docStream)
Returns a buffered reader that encapsulates the given input stream.
- Parameters:
docStream- an input stream that we want to access as a buffered reader.
- Returns:
- the buffered reader that encapsulates the given input stream.
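Wrapping an input stream in a buffered reader is a standard JDK pattern, sketched below. The class BufferedReaderSketch and its methods are hypothetical names, and the choice of UTF-8 is an assumption for the example; this page does not say which character encoding FileDocument uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the InputStream -> buffered Reader pattern.
class BufferedReaderSketch {
    // Decode bytes as UTF-8 (an assumption) and buffer the resulting characters.
    static Reader toBufferedReader(InputStream docStream) {
        return new BufferedReader(new InputStreamReader(docStream, StandardCharsets.UTF_8));
    }

    // Read everything from the reader into a String, closing it afterwards.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Because the method is protected, a subclass could presumably override it to decode the stream differently, for example with another character set.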
-
getNextTerm
public java.lang.String getNextTerm()
Gets the next term from the Document.
- Specified by:
getNextTerm in interface Document
- Returns:
- String the next term of the document. Null returns should be ignored.
-
getFields
public java.util.Set<java.lang.String> getFields()
Returns null because there is no support for fields with file documents.
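Since getFields() returns null rather than an empty set, callers that iterate over fields need a null check. A minimal sketch of one way to normalise the result follows; FieldsSketch and fieldsOrEmpty are hypothetical names that are not part of Terrier.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical helper: treat a null field set as "no fields".
class FieldsSketch {
    static Set<String> fieldsOrEmpty(Set<String> fields) {
        return fields == null ? Collections.<String>emptySet() : fields;
    }
}
```

A caller could then write a loop such as for (String f : FieldsSketch.fieldsOrEmpty(doc.getFields())) without special-casing file documents.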
-
endOfDocument
public boolean endOfDocument()
Indicates whether the end of a document has been reached.
- Specified by:
endOfDocument in interface Document
- Returns:
- boolean true if the end of a document has been reached, otherwise, it returns false.
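The contract of getNextTerm() and endOfDocument() suggests a simple consumption loop: keep asking for terms until the end of the document, skipping null returns. The sketch below demonstrates that loop against a hypothetical TermSource interface and a toy array-backed source, both introduced here so that the example runs without Terrier on the classpath; with Terrier, the same loop would be written against org.terrier.indexing.Document.

```java
import java.util.ArrayList;
import java.util.List;

class TermLoopSketch {
    // Hypothetical stand-in for the two Document methods used by the loop.
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    // Toy source for the example: yields the array elements in order, nulls included.
    static TermSource fromArray(final String... terms) {
        return new TermSource() {
            private int pos = 0;

            @Override
            public String getNextTerm() {
                return pos < terms.length ? terms[pos++] : null;
            }

            @Override
            public boolean endOfDocument() {
                return pos >= terms.length;
            }
        };
    }

    // Collect every non-null term until the source reports end of document.
    static List<String> collectTerms(TermSource doc) {
        List<String> out = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // null returns should be ignored
            }
            out.add(term);
        }
        return out;
    }
}
```

Checking endOfDocument() rather than treating null as a terminator matters here, because a null term does not by itself mean the document is finished.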
-
getProperty
public java.lang.String getProperty(java.lang.String name)
Get a document property.
- Specified by:
getProperty in interface Document
- Parameters:
name- Name of the property. It is suggested, but not required, that this name should not be case insensitive.
-
setProperty
public void setProperty(java.lang.String name, java.lang.String value)
Set a document property.
-
getAllProperties
public java.util.Map<java.lang.String,java.lang.String> getAllProperties()
Returns the underlying map of all the properties defined by this Document.
- Specified by:
getAllProperties in interface Document
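Because getAllProperties() is documented as returning the underlying map rather than a copy, later calls to setProperty() should be visible through a map obtained earlier. The sketch below shows that relationship with a hypothetical map-backed class, PropertyStoreSketch; the use of a HashMap and the "filename" key are assumptions for illustration, not details taken from FileDocument.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical map-backed property store mirroring the three property methods.
class PropertyStoreSketch {
    private final Map<String, String> properties = new HashMap<>();

    String getProperty(String name) {
        return properties.get(name); // null when the property is not set
    }

    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    // Returns the live map itself, not a defensive copy.
    Map<String, String> getAllProperties() {
        return properties;
    }
}
```

Returning the live map avoids copying, at the cost that callers can also modify the document's properties through it.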
-
-