Class FileDocument

  • All Implemented Interfaces:
    Document
    Direct Known Subclasses:
    PDFDocument, POIDocument

    public class FileDocument
    extends java.lang.Object
    implements Document
    Models a document which corresponds to one file. The first FileDocument.abstract.length characters can be saved as an abstract.
    Author:
    Craig Macdonald, Vassilis Plachouras, Richard McCreadie, Rodrygo Santos
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      class  FileDocument.ReaderWrapper
      A wrapper around the token stream, used to lift the terms from the stream for storage in the abstract.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected int abstractlength
      The maximum length of each named abstract (comma separated list)
      protected java.lang.String abstractname
      The names of the abstracts to be saved (comma separated list)
      protected int abstractwritten
      The number of characters currently written to the abstract.
      protected java.io.Reader br
      The input reader.
      protected boolean EOD
      End of Document.
      protected java.lang.String filename
      The name of the file represented by this document.
      protected java.util.Map<java.lang.String,​java.lang.String> fileProperties
      The map of properties associated with this document.
      protected static org.slf4j.Logger logger  
      protected TokenStream tokenStream  
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected FileDocument()  
        FileDocument​(java.io.InputStream docStream, java.util.Map<java.lang.String,​java.lang.String> docProperties, Tokeniser tok)
      Constructs an instance of the FileDocument from the given input stream.
        FileDocument​(java.io.Reader docReader, java.util.Map<java.lang.String,​java.lang.String> docProperties, Tokeniser tok)
      Creates a document for a file.
        FileDocument​(java.lang.String _filename, java.io.InputStream docStream, Tokeniser tok)
      Creates a document for a file.
        FileDocument​(java.lang.String _filename, java.io.Reader docReader, Tokeniser tok)
      Creates a document for a file.
    • Method Summary

      Methods 
      Modifier and Type Method Description
      boolean endOfDocument()
      Indicates whether the end of a document has been reached.
      java.util.Map<java.lang.String,​java.lang.String> getAllProperties()
      Returns the underlying map of all the properties defined by this Document.
      java.util.Set<java.lang.String> getFields()
      Returns null because there is no support for fields with file documents.
      java.lang.String getNextTerm()
      Gets the next term from the Document.
      java.lang.String getProperty​(java.lang.String name)
      Gets a document property.
      java.io.Reader getReader()
      Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.
      protected java.io.Reader getReader​(java.io.InputStream docStream)
      Returns a buffered reader that encapsulates the given input stream.
      protected static java.util.Map<java.lang.String,​java.lang.String> makeFilenameProperties​(java.lang.String filename)  
      void setProperty​(java.lang.String name, java.lang.String value)
      Sets a document property.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
      • br

        protected java.io.Reader br
        The input reader.
      • EOD

        protected boolean EOD
        End of Document. Set by the last couple of lines in getNextTerm().
      • fileProperties

        protected java.util.Map<java.lang.String,​java.lang.String> fileProperties
        The map of properties associated with this document.
      • filename

        protected java.lang.String filename
        The name of the file represented by this document.
      • abstractname

        protected final java.lang.String abstractname
        The names of the abstracts to be saved (comma separated list)
      • abstractlength

        protected final int abstractlength
        The maximum length of each named abstract (comma separated list)
      • abstractwritten

        protected int abstractwritten
        The number of characters currently written to the abstract.
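
The abstract fields above describe a cap on how many characters are kept while the document is read. The following is a minimal character-level sketch of that idea, not the actual FileDocument.ReaderWrapper (which, per its description, wraps the token stream). CapturingReader, captureAbstract and the buffer size are hypothetical names and values chosen for illustration.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch only: CapturingReader is a hypothetical stand-in, not a Terrier class.
class AbstractCaptureSketch {

    /** Passes characters through unchanged while copying at most maxLength of them. */
    static class CapturingReader extends FilterReader {
        private final int maxLength;                       // plays the role of abstractlength
        private final StringBuilder captured = new StringBuilder();

        CapturingReader(Reader in, int maxLength) {
            super(in);
            this.maxLength = maxLength;
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c != -1 && captured.length() < maxLength) {
                captured.append((char) c);
            }
            return c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                // Copy only as many characters as the abstract still has room for.
                int room = Math.min(n, maxLength - captured.length());
                if (room > 0) {
                    captured.append(buf, off, room);
                }
            }
            return n;
        }

        String getAbstract() {
            return captured.toString();
        }
    }

    /** Drains the text through the wrapper and returns what was captured. */
    static String captureAbstract(String text, int maxLength) {
        try (CapturingReader reader = new CapturingReader(new StringReader(text), maxLength)) {
            char[] buf = new char[8];
            while (reader.read(buf, 0, buf.length) != -1) {
                // The consumer would tokenise here; this sketch just drains the reader.
            }
            return reader.getAbstract();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(captureAbstract("The quick brown fox", 9)); // prints "The quick"
    }
}
```

The consumer still sees every character; only the side copy is truncated, which is why the cap does not affect tokenisation in this sketch.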
    • Constructor Detail

      • FileDocument

        protected FileDocument()
      • FileDocument

        public FileDocument​(java.lang.String _filename,
                            java.io.Reader docReader,
                            Tokeniser tok)
        Creates a document for a file.
        Parameters:
        _filename -
        docReader -
        tok -
      • FileDocument

        public FileDocument​(java.lang.String _filename,
                            java.io.InputStream docStream,
                            Tokeniser tok)
        Creates a document for a file.
        Parameters:
        _filename -
        docStream -
        tok -
      • FileDocument

        public FileDocument​(java.io.Reader docReader,
                            java.util.Map<java.lang.String,​java.lang.String> docProperties,
                            Tokeniser tok)
        Creates a document for a file.
        Parameters:
        docReader -
        docProperties -
        tok -
      • FileDocument

        public FileDocument​(java.io.InputStream docStream,
                            java.util.Map<java.lang.String,​java.lang.String> docProperties,
                            Tokeniser tok)
        Constructs an instance of the FileDocument from the given input stream.
        Parameters:
        docStream - the input stream that reads the file.
    • Method Detail

      • makeFilenameProperties

        protected static java.util.Map<java.lang.String,​java.lang.String> makeFilenameProperties​(java.lang.String filename)
      • getReader

        public java.io.Reader getReader()
        Returns the underlying buffered reader, so that client code can tokenise the document itself and process it as it sees fit.
        Specified by:
        getReader in interface Document
      • getReader

        protected java.io.Reader getReader​(java.io.InputStream docStream)
        Returns a buffered reader that encapsulates the given input stream.
        Parameters:
        docStream - an input stream that we want to access as a buffered reader.
        Returns:
        the buffered reader that encapsulates the given input stream.
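
Wrapping an input stream in a buffered reader can be sketched with the standard java.io classes alone. This is an illustration of the general technique, not the FileDocument implementation. The class name BufferedReaderSketch, the helper readAll and the choice of UTF-8 are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: shows the stream-to-buffered-reader pattern with plain java.io.
class BufferedReaderSketch {

    /** Wraps a byte stream in a buffered character reader. UTF-8 is an assumption. */
    static Reader wrap(InputStream docStream) {
        return new BufferedReader(new InputStreamReader(docStream, StandardCharsets.UTF_8));
    }

    /** Hypothetical helper that reads the whole reader into a string. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello world".getBytes(StandardCharsets.UTF_8));
        try (Reader reader = wrap(in)) {
            System.out.println(readAll(reader)); // prints "hello world"
        }
    }
}
```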
      • getNextTerm

        public java.lang.String getNextTerm()
        Gets the next term from the Document.
        Specified by:
        getNextTerm in interface Document
        Returns:
        String the next term of the document. Null returns should be ignored.
      • getFields

        public java.util.Set<java.lang.String> getFields()
        Returns null because there is no support for fields with file documents.
        Specified by:
        getFields in interface Document
        Returns:
        null.
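
Because getFields() returns null here, callers that iterate over field names may want a guard. The helper below is a small sketch of one way to do that; fieldsOrEmpty is a hypothetical name, not part of the API.

```java
import java.util.Collections;
import java.util.Set;

// Sketch only: a null-safe guard for callers of a getFields()-style method.
class NullFieldsSketch {

    /** Returns the given set, or an empty set when the document reports no field support. */
    static Set<String> fieldsOrEmpty(Set<String> fields) {
        return fields == null ? Collections.<String>emptySet() : fields;
    }

    public static void main(String[] args) {
        // A file document would pass null here, so the loop body never runs.
        for (String field : fieldsOrEmpty(null)) {
            System.out.println(field);
        }
        System.out.println(fieldsOrEmpty(null).size()); // prints 0
    }
}
```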
      • endOfDocument

        public boolean endOfDocument()
        Indicates whether the end of a document has been reached.
        Specified by:
        endOfDocument in interface Document
        Returns:
        boolean true if the end of the document has been reached, false otherwise.
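
The getNextTerm() and endOfDocument() methods are typically used together in a loop that skips null terms, as the Returns note for getNextTerm() advises. The sketch below shows that loop against a toy stand-in so it can run without the library; TermSource and ListTermSource are hypothetical types that expose only the two method signatures documented above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch only: TermSource and ListTermSource are hypothetical stand-ins, not Terrier classes.
class TermLoopSketch {

    /** Exposes the two Document methods used by the loop. */
    interface TermSource {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Toy source backed by a list; null entries mimic terms that should be ignored. */
    static class ListTermSource implements TermSource {
        private final Iterator<String> it;

        ListTermSource(List<String> terms) {
            this.it = terms.iterator();
        }

        public String getNextTerm() {
            return it.hasNext() ? it.next() : null;
        }

        public boolean endOfDocument() {
            return !it.hasNext();
        }
    }

    /** Drains the source until the end of the document, ignoring null terms. */
    static List<String> collectTerms(TermSource doc) {
        List<String> terms = new ArrayList<>();
        while (!doc.endOfDocument()) {
            String term = doc.getNextTerm();
            if (term == null) {
                continue; // null returns should be ignored
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        TermSource doc = new ListTermSource(Arrays.asList("terrier", null, "indexes", "files"));
        System.out.println(collectTerms(doc)); // prints [terrier, indexes, files]
    }
}
```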
      • getProperty

        public java.lang.String getProperty​(java.lang.String name)
        Gets a document property.
        Specified by:
        getProperty in interface Document
        Parameters:
        name - Name of the property. It is suggested, but not required, that this name should not be case insensitive.
      • setProperty

        public void setProperty​(java.lang.String name,
                                java.lang.String value)
        Sets a document property.
      • getAllProperties

        public java.util.Map<java.lang.String,​java.lang.String> getAllProperties()
        Returns the underlying map of all the properties defined by this Document.
        Specified by:
        getAllProperties in interface Document
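
The three property methods above can be pictured as thin accessors over a single map, with getAllProperties() handing back that same map and not a copy. The sketch below illustrates this under that reading of "underlying map"; PropertyHolder is a hypothetical stand-in, and any key normalisation the real class may perform is not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: PropertyHolder is a hypothetical stand-in, not a Terrier class.
class PropertySketch {

    /** Minimal holder with the same three property methods as the class above. */
    static class PropertyHolder {
        private final Map<String, String> properties;

        PropertyHolder(Map<String, String> initial) {
            this.properties = initial;
        }

        String getProperty(String name) {
            return properties.get(name);
        }

        void setProperty(String name, String value) {
            properties.put(name, value);
        }

        /** Returns the backing map itself, so later changes are visible to the caller. */
        Map<String, String> getAllProperties() {
            return properties;
        }
    }

    public static void main(String[] args) {
        PropertyHolder doc = new PropertyHolder(new HashMap<String, String>());
        doc.setProperty("filename", "report.txt");

        Map<String, String> all = doc.getAllProperties();
        doc.setProperty("encoding", "UTF-8");

        System.out.println(doc.getProperty("filename")); // prints report.txt
        System.out.println(all.containsKey("encoding")); // prints true
    }
}
```

Returning the backing map avoids a copy, at the cost that callers can modify the document's properties through it.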