Class SimpleFileCollection

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable, Collection

    public class SimpleFileCollection
    extends java.lang.Object
    implements Collection
    Implements a collection that can read arbitrary files on disk. It will use the file list given to it in the constructor, or it will read the file specified by the property collection.spec. Properties:
    • indexing.simplefilecollection.extensionsparsers - a comma-delimited list of tuples, in the form "extension:DocumentClass". For instance, one tuple could be "txt:FileDocument". The default is txt:FileDocument,text:FileDocument,tex:FileDocument,bib:FileDocument,pdf:PDFDocument,html:TaggedDocument,htm:TaggedDocument,xhtml:TaggedDocument,xml:TaggedDocument,doc:MSWordDocument,ppt:MSPowerpointDocument,xls:MSExcelDocument.
    • indexing.simplefilecollection.defaultparser - the default parser for any unknown extensions. If this property is empty, then such documents will not be opened.
    • indexing.simplefilecollection.recurse - whether directories should be opened looking for files.
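    The tuple format of indexing.simplefilecollection.extensionsparsers can be illustrated with a small stand-alone parser. This is only a sketch: the class ExtensionSpecParser and its parse method are hypothetical names, and the trimming of whitespace, the lower-casing of extensions and the skipping of malformed tuples are assumptions rather than documented behaviour of SimpleFileCollection.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Terrier: parses "ext:Class,ext:Class" specs.
class ExtensionSpecParser {

    // Returns a map from filename extension to parser class name.
    static Map<String, String> parse(String spec) {
        Map<String, String> mapping = new LinkedHashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return mapping;
        }
        for (String tuple : spec.split(",")) {
            String[] parts = tuple.trim().split(":", 2);
            // Assumption: tuples lacking an extension or a class name are skipped.
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                continue;
            }
            // Assumption: extensions are matched case-insensitively.
            mapping.put(parts[0].trim().toLowerCase(Locale.ROOT), parts[1].trim());
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(parse("txt:FileDocument,pdf:PDFDocument,html:TaggedDocument"));
    }
}
```

    A LinkedHashMap is used here so that the tuples keep the order in which they were written, which makes the parsed mapping easier to compare with the property value.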
    Author:
    Craig Macdonald & Vassilis Plachouras
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected java.lang.String currentDocno
      Overridden docno for the current document.
      protected java.io.InputStream currentStream
      The InputStream of the most recently opened document.
      static java.lang.String DEFAULT_MAPPING_PROPERTY
      What to parse each file type with
      protected int DocCounter
      The identifier of a document in the collection.
      protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
      Maps filename extensions to Document classes.
      protected java.util.LinkedList<java.lang.String> FileList
      The list of files to index.
      protected java.util.List<java.lang.String> firstList
      Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.
      protected java.util.List<java.lang.String> indexedFiles
      This is filled during traversal, so document IDs can be matched with filenames
      protected static org.slf4j.Logger logger  
      static java.lang.String NAMESPACE_DOCUMENTS
      The default namespace for all parsers to be loaded from.
      protected boolean Recurse
      Whether directories should be recursed into by this class
      protected java.lang.String thisFilename
      The filename of the current file we are processing.
      protected Tokeniser tokeniser  
    • Constructor Summary

      Constructors 
      Constructor Description
      SimpleFileCollection()
      Default constructor. Reads the list of files to be processed from the file named by the property collection.spec.
      SimpleFileCollection(java.lang.String addressCollectionFilename)
      Creates an instance of the class.
      SimpleFileCollection(java.lang.String addressCollectionFilename, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
      Additional constructor required by TRECIndexing; the three trailing String arguments are, as their names indicate, ignored.
      SimpleFileCollection(java.util.List<java.lang.String> filelist, boolean recurse)
      Constructs an instance of the class with the given list of files.
      SimpleFileCollection(java.util.List<java.lang.String> filelist, java.lang.String ignored1, java.lang.String ignored2, java.lang.String ignored3)
      Additional constructor required by TRECIndexing; the three trailing String arguments are, as their names indicate, ignored.
    • Method Summary

      Modifier and Type Method Description
      protected void addDirectoryListing()
      Called when thisFile is identified as a directory. Adds the entire contents of the directory to the list of files to be processed.
      void close()  
      protected void createExtensionDocumentMapping()
      Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the mentioned classes into a map from filename extension to the respective parser.
      boolean endOfCollection()
      Checks whether there are more documents in the collection.
      java.lang.String getDocCounter()
      Returns the current document's identifier string.
      Document getDocument()
      Return the current document in the collection.
      java.util.List<java.lang.String> getFileList()
      Returns the list of indexed files, in the order in which they were indexed.
      boolean hasNext()
      Check whether there is a next document in the collection to be processed
      protected Document makeDocument(java.lang.String Filename, java.io.InputStream in)
      Given the opened stream in for the file Filename, works out which parser to try and instantiates it.
      Document next()
      Moves on to the next document in the collection to be processed.
      boolean nextDocument()
      Moves on to the next document in the collection to be processed.
      void remove()
      Unsupported by this Collection implementation; throws UnsupportedOperationException on all invocations.
      void reset()
      Starts again from the beginning of the collection.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
      • NAMESPACE_DOCUMENTS

        public static final java.lang.String NAMESPACE_DOCUMENTS
        The default namespace for all parsers to be loaded from. Only used if the class name specified does not contain any periods ('.')
        See Also:
        Constant Field Values
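        The rule that the namespace is only applied to class names without periods can be sketched as follows. The class ParserNameResolver, its resolve method and the namespace string "com.example.parsers." are hypothetical placeholders; the actual value of NAMESPACE_DOCUMENTS is not reproduced here, and the sketch assumes the namespace ends with a period.

```java
// Hypothetical helper, not part of Terrier.
class ParserNameResolver {

    // Prefixes unqualified class names with the given namespace.
    // Assumption: the namespace already ends with '.'.
    static String resolve(String className, String namespace) {
        if (className.indexOf('.') >= 0) {
            return className; // contains a period, so treated as fully qualified
        }
        return namespace + className;
    }

    public static void main(String[] args) {
        // "com.example.parsers." is a placeholder, not the real constant value.
        System.out.println(resolve("FileDocument", "com.example.parsers."));
        System.out.println(resolve("org.acme.MyDocument", "com.example.parsers."));
    }
}
```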
      • DEFAULT_MAPPING_PROPERTY

        public static final java.lang.String DEFAULT_MAPPING_PROPERTY
        What to parse each file type with
        See Also:
        Constant Field Values
      • FileList

        protected java.util.LinkedList<java.lang.String> FileList
        The list of files to index.
      • firstList

        protected java.util.List<java.lang.String> firstList
        Contains the list of files first handed to the SimpleFileCollection, allowing the SimpleFileCollection instance to be simply reset.
      • indexedFiles

        protected java.util.List<java.lang.String> indexedFiles
        This is filled during traversal, so document IDs can be matched with filenames
      • DocCounter

        protected int DocCounter
        The identifier of a document in the collection.
      • currentDocno

        protected java.lang.String currentDocno
        Overridden docno for the current document.
      • Recurse

        protected boolean Recurse
        Whether directories should be recursed into by this class
      • extension_DocumentClass

        protected java.util.Map<java.lang.String,java.lang.Class<? extends Document>> extension_DocumentClass
        Maps filename extensions to Document classes. The entry |DEFAULT| maps to the default document parser, specified by indexing.simplefilecollection.defaultparser
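        The lookup with a |DEFAULT| fallback described above might look like the sketch below. ParserLookup and lookup are hypothetical names, the map values are generic so that the example runs without Terrier on the classpath, and the way the extension is extracted from the filename (text after the last period of the last path element, lower-cased) is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Terrier.
class ParserLookup {

    static final String DEFAULT_KEY = "|DEFAULT|";

    // Returns the parser registered for the file's extension, the default
    // parser if the extension is unknown, or null if neither exists.
    static <T> T lookup(Map<String, T> mapping, String filename) {
        // Ignore periods that belong to directory names.
        int sep = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
        String name = filename.substring(sep + 1);
        int dot = name.lastIndexOf('.');
        String ext = dot >= 0 ? name.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
        T parser = mapping.get(ext);
        return parser != null ? parser : mapping.get(DEFAULT_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("pdf", "PDFDocument");
        mapping.put(DEFAULT_KEY, "FileDocument");
        System.out.println(lookup(mapping, "reports/summary.pdf"));
        System.out.println(lookup(mapping, "reports/archive.xyz"));
    }
}
```

        When no |DEFAULT| entry is present, the lookup returns null for unknown extensions, which corresponds to such documents not being opened.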
      • thisFilename

        protected java.lang.String thisFilename
        The filename of the current file we are processing.
      • currentStream

        protected java.io.InputStream currentStream
        The InputStream of the most recently opened document. This is used to ensure that files are closed once they have been finished reading.
    • Constructor Detail

      • SimpleFileCollection

        public SimpleFileCollection(java.util.List<java.lang.String> filelist,
                                    boolean recurse)
        Constructs an instance of the class with the given list of files.
        Parameters:
        filelist - List of the files to be processed by this collection.
        recurse - boolean whether directories should be recursed into.
      • SimpleFileCollection

        public SimpleFileCollection()
        Default constructor. Reads the list of files to be processed from the file named by the property collection.spec.
      • SimpleFileCollection

        public SimpleFileCollection(java.util.List<java.lang.String> filelist,
                                    java.lang.String ignored1,
                                    java.lang.String ignored2,
                                    java.lang.String ignored3)
        Additional constructor required by TRECIndexing; the three trailing String arguments are, as their names indicate, ignored.
      • SimpleFileCollection

        public SimpleFileCollection(java.lang.String addressCollectionFilename,
                                    java.lang.String ignored1,
                                    java.lang.String ignored2,
                                    java.lang.String ignored3)
        Additional constructor required by TRECIndexing; the three trailing String arguments are, as their names indicate, ignored.
      • SimpleFileCollection

        public SimpleFileCollection(java.lang.String addressCollectionFilename)
        Creates an instance of the class. The files to be processed are specified in the file with the given name.
        Parameters:
        addressCollectionFilename - String the name of the file that contains the list of files to be processed by this collection.
    • Method Detail

      • createExtensionDocumentMapping

        protected void createExtensionDocumentMapping()
        Parses the properties indexing.simplefilecollection.extensionsparsers and indexing.simplefilecollection.defaultparser and attempts to load all the mentioned classes into a map from filename extension to the respective parser. If indexing.simplefilecollection.defaultparser is set, that class is used to parse documents for which no explicit parser is set.
      • hasNext

        public boolean hasNext()
        Checks whether there is a next document in the collection to be processed.
        Returns:
        true if there is another document to be processed
      • next

        public Document next()
        Moves on to the next document in the collection to be processed.
        Returns:
        next document
      • remove

        public void remove()
        Unsupported by this Collection implementation; throws UnsupportedOperationException on all invocations.
      • nextDocument

        public boolean nextDocument()
        Moves on to the next document in the collection to be processed.
        Specified by:
        nextDocument in interface Collection
        Returns:
        boolean true if there are more documents in the collection, otherwise false.
      • getDocument

        public Document getDocument()
        Return the current document in the collection.
        Specified by:
        getDocument in interface Collection
        Returns:
        Document the current document object from the collection.
      • makeDocument

        protected Document makeDocument(java.lang.String Filename,
                                        java.io.InputStream in)
        Given the opened stream in for the file Filename, works out which parser to try and instantiates it. To use a different constructor for opening documents, override this method in a subclass.
        Parameters:
        Filename - the filename of the currently open document
        in - The stream of the currently open document
        Returns:
        Document object to parse the document, or null if no suitable parser exists.
      • endOfCollection

        public boolean endOfCollection()
        Checks whether there are more documents in the collection.
        Specified by:
        endOfCollection in interface Collection
        Returns:
        boolean true if there are no more documents in the collection, otherwise it returns false.
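        The interplay of nextDocument, getDocument and endOfCollection follows an advance-then-fetch pattern, sketched below. So that it runs without Terrier on the classpath, the sketch uses a stand-in interface DocCursor and a list-backed stub ListCursor; both are hypothetical, as is the CursorDemo.drain helper, and the stub's reset method only mimics restarting from the beginning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the three cursor methods described above; not the Terrier interface.
interface DocCursor<D> {
    boolean nextDocument();
    D getDocument();
    boolean endOfCollection();
}

// List-backed stub so the loop below can run without Terrier.
class ListCursor implements DocCursor<String> {
    private final List<String> docs;
    private int position = -1; // before the first document

    ListCursor(List<String> docs) { this.docs = docs; }

    public boolean nextDocument() {
        if (position + 1 >= docs.size()) {
            position = docs.size(); // move past the end
            return false;
        }
        position++;
        return true;
    }

    public String getDocument() {
        return position >= 0 && position < docs.size() ? docs.get(position) : null;
    }

    // True when no document remains after the current position.
    public boolean endOfCollection() { return position + 1 >= docs.size(); }

    public void reset() { position = -1; }
}

class CursorDemo {
    // Drains a cursor: advance first, then fetch the current document.
    static <D> List<D> drain(DocCursor<D> cursor) {
        List<D> seen = new ArrayList<>();
        while (cursor.nextDocument()) {
            D doc = cursor.getDocument();
            if (doc != null) { // assumption: a file without a parser may yield null
                seen.add(doc);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(drain(new ListCursor(Arrays.asList("a.txt", "b.pdf"))));
    }
}
```

        The null check is included because makeDocument is documented as returning null when no suitable parser exists; whether getDocument can return null in practice is an assumption of this sketch.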
      • reset

        public void reset()
        Starts again from the beginning of the collection.
        Specified by:
        reset in interface Collection
      • getDocCounter

        public java.lang.String getDocCounter()
        Returns the current document's identifier string.
        Returns:
        String the identifier of the current document.
      • close

        public void close()
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
      • getFileList

        public java.util.List<java.lang.String> getFileList()
        Returns the list of indexed files, in the order in which they were indexed.
      • addDirectoryListing

        protected void addDirectoryListing()
        Called when thisFile is identified as a directory. Adds the entire contents of the directory to the list of files to be processed.
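        Expanding a directory onto the list of pending files can be sketched with java.io.File as shown below. WorkListTraversal and collectFiles are hypothetical names, and skipping directories when recurse is false is an assumption about how the recurse setting behaves; this is an illustration of the work-list idea, not the class's actual implementation.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper, not part of Terrier.
class WorkListTraversal {

    // Returns the regular files reachable from the initial paths.
    // Directories are expanded only when recurse is true; otherwise they are skipped.
    static List<String> collectFiles(List<String> initial, boolean recurse) {
        LinkedList<String> pending = new LinkedList<>(initial);
        List<String> files = new ArrayList<>();
        while (!pending.isEmpty()) {
            File current = new File(pending.removeFirst());
            if (current.isDirectory()) {
                File[] children = recurse ? current.listFiles() : null;
                if (children != null) { // listFiles() returns null on an I/O error
                    for (File child : children) {
                        pending.add(child.getPath());
                    }
                }
            } else if (current.isFile()) {
                files.add(current.getPath());
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> start = new ArrayList<>();
        start.add(args.length > 0 ? args[0] : ".");
        for (String f : collectFiles(start, true)) {
            System.out.println(f);
        }
    }
}
```

        Because the directory contents are appended to the same pending list, nested directories are handled by the same loop without explicit recursion. The sketch does not guard against symbolic-link cycles.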