Class PDFDocument

  • All Implemented Interfaces:
    Document

    public class PDFDocument
    extends FileDocument
    Implements a Document object for reading PDF documents, using Apache PDFBox.
    Author:
    Craig Macdonald
    • Field Detail

      • logger

        protected static final org.slf4j.Logger logger
    • Constructor Detail

      • PDFDocument

        public PDFDocument(java.lang.String filename,
                           java.io.InputStream docStream,
                           Tokeniser tokeniser)
        Constructs a new PDFDocument that converts docStream, the input stream representing the file, into a Document object from which an Indexer can retrieve a stream of terms.
        Parameters:
        docStream - the input stream that represents the document's file.
      • PDFDocument

        public PDFDocument(java.io.InputStream docStream,
                           java.util.Map<java.lang.String,java.lang.String> docProperties,
                           Tokeniser tok)
        Constructs a new PDFDocument from an input stream and a map of document properties.
        Parameters:
        docStream -
        docProperties -
        tok -
      • PDFDocument

        public PDFDocument(java.io.Reader docReader,
                           java.util.Map<java.lang.String,java.lang.String> docProperties,
                           Tokeniser tok)
        Constructs a new PDFDocument from a Reader and a map of document properties.
        Parameters:
        docReader -
        docProperties -
        tok -
      • PDFDocument

        public PDFDocument(java.lang.String filename,
                           java.io.Reader docReader,
                           Tokeniser tok)
        Constructs a new PDFDocument from a filename and a Reader.
        Parameters:
        filename -
        docReader -
        tok -
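        The first constructor above describes turning an input stream into a stream of terms. The flow can be sketched in plain Java as follows. This is only an illustration under stated assumptions: TermStreamSketch, terms, termsOf and termsOfText are hypothetical names, the splitting rule merely stands in for a Tokeniser, and no PDF parsing is performed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in; not part of Terrier or PDFBox. */
class TermStreamSketch {

    /** Assumed tokenising rule: lower-cased runs of letters and digits. */
    static List<String> terms(Reader reader) throws IOException {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase((char) c));
            } else if (current.length() > 0) {
                // A non-alphanumeric character ends the current term.
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    /** Wraps the stream in a Reader, then pulls terms from it. */
    static List<String> termsOf(InputStream docStream) {
        try (Reader r = new BufferedReader(
                new InputStreamReader(docStream, StandardCharsets.UTF_8))) {
            return terms(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience helper for trying the sketch on in-memory text. */
    static List<String> termsOfText(String text) {
        return termsOf(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

        In the real class, the Reader would come from the PDF conversion and the terms from the supplied Tokeniser; the sketch only shows the shape of the stream-to-terms hand-off.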
    • Method Detail

      • getReader

        protected java.io.Reader getReader(java.io.InputStream is)
        Returns a Reader over the document's text, suitable for parsing terms, created by converting the file represented by the parameter is. The conversion runs the stream through the PDFParser and related classes provided by the org.pdfbox library. On error, the method returns null and sets EOD to true, so no terms can be read from this document.
        Overrides:
        getReader in class FileDocument
        Parameters:
        is - the input stream that represents the document's file.
        Returns:
        a Reader that is fed to an indexer, or null on error.
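
        The error contract described for getReader (return null and set EOD to true) can be sketched with plain Java stand-ins. Everything here is an assumption made for illustration: GetReaderSketch, extractText and streamOf are hypothetical names, and the "%PDF" prefix check only mimics a parser that may fail; it is not how PDFBox or Terrier is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in; not a Terrier or PDFBox class. */
class GetReaderSketch {

    /** Mirrors the documented EOD flag: true means no terms can be read. */
    boolean EOD = false;

    /** Assumed stand-in for PDF text extraction: rejects input lacking a "%PDF" prefix. */
    static String extractText(InputStream is) throws IOException {
        String raw = new String(is.readAllBytes(), StandardCharsets.ISO_8859_1);
        if (!raw.startsWith("%PDF")) {
            throw new IOException("not a PDF stream");
        }
        return raw.substring(4).trim();
    }

    /** On success returns a Reader over the text; on error returns null and sets EOD. */
    Reader getReader(InputStream is) {
        try {
            return new StringReader(extractText(is));
        } catch (IOException e) {
            EOD = true;
            return null;
        }
    }

    /** Convenience helper for building an in-memory stream. */
    static InputStream streamOf(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

        A caller following this pattern would check the returned Reader for null before attempting to read terms.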