Class CollectionStatistics

  • All Implemented Interfaces:
    java.io.Serializable, org.apache.hadoop.io.Writable
    Direct Known Subclasses:
    MemoryCollectionStatistics, MultiStats, PropertiesIndex.UpdatingCollectionStatistics

    public class CollectionStatistics
    extends java.lang.Object
    implements java.io.Serializable, org.apache.hadoop.io.Writable
    This class provides basic statistics for the indexed collection of documents, such as the average length of documents, or the total number of documents in the collection.
    After indexing, statistics are saved in the PREFIX.log file, along with the classes that should be used for the Lexicon, the DocumentIndex, the DirectIndex and the InvertedIndex. This means that an index knows how it was built and how it should be opened again.
    Author:
    Gianni Amati, Vassilis Plachouras, Craig Macdonald
    See Also:
    Serialized Form
    • Field Detail

      • numberOfFields

        protected int numberOfFields
        Number of fields used when indexing.
      • fieldTokens

        protected long[] fieldTokens
        Number of tokens in each field
      • avgFieldLengths

        protected double[] avgFieldLengths
        Average length of each field
      • fieldNames

        protected java.lang.String[] fieldNames
        Field names
      • numberOfDocuments

        protected int numberOfDocuments
        Total number of documents in the collection.
      • numberOfTokens

        protected long numberOfTokens
        Total number of tokens in the collection.
      • numberOfPointers

        protected long numberOfPointers
        Total number of pointers in the inverted file. This corresponds to the sum of the document frequencies for the terms in the lexicon.
      • numberOfUniqueTerms

        protected int numberOfUniqueTerms
        Total number of unique terms in the collection. This corresponds to the number of entries in the lexicon.
      • averageDocumentLength

        protected double averageDocumentLength
        Average length of a document in the collection.
      • hasPositions

        protected boolean hasPositions
        Whether the index has positions.
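
The relationship between the count fields and the derived averages above can be illustrated with a small standalone sketch. It does not use Terrier: the class StatsAveragesSketch and its method names are hypothetical, and it assumes (this page does not confirm it) that an average length is a token count divided by the number of documents, with 0 returned for an empty collection.

```java
// Hypothetical, standalone sketch: not Terrier's implementation.
class StatsAveragesSketch {

    // Assumed relationship: average document length = tokens / documents.
    static double averageDocumentLength(long numberOfTokens, int numberOfDocuments) {
        if (numberOfDocuments == 0) {
            return 0.0d; // assumption: avoid division by zero for an empty collection
        }
        return (double) numberOfTokens / (double) numberOfDocuments;
    }

    // Assumed relationship: per-field average = tokens in that field / documents.
    static double[] averageFieldLengths(long[] fieldTokens, int numberOfDocuments) {
        double[] averages = new double[fieldTokens.length];
        for (int i = 0; i < fieldTokens.length; i++) {
            averages[i] = numberOfDocuments == 0
                ? 0.0d
                : (double) fieldTokens[i] / (double) numberOfDocuments;
        }
        return averages;
    }
}
```

Under these assumptions, a collection of 4 documents holding 1000 tokens would have an average document length of 250 tokens.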
    • Constructor Detail

      • CollectionStatistics

        @Deprecated
        public CollectionStatistics​(int numDocs,
                                    int numTerms,
                                    long numTokens,
                                    long numPointers,
                                    long[] _fieldTokens,
                                    java.lang.String[] _fieldNames)
        Deprecated.
      • CollectionStatistics

        public CollectionStatistics​(int numDocs,
                                    int numTerms,
                                    long numTokens,
                                    long numPointers,
                                    long[] _fieldTokens,
                                    java.lang.String[] _fieldNames,
                                    boolean positions)
        Constructs an instance of the class.
        Parameters:
        numDocs - the number of documents in the collection.
        numTerms - the number of terms in the collection.
        numTokens - the number of tokens in the collection.
        numPointers - the number of pointers in the inverted file.
        _fieldTokens - the number of tokens in each field.
        _fieldNames - the field names.
        positions - whether the index has position information.
      • CollectionStatistics

        public CollectionStatistics()
        Default constructor.
    • Method Detail

      • recalculateAverageLengths

        protected void recalculateAverageLengths()
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object
      • hasPositions

        public boolean hasPositions()
        Returns true if the inverted index will have position information.
      • getAverageDocumentLength

        public double getAverageDocumentLength()
        Returns the documents' average length.
        Returns:
        the average length of the documents in the collection.
      • getNumberOfDocuments

        public int getNumberOfDocuments()
        Returns the total number of documents in the collection.
        Returns:
        the total number of documents in the collection.
      • getNumberOfPointers

        @Deprecated
        public long getNumberOfPointers()
        Deprecated.
        Returns the total number of postings in the collection.
        Returns:
        the total number of postings in the collection.
      • getNumberOfPostings

        public long getNumberOfPostings()
        Returns the total number of postings in the collection.
        Returns:
        the total number of postings in the collection.
      • getNumberOfTokens

        public long getNumberOfTokens()
        Returns the total number of tokens in the collection.
        Returns:
        the total number of tokens in the collection.
      • getNumberOfUniqueTerms

        public int getNumberOfUniqueTerms()
        Returns the total number of unique terms in the lexicon.
        Returns:
        the total number of unique terms in the lexicon.
      • getNumberOfFields

        public int getNumberOfFields()
        Returns the number of fields being used to index.
        Returns:
        the number of fields being used to index.
      • getFieldTokens

        public long[] getFieldTokens()
        Returns the length of each field in tokens.
        Returns:
        the length of each field in tokens.
      • getAverageFieldLengths

        public double[] getAverageFieldLengths()
        Returns the average length of each field in tokens.
        Returns:
        the average length of each field in tokens.
      • getFieldNames

        public java.lang.String[] getFieldNames()
        Returns the field names.
        Returns:
        the field names.
      • addStatistics

        public void addStatistics​(CollectionStatistics cs)
        Increments these collection statistics by the values in the provided collection statistics.
        Parameters:
        cs - the collection statistics to add to this instance.
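
One way to picture incrementing one set of statistics with another is the standalone sketch below. It is not Terrier's code: the class MergeableStatsSketch is hypothetical, and it assumes that document, token and pointer counts are simply summed and that the average is then recomputed. The number of unique terms is deliberately left out, because two collections can share terms and a sum would overcount.

```java
// Hypothetical, standalone sketch: not Terrier's implementation.
class MergeableStatsSketch {
    int numberOfDocuments;
    long numberOfTokens;
    long numberOfPointers;
    double averageDocumentLength;

    MergeableStatsSketch(int numDocs, long numTokens, long numPointers) {
        this.numberOfDocuments = numDocs;
        this.numberOfTokens = numTokens;
        this.numberOfPointers = numPointers;
        recalculateAverage();
    }

    // Assumed behaviour: additive counts are summed, then the average is refreshed.
    void addStatistics(MergeableStatsSketch other) {
        numberOfDocuments += other.numberOfDocuments;
        numberOfTokens += other.numberOfTokens;
        numberOfPointers += other.numberOfPointers;
        recalculateAverage();
    }

    // Assumed relationship: average document length = tokens / documents.
    private void recalculateAverage() {
        averageDocumentLength = numberOfDocuments == 0
            ? 0.0d
            : (double) numberOfTokens / (double) numberOfDocuments;
    }
}
```

The derived average is recomputed after the merge rather than averaged across the two inputs, since averaging two averages would weight collections of different sizes equally.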
      • readFields

        public void readFields​(java.io.DataInput in)
                        throws java.io.IOException
        Specified by:
        readFields in interface org.apache.hadoop.io.Writable
        Throws:
        java.io.IOException
      • readFieldsV5

        public void readFieldsV5​(java.io.DataInput in)
                          throws java.io.IOException
        Throws:
        java.io.IOException
      • write

        public void write​(java.io.DataOutput out)
                   throws java.io.IOException
        Specified by:
        write in interface org.apache.hadoop.io.Writable
        Throws:
        java.io.IOException
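
The write and readFields pair above follows the usual Writable pattern: values are written to a DataOutput and read back from a DataInput in the same order. The sketch below shows that pattern with the JDK only. WritableStatsSketch is a hypothetical class that mirrors the two method signatures without implementing the Hadoop interface, and the choice and order of fields are assumptions, not Terrier's serialized form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, standalone sketch: not Terrier's serialized form.
class WritableStatsSketch {
    int numberOfDocuments;
    long numberOfTokens;
    boolean hasPositions;

    // Writes the fields in a fixed (assumed) order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(numberOfDocuments);
        out.writeLong(numberOfTokens);
        out.writeBoolean(hasPositions);
    }

    // Reads the fields back in exactly the order used by write.
    public void readFields(DataInput in) throws IOException {
        numberOfDocuments = in.readInt();
        numberOfTokens = in.readLong();
        hasPositions = in.readBoolean();
    }

    // Serializes to an in-memory buffer and deserializes into a fresh instance.
    static WritableStatsSketch roundTrip(WritableStatsSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                original.write(out);
            }
            WritableStatsSketch copy = new WritableStatsSketch();
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the format carries no field names, the reader and writer must agree on field order. A change to that order is one plausible reason for keeping a separate version-specific reader alongside the current one.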