Terrier Core

Two pass indexing results in incorrect inverted index

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Blocker
  • Resolution: Duplicate
  • Affects Version/s: 3.0
  • Fix Version/s: None
  • Component/s: .indexing, .structures
  • Description:
    When using two-pass indexing, the resulting inverted index contains wrong entries. Single-pass indexing is not affected.

    The error can be reproduced using this example:

    = doc0.txt =
    cats dogs horses

    = doc1.txt =
    chicken cats chicken chicken

    = Program =
    List<String> files = new ArrayList<String>();
    files.add( "doc0.txt" );
    files.add( "doc1.txt" );

    /* two pass */
    Collection col = new SimpleFileCollection( files, false );
    Collection[] collections = new Collection[] { col };

    Indexer indexer = new BasicIndexer( ApplicationSetup.TERRIER_INDEX_PATH, "test_filecollection" );
    indexer.createDirectIndex( collections );
    indexer.createInvertedIndex();

    Index index = Index.createIndex( "index", "test_filecollection" );
    Lexicon<String> lexicon = index.getLexicon();
    InvertedIndex invertedIndex = index.getInvertedIndex();

    LexiconEntry chickenEntry = lexicon.getLexiconEntry( "chicken" );
    int[][] docs = invertedIndex.getDocuments( chickenEntry );

    System.out.println( "docs[0]: " + Arrays.toString( docs[0] ) );
    System.out.println( "docs[1]: " + Arrays.toString( docs[1] ) );
    System.out.println( "docno of docs[0][0]: " + index.getMetaIndex().getItem( "docno", docs[0][0] ) );

    /* single pass */
    col = new SimpleFileCollection( files, false );
    collections = new Collection[] { col };

    BasicSinglePassIndexer singlePassIndexer = new BasicSinglePassIndexer(
        ApplicationSetup.TERRIER_INDEX_PATH, "test_filecollection_singlepass" );
    singlePassIndexer.createInvertedIndex( collections );

    index = Index.createIndex( "index", "test_filecollection_singlepass" );
    lexicon = index.getLexicon();
    invertedIndex = index.getInvertedIndex();

    chickenEntry = lexicon.getLexiconEntry( "chicken" );
    docs = invertedIndex.getDocuments( chickenEntry );

    System.out.println( "docs[0]: " + Arrays.toString( docs[0] ) );
    System.out.println( "docs[1]: " + Arrays.toString( docs[1] ) );
    System.out.println( "docno of docs[0][0]: " + index.getMetaIndex().getItem( "docno", docs[0][0] ) );

    = Output =
    docs[0]: [0]
    docs[1]: [1]
    docno of docs[0][0]: 1

    docs[0]: [1]
    docs[1]: [3]
    docno of docs[0][0]: 2
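
For reference, the postings one would expect for "chicken" can be derived by hand from the two documents. The following is a minimal sketch that does not use Terrier at all: the class name ExpectedPostings and the method build are hypothetical, and it assumes plain whitespace tokenization (no stemming or stopword removal) and 0-based docids assigned in file order. Under those assumptions it prints the docid list and the term-frequency list for "chicken", which match the single-pass values of docs[0] and docs[1] above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ExpectedPostings {
    // Hypothetical helper: builds term -> (docid -> term frequency).
    // Assumes whitespace tokenization and docids assigned in list order.
    static Map<String, SortedMap<Integer, Integer>> build(List<String> docs) {
        Map<String, SortedMap<Integer, Integer>> index = new TreeMap<>();
        for (int docid = 0; docid < docs.size(); docid++) {
            for (String term : docs.get(docid).trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // skip empty tokens from blank documents
                }
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(docid, 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "cats dogs horses",              // doc0.txt
            "chicken cats chicken chicken"   // doc1.txt
        );
        SortedMap<Integer, Integer> chicken = build(docs).get("chicken");
        System.out.println("docids: " + chicken.keySet()); // docids: [1]
        System.out.println("tfs:    " + chicken.values()); // tfs:    [3]
    }
}
```

"chicken" occurs only in doc1.txt, three times, so a posting of docid 0 with frequency 1 (the first block of output above) cannot be correct for this collection.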

Activity

Craig Macdonald added a comment - 08/Jul/10 12:57 PM

Thanks for the report. I will investigate shortly.

Craig Macdonald added a comment - 08/Jul/10 3:36 PM

Sorry, I can't reproduce this. I tried both with trunk, and a virgin copy of Terrier 3.0. Can you try also with a virgin Terrier 3.0?

INFO - NEXT: doc0.txt
INFO - NEXT: doc1.txt
INFO - Collection #0 took 0 seconds to index (2 documents)
INFO - Key docno values are sorted in meta index, consider binary searching zdata file
INFO - 1 lexicons to merge
INFO - Optimising structure lexicon
Optimsing lexicon with 4 entries
INFO - Started building the inverted index...
INFO - Started building the inverted index...
INFO - Iteration 1 of 1 iterations
INFO - Optimising structure lexicon
Optimsing lexicon with 4 entries
INFO - Finished building the inverted index...
INFO - Time elapsed for inverted file: 0
INFO - Structure meta reading lookup file into memory
INFO - Structure meta reading reverse map for key docno directly from disk
INFO - Structure meta loading data file into memory
docs[0]: [1]
docs[1]: [3]
docno of docs[0][0]: 2
INFO - Creating IF (no direct file)..
INFO - NEXT: doc0.txt
INFO - NEXT: doc1.txt
INFO - Collection #0 took 0 seconds to build the runs for 2 documents

INFO - Key docno values are sorted in meta index, consider binary searching zdata file
INFO - Merging 1 runs...
INFO - Collection #0 took 0 seconds to merge
 
INFO - Collection #0 total time 0
INFO - Optimising structure lexicon
Optimsing lexicon with 4 entries
All ids for structure lexicon are aligned, skipping .fsomapid file
INFO - Structure meta reading lookup file into memory
INFO - Structure meta reading reverse map for key docno directly from disk
INFO - Structure meta loading data file into memory
docs[0]: [1]
docs[1]: [3]
docno of docs[0][0]: 2
Philipp Sorg added a comment - 09/Jul/10 11:01 AM

I tried again using a virgin copy of Terrier 3.0 and also ran the test on a Linux server.

On the server (Debian, x64) the results are correct. However, on my desktop (Windows 7, x64) the error still remains. This seems to be a platform-specific problem.

Craig Macdonald added a comment - 09/Jul/10 11:03 AM

Ah, now I understand. See TR-116, which describes a file not being closed. If this turns out to be the problem, then I'll close this issue as a duplicate.
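
The details of TR-116 are not shown in this issue, so the following is only a general illustration, not Terrier's code. Windows is generally stricter than Linux about renaming, deleting, or reopening a file that still has an open handle, which is one way an unclosed file can produce platform-specific behaviour. A minimal sketch of the safe pattern, with the hypothetical names CloseBeforeReuse and writeThenRename, closes the stream before the file is touched again:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CloseBeforeReuse {
    // Hypothetical helper: writes data to a temporary file, closes it,
    // and only then renames it to its final name.
    static Path writeThenRename(String data) throws IOException {
        Path tmp = Files.createTempFile("structure", ".tmp");
        // try-with-resources releases the handle before the rename below.
        try (OutputStream out = Files.newOutputStream(tmp)) {
            out.write(data.getBytes(StandardCharsets.UTF_8));
        }
        Path target = tmp.resolveSibling(tmp.getFileName() + ".final");
        return Files.move(tmp, target);
    }

    public static void main(String[] args) throws IOException {
        Path p = writeThenRename("chicken 1 3");
        System.out.println(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
        Files.delete(p); // clean up the illustration file
    }
}
```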

Philipp Sorg added a comment - 09/Jul/10 12:23 PM

The patch for TR-116 fixes the problem; this bug is a duplicate.

Craig Macdonald added a comment - 09/Jul/10 1:06 PM

Duplicate of TR-116. Thanks for raising the issue, Philipp.


Dates

  • Created: 08/Jul/10 8:53 AM
  • Updated: 09/Jul/10 1:06 PM
  • Resolved: 09/Jul/10 1:06 PM