Terrier Core

Indexing with multiple reducers ends up with a document index and a metaindex for ALL shards

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Blocker
  • Resolution: Fixed
  • Affects Version/s: 3.0
  • Fix Version/s: 3.0
  • Component/s: .structures
  1. TREC-45.v1.patch (8 kB), Craig Macdonald, 12/Aug/09 11:32 AM

Activity

Craig Macdonald added a comment - 11/Aug/09 8:10 PM

This issue is even more complicated. The reducer uses the side-effect files for two purposes:

  • To determine what document index and metaindex structures need to be merged for its final index
  • To determine what the docid offsets should be in the inverted index.

This means that all the docids in the shards are global, not local to the inverted index being created by that shard.

For instance, no docid in the second shard index will be less than the number of documents in the first shard index.
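The docid behaviour described above can be illustrated with a minimal sketch. It assumes each shard's starting docid is the running total of the document counts of the shards before it; the class name, method names, and document counts are hypothetical and are not taken from the Terrier code base.

```java
import java.util.Arrays;

public class ShardDocidOffsets {
    /**
     * Returns, for each shard, the global docid of its first document.
     * This is an exclusive prefix sum over the per-shard document counts.
     */
    static int[] offsets(int[] docsPerShard) {
        int[] offsets = new int[docsPerShard.length];
        int running = 0;
        for (int i = 0; i < docsPerShard.length; i++) {
            offsets[i] = running;
            running += docsPerShard[i];
        }
        return offsets;
    }

    /** Converts a global docid into a docid local to the given shard. */
    static int toLocal(int globalDocid, int shard, int[] offsets) {
        return globalDocid - offsets[shard];
    }

    public static void main(String[] args) {
        int[] docsPerShard = {100, 250, 75}; // made-up counts for illustration
        int[] off = offsets(docsPerShard);
        System.out.println(Arrays.toString(off)); // [0, 100, 350]
        // The first document of the second shard has global docid 100, local docid 0.
        System.out.println(toLocal(100, 1, off)); // 0
    }
}
```

With these made-up counts, the smallest docid in the second shard is 100, which is exactly the number of documents in the first shard.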

Craig Macdonald added a comment - 11/Aug/09 8:15 PM

The NWayMergers need to account for the inverted index docid problem.

Craig Macdonald added a comment - 11/Aug/09 10:41 PM

I have two classes in SVN that try to fix this problem for existing indices:

  • FixBadReducerIndex copies the index into a new index, fixing the docids in the inverted file, the collection statistics, and selecting only the appropriate parts of the document index and metaindex along the way.
  • FixDocumentIndexBadReducer just calculates the correct collection statistics.
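The kind of repair these two classes are described as performing might look roughly like the following. This is only a sketch under stated assumptions, not the code in SVN: it models a posting list as an array of docids and the document index as an array of document lengths, and every name in it is hypothetical.

```java
import java.util.Arrays;

public class ShardRepairSketch {
    /** Rewrites the global docids of one posting list as shard-local docids. */
    static int[] rebasePostings(int[] globalDocids, int shardOffset) {
        int[] local = new int[globalDocids.length];
        for (int i = 0; i < globalDocids.length; i++) {
            local[i] = globalDocids[i] - shardOffset;
        }
        return local;
    }

    /** Keeps only the document index entries (here: lengths) that belong to this shard. */
    static int[] sliceDocumentLengths(int[] allDocLengths, int shardOffset, int shardDocCount) {
        return Arrays.copyOfRange(allDocLengths, shardOffset, shardOffset + shardDocCount);
    }

    /** Recomputes one collection statistic, the total number of tokens, from the shard's own documents. */
    static long totalTokens(int[] docLengths) {
        long sum = 0;
        for (int len : docLengths) {
            sum += len;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up data: a document index covering all shards (2 docs in shard 0, 3 in shard 1).
        int[] allDocLengths = {5, 7, 3, 10, 2};
        int shardOffset = 2;   // shard 1 starts at global docid 2
        int shardDocCount = 3;

        int[] shardLengths = sliceDocumentLengths(allDocLengths, shardOffset, shardDocCount);
        System.out.println(Arrays.toString(shardLengths));                             // [3, 10, 2]
        System.out.println(totalTokens(shardLengths));                                 // 15
        System.out.println(Arrays.toString(rebasePostings(new int[]{2, 4}, shardOffset))); // [0, 2]
    }
}
```

In this model, the first class would correspond to applying all three steps while copying the index, and the second to recomputing the statistics only.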
Craig Macdonald added a comment - 12/Aug/09 11:32 AM

Initial version of a patch for the multi reducer problem.

Craig Macdonald added a comment - 12/Aug/09 7:51 PM

Richard and I checked this, and it does make sense. We're going to try this for Blogs08 with blocks, as a single reducer doesn't have enough disk space to index this corpus.

Craig Macdonald added a comment - 19/Aug/09 3:55 PM

Fixed version committed to SVN trunk.


Dates

  • Created: 11/Aug/09 5:37 PM
  • Updated: 05/Mar/10 4:56 PM
  • Resolved: 19/Aug/09 3:55 PM