Terrier Core

Hadoop indexing: splits are uneven

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: None
  • Fix Version/s: 3.0
  • Component/s: .structures
  • Description:
    For 256 map tasks and a corpus of 1492 files:

    Split size = 1492 / 256 ≈ 5.8 files each => all but the last split get 5 files each, and the last gets its own 5 plus the 212 leftover files.

Activity

Craig Macdonald added a comment - 09/Dec/09 6:49 PM

Resolved, in conjunction with Richard.

Iadh Ounis added a comment - 09/Dec/09 6:51 PM

... and the problem was ...?

Just curious (or perhaps I'm just looking for any excuse to stop reading).

Craig Macdonald added a comment - 09/Dec/09 6:53 PM

Good point.

We were taking the floor of the division, and adding any leftover files to the last split. For large numbers of files, this can become very uneven.

The solution is to take the ceiling of the same division. The downside is that you may end up with slightly fewer splits than requested.
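
To make the difference concrete, here is a minimal sketch of the two strategies described above. It is not the Terrier code: the function names floor_splits and ceil_splits are hypothetical, and it assumes a positive number of requested splits and that files are simply counted rather than weighed by size.

```python
def floor_splits(num_files, num_splits):
    """Old behaviour as described: floor of the division, leftovers go to the last split."""
    per_split = num_files // num_splits
    sizes = [per_split] * num_splits
    # Everything that did not divide evenly is piled onto the final split.
    sizes[-1] += num_files - per_split * num_splits
    return sizes


def ceil_splits(num_files, num_splits):
    """Fix as described: ceiling of the division; may produce fewer splits than requested."""
    # Integer ceiling division, avoiding floating point.
    per_split = (num_files + num_splits - 1) // num_splits
    sizes = []
    remaining = num_files
    while remaining > 0:
        take = min(per_split, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


old = floor_splits(1492, 256)
new = ceil_splits(1492, 256)
print(len(old), max(old))  # 256 splits, largest holds 217 files
print(len(new), max(new))  # 249 splits, largest holds 6 files
```

With the numbers from this issue, the floor variant gives 255 splits of 5 files and a final split of 217 (its own 5 plus the 212 leftover), while the ceiling variant gives 248 splits of 6 files and one of 4, i.e. 249 splits instead of the 256 requested.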


Dates

  • Created: 09/Dec/09 6:24 PM
  • Updated: 05/Mar/10 5:30 PM
  • Resolved: 09/Dec/09 6:49 PM