This is an interesting error that we haven't observed before.
Note that indexing Blog06 normally works fine. In this case, however, a combination of factors caused an unusually large number of runs to be created: (a) the memory heap size, (b) block indexing (more runs are created because memory is exhausted more quickly) and (c) UTF indexing, which makes the strings take more space.
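The central issue is that many operating systems limit the number of file handles available to a given process, and each run appears to need its own open file handle during the merge. You can use ulimit -n to see the current (soft) limit. A normal user can raise it only as far as the hard limit; only root can raise the hard limit itself.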
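As a rough illustration of the limit described above, the following Python sketch reads the same per-process value that ulimit -n reports and checks whether a given number of runs could be open at once. It is not part of Terrier: the function names and the reserved parameter (handles set aside for the JVM, logs and so on) are made up for this example, and it only works on Unix-like systems.

```python
import resource


def file_handle_limits():
    """Return the (soft, hard) limits on open file handles for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def runs_fit(num_runs, reserved=32):
    """Return True if num_runs files could be open at once.

    `reserved` is a hypothetical allowance for handles the process
    already needs (standard streams, logs, libraries).
    """
    soft, _hard = file_handle_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return num_runs + reserved <= soft


if __name__ == "__main__":
    soft, hard = file_handle_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("can open 2000 runs at once:", runs_fit(2000))
```

If the check fails for the number of runs your indexing job produces, either the limit has to be raised or fewer runs have to be open at the same time.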
The proper fix, to make Terrier more robust, is to have the single-pass indexer merge runs into intermediate files, so that only a bounded number of runs need to be open at any one time.
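To make the idea of merging to intermediate files concrete, here is a minimal Python sketch of a multi-pass merge. It is not Terrier's implementation. It assumes each run is a text file of sorted, newline-terminated lines, and the names merge_runs, merge_group and fan_in are invented for this example.

```python
import heapq
import os
import tempfile
from contextlib import ExitStack


def merge_group(paths, out_path):
    """Merge a group of sorted run files into one sorted output file."""
    with ExitStack() as stack:
        inputs = [stack.enter_context(open(p, "r", encoding="utf-8"))
                  for p in paths]
        out = stack.enter_context(open(out_path, "w", encoding="utf-8"))
        # heapq.merge reads the inputs lazily, one line at a time.
        out.writelines(heapq.merge(*inputs))


def merge_runs(run_paths, out_path, fan_in=64, tmp_dir=None):
    """Merge sorted runs, opening at most fan_in inputs (plus one output)."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    current = list(run_paths)
    created = []  # intermediate files to delete at the end
    while len(current) > fan_in:
        next_level = []
        for i in range(0, len(current), fan_in):
            group = current[i:i + fan_in]
            if len(group) == 1:
                # A lone leftover run is carried to the next pass unchanged.
                next_level.append(group[0])
                continue
            fd, tmp = tempfile.mkstemp(suffix=".run", dir=tmp_dir)
            os.close(fd)
            merge_group(group, tmp)
            created.append(tmp)
            next_level.append(tmp)
        current = next_level
    merge_group(current, out_path)
    for tmp in created:
        os.remove(tmp)
```

With fan_in set comfortably below the ulimit -n value, the number of runs no longer matters for file handles. The cost is extra disk I/O for each additional merge pass.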
Until then, there are two work-arounds:
- The Hadoop version of the single-pass indexer should not suffer from this issue.
- Terrier can index the collection in smaller chunks. See the indexing.max.docs.per.builder property.