Terrier Core

Single pass indexing tries to merge too many run files at once

Details

  • Type: Bug
  • Status: Open
  • Priority: Major
  • Resolution: Unresolved
  • Affects Version/s: 2.2.1
  • Fix Version/s: None
  • Component/s: .indexing
  • Description:
    The bug was found when running single-pass indexing on the Blogs06Collection TREC collection with the String UTF property set. Indexing was running overnight as a batch process when the error occurred.

    The error occurred under JDK 1.6.0_10 (32-bit) on CentOS 5 Linux.

    Stopword removal and stemming with the Porter Stemmer were applied, and block indexing was enabled.

    {code}
    INFO - Merging 577 runs...
    ERROR - Problem in performMultiWayMerge
    java.io.FileNotFoundException: /path/to/index/data_1Run.500.str (Too many open files)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileInputStream.<init>(FileInputStream.java:66)
    at uk.ac.gla.terrier.utility.io.LocalFileSystem.openFileStream(LocalFileSystem.java:97)
    at uk.ac.gla.terrier.utility.Files.openFile(Files.java:199)
    at uk.ac.gla.terrier.utility.Files.openFileStream(Files.java:524)
    at uk.ac.gla.terrier.structures.indexing.singlepass.FileRunIterator.<init>(FileRunIterator.java:67)
    at uk.ac.gla.terrier.structures.indexing.singlepass.FileRunIteratorFactory.createRunIterator(FileRunIteratorFactory.java:44)
    at uk.ac.gla.terrier.structures.indexing.singlepass.RunsMerger.init(RunsMerger.java:177)
    at uk.ac.gla.terrier.structures.indexing.singlepass.RunsMerger.init(RunsMerger.java:170)
    at uk.ac.gla.terrier.structures.indexing.singlepass.RunsMerger.beginMerge(RunsMerger.java:190)
    at uk.ac.gla.terrier.indexing.BasicSinglePassIndexer.performMultiWayMerge(BasicSinglePassIndexer.java:316)
    at uk.ac.gla.terrier.indexing.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:238)
    at uk.ac.gla.terrier.indexing.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:136)
    at uk.ac.gla.terrier.indexing.Indexer.index(Indexer.java:314)
    at uk.ac.gla.terrier.applications.TRECIndexing.createSinglePass(TRECIndexing.java:203)
    at TrecTerrier.run(TrecTerrier.java:399)
    at TrecTerrier.applyOptions(TrecTerrier.java:565)
    at TrecTerrier.main(TrecTerrier.java:244)
    INFO - Collection #0 took 24 seconds to merge
    {code}

Activity

Craig Macdonald added a comment - 12/Feb/09 5:24 PM

This is an interesting error that we haven't observed before.

Note that indexing Blog06 normally works fine. In this case, however, a combination of factors caused an unusually large number of runs to be created: (a) the memory heap size, (b) block indexing (more runs are created because memory is exhausted more quickly) and (c) UTF indexing, which makes the strings take more space.

The central issue is that many operating systems limit the number of file handles available to a given process. You can run ulimit -n to see the current limit, but only root can raise the upper limit.
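As a rough illustration of the limit described above, the following Java sketch compares the number of files a merge would need against the descriptor limit reported by the JVM. It is not part of Terrier. The class name FileHandleCheck, the method names and the filesPerRun factor are hypothetical; the factor of 2 is only a guess based on the ".str" file name in the trace. It relies on com.sun.management.UnixOperatingSystemMXBean, which is specific to HotSpot/OpenJDK on Unix-like systems.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FileHandleCheck {

    /** Returns {open, max} descriptor counts, or null if this JVM does not expose them. */
    static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount()
            };
        }
        return null; // e.g. on Windows or a non-HotSpot JVM
    }

    /**
     * True if opening every run at once would exceed the limit.
     * filesPerRun is an assumption: how many files each run keeps open while merging.
     */
    static boolean wouldExceed(long open, long max, int runs, int filesPerRun) {
        return open + (long) runs * filesPerRun > max;
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("Descriptor counts are not available on this JVM.");
            return;
        }
        int runs = 577; // number of runs reported in the log of this issue
        System.out.println("open=" + counts[0] + " max=" + counts[1]
                + " wouldExceed=" + wouldExceed(counts[0], counts[1], runs, 2));
    }
}
```

A check like this could run before the merge starts, so that the indexer fails early with a clear message rather than partway through opening the run files.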

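The resolution that would make Terrier more robust is to have the single-pass indexer merge runs in several passes via intermediate files, so that only a limited number of run files are open at any one time.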
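The idea of merging via intermediate files can be sketched generically as a multi-pass merge with a bounded fan-in. The Java sketch below is not Terrier code and does not use Terrier's run file format: it assumes each run is a plain text file of sorted lines, and the names BoundedFanInMerge, mergeGroup, mergeAll and maxFanIn are hypothetical. At most maxFanIn readers and one writer are open at any time.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class BoundedFanInMerge {

    /** One open run plus the line currently at its head. */
    private static final class Head implements Comparable<Head> {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
        @Override public int compareTo(Head other) { return line.compareTo(other.line); }
    }

    /** Merges the given sorted runs into out; opens runs.size() readers at once. */
    static void mergeGroup(List<Path> runs, Path out) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Path run : runs) {
                BufferedReader r = Files.newBufferedReader(run, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new Head(r, first));
            }
            while (!heap.isEmpty()) {
                Head h = heap.poll();          // smallest head line across all open runs
                writer.write(h.line);
                writer.newLine();
                h.line = h.reader.readLine();  // advance that run
                if (h.line != null) heap.add(h);
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }

    /** Repeatedly merges at most maxFanIn runs at a time until one file remains. */
    static Path mergeAll(List<Path> runs, int maxFanIn, Path tmpDir) throws IOException {
        if (maxFanIn < 2) throw new IllegalArgumentException("maxFanIn must be >= 2");
        if (runs.isEmpty()) throw new IllegalArgumentException("no runs to merge");
        List<Path> current = new ArrayList<>(runs);
        while (current.size() > 1) {
            List<Path> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxFanIn) {
                List<Path> group = current.subList(i, Math.min(i + maxFanIn, current.size()));
                Path merged = Files.createTempFile(tmpDir, "merged", ".run");
                mergeGroup(group, merged);     // intermediate file for the next pass
                next.add(merged);
            }
            current = next;
        }
        return current.get(0);
    }
}
```

With 577 runs and a fan-in of 100, for example, this would take two passes: six intermediate files after the first pass, then one final merge. The sketch leaves intermediate files on disk for simplicity; a real implementation would delete them once they have been consumed.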

Alternatively, there are two work-arounds:

  • The Hadoop version of the single-pass indexer should not suffer from such issues.
  • Terrier can index the collection in smaller chunks. See the indexing.max.docs.per.builder property.
Edgardo Ambrosi added a comment - 22/Jul/09 9:24 AM

I experienced the same problem when indexing ClueWeb09. The solution that worked quite well for me was to raise the maximum number of open file descriptors to 64000.

The steps are as follows:

1) Log in as a normal user (for example: bob).
2) Type "sudo su" to become root.
3) Type "ulimit -n" to see the current limit (it should be 1024).
4) Type "ulimit -n 64000".
5) Type "su - bob" (if you simply exit from the root account instead, the new setting is lost!).
6) Run the indexing process.

It should work!

Bye

Craig Macdonald added a comment - 18/Feb/11 1:38 PM

I have looked at altering single-pass indexing to merge only a few run files at once. However, this would involve major surgery, mainly because the format of the run files differs from that of the output inverted index files. RunWriter only knows how to create run files, while RunsMerger and PostingInRun only know how to read run files and write inverted files. PostingInRun would also need to know how to write run files.


Dates

  • Created: 12/Feb/09 2:46 PM
  • Updated: 18/Feb/11 1:38 PM