Terrier Core

Partitioned Mode fails unexpectedly due to missing run status files

Details

  • Type: Bug
  • Status: Closed
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: 2.2
  • Fix Version/s: 2.2.1
  • Component/s: .indexing
  • Description:
    Partitioned Mode likely fails because the run status files it needs are lost.

    Possible Cause:
    attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/IP:54476 remote=/IP:50010]
    attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_5895289464510919755_580904
    attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
    attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/130.209.249.49:54483 remote=/130.209.249.49:50010]
    attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_1308099895179166256_580904
    attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
    attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink IP:50010
    attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_4491420624309706092_580904
    attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
    attempt_200901201748_0001_r_000000_0: INFO - Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink IP:50010
    attempt_200901201748_0001_r_000000_0: INFO - Abandoning block blk_-827893635345658476_580906
    attempt_200901201748_0001_r_000000_0: INFO - Waiting to find target node: IP:50010
    attempt_200901201748_0001_r_000000_0: WARN - Error running child
    attempt_200901201748_0001_r_000000_0: java.io.IOException: Could not load index from (hdfs://master:9000/user/richardm/mapred-12-08_E1_3,task_200901201748_0001_m_000000) because Index not found: hdfs://master:9000/user/richardm/mapred-12-08_E1_3/task_200901201748_0001_m_000000.properties and hdfs://trmaster:9000/user/richardm/mapred-12-08_E1_3/task_200901201748_0001_m_000000.log both not found.
    attempt_200901201748_0001_r_000000_0: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.closeReduce(Hadoop_BasicSinglePassIndexer.java:529)
    attempt_200901201748_0001_r_000000_0: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.close(Hadoop_BasicSinglePassIndexer.java:160)
    attempt_200901201748_0001_r_000000_0: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:324)
    attempt_200901201748_0001_r_000000_0: at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

    Effect:
    attempt_200901201748_0001_r_000000_2: java.io.IOException: No run status files found in hdfs://master:9000/user/richardm/mapred-12-08_E1_3
    attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.loadRunData(Hadoop_BasicSinglePassIndexer.java:393)
    attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:452)
    attempt_200901201748_0001_r_000000_2: at uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer.reduce(Hadoop_BasicSinglePassIndexer.java:97)
    attempt_200901201748_0001_r_000000_2: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
    attempt_200901201748_0001_r_000000_2: at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
  1. partitionModePatch.patch (3 kB), Craig Macdonald, 22/Jan/09 11:13 PM
  2. partitionModePatch.v2.patch (4 kB), Craig Macdonald, 28/Jan/09 9:30 PM

Activity

Craig Macdonald added a comment - 22/Jan/09 12:58 PM

This looks like a datanode timeout rather than a Terrier problem?

Richard McCreadie added a comment - 22/Jan/09 6:57 PM

The first error was a standard DFS busy timeout.

The actual error was caused by one reducer deleting all the files that the other reducers needed to run.

We may just want to hold off deleting those files until the whole job has finished.

Craig Macdonald added a comment - 22/Jan/09 11:13 PM

In Hadoop 0.19, there is an OutputCommitter API that would let us clean up after the job completes.

The attached patch fixes the problem for Hadoop 0.18, by deleting all files starting with a taskid matching the jobid when the job ends.
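
The selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the attached patch: the class and method names (JobFileCleanup, taskPrefixForJob, selectFilesToDelete) are hypothetical, and the sketch works on a plain list of file names rather than calling the HDFS API. It relies on the Hadoop naming convention visible in the logs in the description, where the tasks of job job_200901201748_0001 write files named task_200901201748_0001_m_000000.properties and so on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JobFileCleanup {

    // Hypothetical helper: turn a job ID such as "job_200901201748_0001"
    // into the prefix shared by the task IDs of that job.
    public static String taskPrefixForJob(String jobId) {
        final String jobMarker = "job_";
        if (!jobId.startsWith(jobMarker)) {
            throw new IllegalArgumentException("Not a job ID: " + jobId);
        }
        // The trailing underscore keeps job ..._0001 from also matching
        // the files of a job whose number merely starts with 0001.
        return "task_" + jobId.substring(jobMarker.length()) + "_";
    }

    // Hypothetical helper: pick the files that belong to the given job.
    // Intended to be called once, after the whole job has ended, so that
    // no reducer removes files another reducer still has to read.
    public static List<String> selectFilesToDelete(List<String> fileNames, String jobId) {
        String prefix = taskPrefixForJob(jobId);
        List<String> toDelete = new ArrayList<>();
        for (String name : fileNames) {
            if (name.startsWith(prefix)) {
                toDelete.add(name);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        // Assumed directory listing, modelled on the file names in the logs.
        List<String> listing = Arrays.asList(
                "task_200901201748_0001_m_000000.properties",
                "task_200901201748_0001_m_000000.log",
                "task_200901201748_0002_m_000000.properties",
                "data.properties");
        for (String name : selectFilesToDelete(listing, "job_200901201748_0001")) {
            System.out.println("would delete: " + name);
        }
    }
}
```

In a real job the listing would come from the output directory on HDFS and the selected paths would be deleted there; the point of the sketch is only that the match is made per job and the deletion is deferred until the job has finished.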

Craig Macdonald added a comment - 28/Jan/09 9:30 PM

Tested patch

Craig Macdonald added a comment - 28/Jan/09 9:32 PM

Committed.


Dates

  • Created: 21/Jan/09 5:23 PM
  • Updated: 29/Jan/09 7:33 PM
  • Resolved: 28/Jan/09 9:32 PM