Terrier Core

TwitterJSONCollection doesn't work with Hadoop plug-in

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: 3.5
  • Fix Version/s: None
  • Component/s: .indexing
  • Description:
    Hi,

    I am trying to use Terrier on Hadoop to index the Tweet11 collection using the plug-in from TR-171. However, it doesn't work out of the box. It seems that TwitterJSONCollection lacks a constructor that takes an InputStream. I wrote such a constructor, but there are still other problems. I am running Terrier-3.5 on a pre-built Cloudera VM. The error output is as follows:

    ---------------------------------------------------------------

    Setting TERRIER_HOME to /home/cloudera/terrier-3.5
    INFO - Term-partitioned Mode, 26 reducers creating one inverted index.
    INFO - Copying terrier share/ directory (/home/cloudera/terrier-3.5/share) to shared storage area (hdfs://localhost.localdomain/tmp/702454286-terrier.share)
    INFO - Copying classpath to job
    INFO - Put indices into: hdfs://localhost.localdomain:8020/user/cloudera/index/tweet11
    WARN - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    WARN - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    INFO - Allocating 1 files across 1 map tasks
    INFO - Running job: job_201201201144_0020
    INFO - map 0% reduce 0%
    INFO - Task Id : attempt_201201201144_0020_m_000000_0, Status : FAILED
    java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

    attempt_201201201144_0020_m_000000_0: WARNING: The file terrier.properties was not found at location /home/cloudera/terrier-3.5/etc/terrier.properties
    attempt_201201201144_0020_m_000000_0: Assuming the value of terrier.home from the corresponding system property.
    attempt_201201201144_0020_m_000000_0: 0
    attempt_201201201144_0020_m_000000_0: WARN - Snappy native library is available
    attempt_201201201144_0020_m_000000_0: INFO - Snappy native library loaded
    attempt_201201201144_0020_m_000000_0: INFO - numReduceTasks: 26
    attempt_201201201144_0020_m_000000_0: INFO - io.sort.mb = 100
    attempt_201201201144_0020_m_000000_0: INFO - data buffer = 79691776/99614720
    attempt_201201201144_0020_m_000000_0: INFO - record buffer = 262144/327680
    attempt_201201201144_0020_m_000000_0: INFO - Reloading Application Setup
    attempt_201201201144_0020_m_000000_0: INFO - Checking memory usage every 20 maxDocPerFlush=0
    attempt_201201201144_0020_m_000000_0: INFO - Opening hdfs://localhost.localdomain:8020/user/cloudera/twitter-test.ljson.gz
    attempt_201201201144_0020_m_000000_0: INFO - Successfully loaded & initialized native-zlib library
    attempt_201201201144_0020_m_000000_0: INFO - Map task_201201201144_0020_m_000000, flush requested, containing 100 documents, flush 0
    attempt_201201201144_0020_m_000000_0: INFO - Map task_201201201144_0020_m_000000 finishing, indexed 100 in 0 flushes
    attempt_201201201144_0020_m_000000_0: WARN - Error running child
    attempt_201201201144_0020_m_000000_0: java.lang.NullPointerException
    attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
    attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
    attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
    attempt_201201201144_0020_m_000000_0: at java.security.AccessController.doPrivileged(Native Method)
    attempt_201201201144_0020_m_000000_0: at javax.security.auth.Subject.doAs(Subject.java:396)
    attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.Child.main(Child.java:264)
    attempt_201201201144_0020_m_000000_0: INFO - Runnning cleanup for the task
    INFO - Task Id : attempt_201201201144_0020_m_000000_1, Status : FAILED
    java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

    attempt_201201201144_0020_m_000000_1: WARNING: The file terrier.properties was not found at location /home/cloudera/terrier-3.5/etc/terrier.properties
    attempt_201201201144_0020_m_000000_1: Assuming the value of terrier.home from the corresponding system property.
    attempt_201201201144_0020_m_000000_1: 0
    attempt_201201201144_0020_m_000000_1: WARN - Snappy native library is available
    attempt_201201201144_0020_m_000000_1: INFO - Snappy native library loaded
    attempt_201201201144_0020_m_000000_1: INFO - numReduceTasks: 26
    attempt_201201201144_0020_m_000000_1: INFO - io.sort.mb = 100
    attempt_201201201144_0020_m_000000_1: INFO - data buffer = 79691776/99614720
    attempt_201201201144_0020_m_000000_1: INFO - record buffer = 262144/327680
    attempt_201201201144_0020_m_000000_1: INFO - Reloading Application Setup
    attempt_201201201144_0020_m_000000_1: INFO - Checking memory usage every 20 maxDocPerFlush=0
    attempt_201201201144_0020_m_000000_1: INFO - Opening hdfs://localhost.localdomain:8020/user/cloudera/twitter-test.ljson.gz
    attempt_201201201144_0020_m_000000_1: INFO - Successfully loaded & initialized native-zlib library
    attempt_201201201144_0020_m_000000_1: INFO - Map task_201201201144_0020_m_000000, flush requested, containing 100 documents, flush 0
    attempt_201201201144_0020_m_000000_1: INFO - Map task_201201201144_0020_m_000000 finishing, indexed 100 in 0 flushes
    attempt_201201201144_0020_m_000000_1: WARN - Error running child
    attempt_201201201144_0020_m_000000_1: java.lang.NullPointerException
    attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
    attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
    attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
    attempt_201201201144_0020_m_000000_1: at java.security.AccessController.doPrivileged(Native Method)
    attempt_201201201144_0020_m_000000_1: at javax.security.auth.Subject.doAs(Subject.java:396)
    attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.Child.main(Child.java:264)
    attempt_201201201144_0020_m_000000_1: INFO - Runnning cleanup for the task
    INFO - Task Id : attempt_201201201144_0020_m_000000_2, Status : FAILED
    java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

    attempt_201201201144_0020_m_000000_2: WARNING: The file terrier.properties was not found at location /home/cloudera/terrier-3.5/etc/terrier.properties
    attempt_201201201144_0020_m_000000_2: Assuming the value of terrier.home from the corresponding system property.
    attempt_201201201144_0020_m_000000_2: 0
    attempt_201201201144_0020_m_000000_2: WARN - Snappy native library is available
    attempt_201201201144_0020_m_000000_2: INFO - Snappy native library loaded
    attempt_201201201144_0020_m_000000_2: INFO - numReduceTasks: 26
    attempt_201201201144_0020_m_000000_2: INFO - io.sort.mb = 100
    attempt_201201201144_0020_m_000000_2: INFO - data buffer = 79691776/99614720
    attempt_201201201144_0020_m_000000_2: INFO - record buffer = 262144/327680
    attempt_201201201144_0020_m_000000_2: INFO - Reloading Application Setup
    attempt_201201201144_0020_m_000000_2: INFO - Checking memory usage every 20 maxDocPerFlush=0
    attempt_201201201144_0020_m_000000_2: INFO - Opening hdfs://localhost.localdomain:8020/user/cloudera/twitter-test.ljson.gz
    attempt_201201201144_0020_m_000000_2: INFO - Successfully loaded & initialized native-zlib library
    attempt_201201201144_0020_m_000000_2: INFO - Map task_201201201144_0020_m_000000, flush requested, containing 100 documents, flush 0
    attempt_201201201144_0020_m_000000_2: INFO - Map task_201201201144_0020_m_000000 finishing, indexed 100 in 0 flushes
    attempt_201201201144_0020_m_000000_2: WARN - Error running child
    attempt_201201201144_0020_m_000000_2: java.lang.NullPointerException
    attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
    attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
    attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
    attempt_201201201144_0020_m_000000_2: at java.security.AccessController.doPrivileged(Native Method)
    attempt_201201201144_0020_m_000000_2: at javax.security.auth.Subject.doAs(Subject.java:396)
    attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.Child.main(Child.java:264)
    attempt_201201201144_0020_m_000000_2: INFO - Runnning cleanup for the task
    INFO - Job complete: job_201201201144_0020
    INFO - Counters: 7
    INFO - Job Counters
    INFO - SLOTS_MILLIS_MAPS=31057
    INFO - Total time spent by all reduces waiting after reserving slots (ms)=0
    INFO - Total time spent by all maps waiting after reserving slots (ms)=0
    INFO - Launched map tasks=4
    INFO - Data-local map tasks=4
    INFO - SLOTS_MILLIS_REDUCES=0
    INFO - Failed map tasks=1
    INFO - Job Failed: NA
    ERROR - Problem running job
    java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1246)
    at org.terrier.applications.HadoopIndexing.main(HadoopIndexing.java:231)
    at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:376)
    at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:569)
    at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237)
    Time Taken = 39 seconds
    Time elapsed: 39.936 seconds.

    ----------------------------------------------------------------------

    The changes I made are as follows:

    ----------------------------------------------------------------------
    --- <unnamed>
    +++ <unnamed>
    @@ -28,6 +28,7 @@
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    +import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.zip.GZIPInputStream;
    @@ -81,6 +82,11 @@
    } catch (IOException ioe) {
    logger.error("IOException opening first file of collection - is the collection.spec correct?", ioe);
    }
    + }
    +
    + public TwitterJSONCollection(InputStream instream) {
    + currentTweetStream = new BufferedReader(new InputStreamReader(instream));
    + JSONStream = new JsonStreamParser(currentTweetStream);
    }

    public TwitterJSONCollection() {}
    @@ -187,7 +193,7 @@

    @Override
    public boolean nextDocument() {
    - if (FilesToProcess==null) init();
    +// if (JSONStream==null) init();
    if (JSONStream.hasNext()) {
    currentDocument = new TwitterJSONDocument(readTweet());
    return true;

    ---------------------------

    Thanks for your help. BTW, should I post this here or on the issue tracker?
  1. CompressingMetaIndexBuilder.java
    (22 kB)
    Richard McCreadie
    07/Feb/12 1:17 PM
  2. TwitterJSONCollection.java
    (8 kB)
    Richard McCreadie
    07/Feb/12 1:15 PM
  3. TwitterJSONDocument.java
    (23 kB)
    Richard McCreadie
    07/Feb/12 1:15 PM

Activity

Richard McCreadie added a comment - 07/Feb/12 1:15 PM

Patched TwitterJSONCollection and TwitterJSONDocument classes. Adds an InputStream constructor for TwitterJSONCollection. Minor fixes and improvements to the operation of TwitterJSONDocument.

Richard McCreadie added a comment - 07/Feb/12 1:17 PM

Minor change to CompressingMetaIndexBuilder to enable cropping of meta index entries.

Richard McCreadie added a comment - 07/Feb/12 1:29 PM

The Twitter collection class was not really designed for use with Hadoop, since you would need a huge number of tweets before parallelisation becomes necessary (indexing Tweets11 takes only a few hours on one machine). Nevertheless, I have attached a patched version for use with Hadoop. It adds the missing InputStream constructor for the collection that the Hadoop input format uses, along with the most up-to-date TwitterJSONDocument class. Confirmed working on Hadoop 0.20.2+228 over three machines. Finally, I have added a minor update to CompressingMetaIndexBuilder to enable cropping of over-large meta index entries. Set metaindex.compressed.cropEntries=true if you get an error like:

java.lang.IllegalArgumentException: Data for key text exceeds max byte length of 903(string length of 300). Crop in the Document, or increase indexer.meta.forward.keylens
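
As a rough illustration of what cropping an over-long meta index value amounts to, the sketch below truncates a string to a maximum character length before it would be stored. This is not the code in the attached CompressingMetaIndexBuilder: the class name MetaCropSketch and the method crop are hypothetical, and the limit of 300 characters is simply taken from the error message above.

```java
// Hypothetical sketch, not Terrier code: crop a meta value to a character limit.
class MetaCropSketch {

    /**
     * Returns the value unchanged if it fits within maxChars,
     * otherwise only its first maxChars characters.
     * Note: a plain substring can split a surrogate pair at the cut point.
     */
    public static String crop(String value, int maxChars) {
        if (value == null || value.length() <= maxChars) {
            return value;
        }
        return value.substring(0, maxChars);
    }

    public static void main(String[] args) {
        // Build a 350-character value, longer than the assumed 300-character limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 350; i++) {
            sb.append('x');
        }
        String cropped = crop(sb.toString(), 300);
        System.out.println(cropped.length());
    }
}
```

Cropping silently discards the tail of the value, so increasing indexer.meta.forward.keylens, as the error message suggests, is the alternative when the full text must be kept.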

The error reported above does not look like a Twitter-specific error. Is the configuration correct?
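
For readers unfamiliar with why an InputStream constructor matters here: a collection built this way reads from a stream it is handed rather than opening files from a collection.spec itself. The following is a minimal self-contained sketch under the assumption of one JSON tweet per line. LineStreamCollection and its methods are hypothetical names, and it is not the attached TwitterJSONCollection, which parses the stream with JsonStreamParser.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a collection that iterates over line-delimited
// documents read from an InputStream supplied by the caller.
class LineStreamCollection {
    private final BufferedReader reader;
    private String current; // most recently read document, or null

    public LineStreamCollection(InputStream in) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Advances to the next non-empty line; returns false at end of stream. */
    public boolean nextDocument() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    current = line;
                    return true;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        current = null;
        return false;
    }

    /** Returns the raw text of the current document, or null if there is none. */
    public String getDocument() {
        return current;
    }

    public static void main(String[] args) {
        byte[] data = "{\"id\":1}\n{\"id\":2}\n".getBytes(StandardCharsets.UTF_8);
        LineStreamCollection c = new LineStreamCollection(new ByteArrayInputStream(data));
        int count = 0;
        while (c.nextDocument()) {
            count++;
        }
        System.out.println(count);
    }
}
```

Because the reader is set up in the constructor, nextDocument() never has to fall back on a file-based init(), which is the step the reporter's diff commented out.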

Wen Li added a comment - 20/Feb/12 12:56 PM

Hi, thanks for your help! It fixed the problem!


People

Dates

  • Created:
    31/Jan/12 9:45 AM
    Updated:
    20/Feb/12 12:56 PM
    Resolved:
    20/Feb/12 12:56 PM