Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 3.5
- Fix Version/s: None
- Component/s: .indexing
- Description:
Hi,
I am trying to use Terrier on Hadoop to index the Tweet11 collection using the plug-in TR-171. However, it doesn't work out of the box. It seems that TwitterJSONCollection lacks a constructor that takes an InputStream. I wrote such a constructor, but there are still other problems. I am running Terrier 3.5 on a pre-built Cloudera VM. The error output is as follows:
---------------------------------------------------------------
Setting TERRIER_HOME to /home/cloudera/terrier-3.5
INFO - Term-partitioned Mode, 26 reducers creating one inverted index.
INFO - Copying terrier share/ directory (/home/cloudera/terrier-3.5/share) to shared storage area (hdfs://localhost.localdomain/tmp/702454286-terrier.share)
INFO - Copying classpath to job
INFO - Put indices into: hdfs://localhost.localdomain:8020/user/cloudera/index/tweet11
WARN - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
INFO - Allocating 1 files across 1 map tasks
INFO - Running job: job_201201201144_0020
INFO - map 0% reduce 0%
INFO - Task Id : attempt_201201201144_0020_m_000000_0, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201201144_0020_m_000000_0: WARNING: The file terrier.properties was not found at location /home/cloudera/terrier-3.5/etc/terrier.properties
attempt_201201201144_0020_m_000000_0: Assuming the value of terrier.home from the corresponding system property.
attempt_201201201144_0020_m_000000_0: 0
attempt_201201201144_0020_m_000000_0: WARN - Snappy native library is available
attempt_201201201144_0020_m_000000_0: INFO - Snappy native library loaded
attempt_201201201144_0020_m_000000_0: INFO - numReduceTasks: 26
attempt_201201201144_0020_m_000000_0: INFO - io.sort.mb = 100
attempt_201201201144_0020_m_000000_0: INFO - data buffer = 79691776/99614720
attempt_201201201144_0020_m_000000_0: INFO - record buffer = 262144/327680
attempt_201201201144_0020_m_000000_0: INFO - Reloading Application Setup
attempt_201201201144_0020_m_000000_0: INFO - Checking memory usage every 20 maxDocPerFlush=0
attempt_201201201144_0020_m_000000_0: INFO - Opening hdfs://localhost.localdomain:8020/user/cloudera/twitter-test.ljson.gz
attempt_201201201144_0020_m_000000_0: INFO - Successfully loaded & initialized native-zlib library
attempt_201201201144_0020_m_000000_0: INFO - Map task_201201201144_0020_m_000000, flush requested, containing 100 documents, flush 0
attempt_201201201144_0020_m_000000_0: INFO - Map task_201201201144_0020_m_000000 finishing, indexed 100 in 0 flushes
attempt_201201201144_0020_m_000000_0: WARN - Error running child
attempt_201201201144_0020_m_000000_0: java.lang.NullPointerException
attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
attempt_201201201144_0020_m_000000_0: at java.security.AccessController.doPrivileged(Native Method)
attempt_201201201144_0020_m_000000_0: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
attempt_201201201144_0020_m_000000_0: at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201201144_0020_m_000000_0: INFO - Runnning cleanup for the task
INFO - Task Id : attempt_201201201144_0020_m_000000_1, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201201144_0020_m_000000_1: WARNING: The file terrier.properties was not found at location /home/cloudera/terrier-3.5/etc/terrier.properties
attempt_201201201144_0020_m_000000_1: Assuming the value of terrier.home from the corresponding system property.
attempt_201201201144_0020_m_000000_1: 0
attempt_201201201144_0020_m_000000_1: WARN - Snappy native library is available
attempt_201201201144_0020_m_000000_1: INFO - Snappy native library loaded
attempt_201201201144_0020_m_000000_1: INFO - numReduceTasks: 26
attempt_201201201144_0020_m_000000_1: INFO - io.sort.mb = 100
attempt_201201201144_0020_m_000000_1: INFO - data buffer = 79691776/99614720
attempt_201201201144_0020_m_000000_1: INFO - record buffer = 262144/327680
attempt_201201201144_0020_m_000000_1: INFO - Reloading Application Setup
attempt_201201201144_0020_m_000000_1: INFO - Checking memory usage every 20 maxDocPerFlush=0
attempt_201201201144_0020_m_000000_1: INFO - Opening hdfs://localhost.localdomain:8020/user/cloudera/twitter-test.ljson.gz
attempt_201201201144_0020_m_000000_1: INFO - Successfully loaded & initialized native-zlib library
attempt_201201201144_0020_m_000000_1: INFO - Map task_201201201144_0020_m_000000, flush requested, containing 100 documents, flush 0
attempt_201201201144_0020_m_000000_1: INFO - Map task_201201201144_0020_m_000000 finishing, indexed 100 in 0 flushes
attempt_201201201144_0020_m_000000_1: WARN - Error running child
attempt_201201201144_0020_m_000000_1: java.lang.NullPointerException
attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
attempt_201201201144_0020_m_000000_1: at java.security.AccessController.doPrivileged(Native Method)
attempt_201201201144_0020_m_000000_1: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
attempt_201201201144_0020_m_000000_1: at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201201144_0020_m_000000_1: INFO - Runnning cleanup for the task
INFO - Task Id : attempt_201201201144_0020_m_000000_2, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201201144_0020_m_000000_2: WARNING: The file terrier.properties was not found at location /home/cloudera/terrier-3.5/etc/terrier.properties
attempt_201201201144_0020_m_000000_2: Assuming the value of terrier.home from the corresponding system property.
attempt_201201201144_0020_m_000000_2: 0
attempt_201201201144_0020_m_000000_2: WARN - Snappy native library is available
attempt_201201201144_0020_m_000000_2: INFO - Snappy native library loaded
attempt_201201201144_0020_m_000000_2: INFO - numReduceTasks: 26
attempt_201201201144_0020_m_000000_2: INFO - io.sort.mb = 100
attempt_201201201144_0020_m_000000_2: INFO - data buffer = 79691776/99614720
attempt_201201201144_0020_m_000000_2: INFO - record buffer = 262144/327680
attempt_201201201144_0020_m_000000_2: INFO - Reloading Application Setup
attempt_201201201144_0020_m_000000_2: INFO - Checking memory usage every 20 maxDocPerFlush=0
attempt_201201201144_0020_m_000000_2: INFO - Opening hdfs://localhost.localdomain:8020/user/cloudera/twitter-test.ljson.gz
attempt_201201201144_0020_m_000000_2: INFO - Successfully loaded & initialized native-zlib library
attempt_201201201144_0020_m_000000_2: INFO - Map task_201201201144_0020_m_000000, flush requested, containing 100 documents, flush 0
attempt_201201201144_0020_m_000000_2: INFO - Map task_201201201144_0020_m_000000 finishing, indexed 100 in 0 flushes
attempt_201201201144_0020_m_000000_2: WARN - Error running child
attempt_201201201144_0020_m_000000_2: java.lang.NullPointerException
attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:94)
attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:260)
attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
attempt_201201201144_0020_m_000000_2: at java.security.AccessController.doPrivileged(Native Method)
attempt_201201201144_0020_m_000000_2: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
attempt_201201201144_0020_m_000000_2: at org.apache.hadoop.mapred.Child.main(Child.java:264)
attempt_201201201144_0020_m_000000_2: INFO - Runnning cleanup for the task
INFO - Job complete: job_201201201144_0020
INFO - Counters: 7
INFO - Job Counters
INFO - SLOTS_MILLIS_MAPS=31057
INFO - Total time spent by all reduces waiting after reserving slots (ms)=0
INFO - Total time spent by all maps waiting after reserving slots (ms)=0
INFO - Launched map tasks=4
INFO - Data-local map tasks=4
INFO - SLOTS_MILLIS_REDUCES=0
INFO - Failed map tasks=1
INFO - Job Failed: NA
ERROR - Problem running job
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1246)
at org.terrier.applications.HadoopIndexing.main(HadoopIndexing.java:231)
at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:376)
at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:569)
at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:237)
Time Taken = 39 seconds
Time elapsed: 39.936 seconds.
----------------------------------------------------------------------
The changes I made are as follows:
----------------------------------------------------------------------
--- <unnamed>
+++ <unnamed>
@@ -28,6 +28,7 @@
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
+import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;
@@ -81,6 +82,11 @@
} catch (IOException ioe) {
logger.error("IOException opening first file of collection - is the collection.spec correct?", ioe);
}
+ }
+
+ public TwitterJSONCollection(InputStream instream) {
+ currentTweetStream = new BufferedReader(new InputStreamReader(instream));
+ JSONStream = new JsonStreamParser(currentTweetStream);
}
public TwitterJSONCollection() {}
@@ -187,7 +193,7 @@
@Override
public boolean nextDocument() {
- if (FilesToProcess==null) init();
+// if (JSONStream==null) init();
if (JSONStream.hasNext()) {
currentDocument = new TwitterJSONDocument(readTweet());
return true;
---------------------------
Thanks for your help. BTW, should I post this here or on the issue tracker?
Attachments
-
- CompressingMetaIndexBuilder.java
- (22 kB)
- Richard McCreadie
- 07/Feb/12 1:17 PM
-
- TwitterJSONCollection.java
- (8 kB)
- Richard McCreadie
- 07/Feb/12 1:15 PM
-
- TwitterJSONDocument.java
- (23 kB)
- Richard McCreadie
- 07/Feb/12 1:15 PM
Activity
Minor change to CompressingMetaIndexBuilder to enable cropping of meta index entries.
The Twitter collection class was not really designed for use with Hadoop, since you would need a huge number of tweets before parallelisation becomes necessary (indexing Tweets11 takes only a few hours on one machine). Nevertheless, I have attached a patched version for use with Hadoop. It adds the missing InputStream collection constructor that the Hadoop InputFormat uses. I have also attached the most up-to-date TwitterJSONDocument class. Confirmed working on Hadoop 0.20.2+228 over three machines. Finally, I have added a minor update to CompressingMetaIndexBuilder to enable cropping of over-large meta index entries. Set metaindex.compressed.cropEntries=true if you get an error like:
java.lang.IllegalArgumentException: Data for key text exceeds max byte length of 903(string length of 300). Crop in the Document, or increase indexer.meta.forward.keylens
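To make the idea of cropping concrete, here is a minimal Java sketch of truncating a metadata value to a maximum string length before it is stored. It is not the code in the attached CompressingMetaIndexBuilder; the class name `MetaCropSketch` and the method `crop` are hypothetical, and the limit of 300 is simply taken from the error message above. In Terrier itself the behaviour is switched on through the property, not through a helper like this.

```java
/** Hypothetical illustration of cropping an over-long metadata value. */
final class MetaCropSketch {

    /**
     * Returns value unchanged if it fits, otherwise its first maxChars
     * characters. If the cut would fall between the two halves of a
     * surrogate pair, one more character is dropped so the result
     * stays valid text.
     */
    static String crop(String value, int maxChars) {
        if (value.length() <= maxChars) {
            return value;
        }
        int end = maxChars;
        if (end > 0 && Character.isHighSurrogate(value.charAt(end - 1))) {
            end--; // do not leave a dangling high surrogate
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        // Assumed limit: the "string length of 300" from the error message.
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 400; i++) {
            longText.append('x');
        }
        String cropped = crop(longText.toString(), 300);
        System.out.println(cropped.length());
    }
}
```

The alternative named in the error message, increasing indexer.meta.forward.keylens, keeps the full text at the cost of a larger meta index.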
The error reported above does not look like a Twitter-specific error. Is the configuration correct?
Patched TwitterJSONCollection and TwitterJSONDocument classes. Adds an InputStream constructor for TwitterJSONCollection. Minor fixes and improvements to the operation of TwitterJSONDocument.
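For readers without the attachments, the following is a minimal, self-contained Java sketch of what a collection with an InputStream constructor can look like. It is not the attached TwitterJSONCollection: the class name `StreamTweetCollectionSketch` is hypothetical, it treats each non-blank line as one tweet instead of using a JSON stream parser, and it assumes UTF-8 input. It also shows one way to guard `nextDocument()` when the no-argument constructor was used, since the diff in the description comments out the `init()` check and would then dereference a null stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a collection that reads one JSON tweet per line. */
final class StreamTweetCollectionSketch {
    private BufferedReader reader; // stays null if no stream was supplied
    private String currentTweet;   // raw JSON text of the current document

    /** No-argument constructor, as in the patch; no input is attached. */
    StreamTweetCollectionSketch() {}

    /** Wraps a stream that the caller (for example a Hadoop record reader) has already opened. */
    StreamTweetCollectionSketch(InputStream in) {
        this.reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Moves to the next non-blank line; false at end of input or when no stream is attached. */
    boolean nextDocument() {
        if (reader == null) {
            return false; // guard instead of a NullPointerException
        }
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    currentTweet = line;
                    return true;
                }
            }
            currentTweet = null;
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the raw text of the current document, or null if there is none. */
    String getDocument() {
        return currentTweet;
    }

    public static void main(String[] args) {
        byte[] data = "{\"id\":1}\n\n{\"id\":2}\n".getBytes(StandardCharsets.UTF_8);
        StreamTweetCollectionSketch c =
                new StreamTweetCollectionSketch(new ByteArrayInputStream(data));
        while (c.nextDocument()) {
            System.out.println(c.getDocument());
        }
    }
}
```

Passing an already-opened stream leaves decompression and file handling to the caller, which is presumably why a Hadoop input format needs this constructor in addition to the file-based one.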