Examples of using Terrier to index TREC collections: WT2G & Blogs06

TREC WT2G Collection

Below, we give an example of using Terrier to index WT2G, a standard TREC test collection, and then to run and evaluate retrieval on it. We assume that the operating system is Linux, and that the collection, along with the topics and the relevance assessments, is stored in the directory /local/collections/WT2G/.

#goto the terrier folder
cd terrier

#get terrier setup for using a trec collection
bin/trec_setup.sh /local/collections/WT2G/

#rebuild the collection.spec file correctly
find /local/collections/WT2G/ -type f | sort |grep -v info > etc/collection.spec

#use In_expB2 DFR model for querying
echo uk.ac.gla.terrier.matching.models.In_expB2 > etc/trec.models

#use this file for the topics
echo /local/collections/WT2G/info/topics.401-450.gz >> etc/trec.topics.list

#use this file for query relevance assessments
echo /local/collections/WT2G/info/qrels.trec8.small_web.gz >> etc/trec.qrels

#index the collection
bin/trec_terrier.sh -i

#add the language modelling indices for PonteCroft
bin/trec_terrier.sh -i -l

#run the topics, with suggested c value 10.99 
bin/trec_terrier.sh -r -c 10.99
#run topics again with query expansion enabled
bin/trec_terrier.sh -r -q -c 10.99
#run topics again, using PonteCroft language modelling instead of statistical models
bin/trec_terrier.sh -r -l

#evaluate the results in var/results/
bin/trec_terrier.sh -e

#display the Mean Average Precision
tail -1 var/results/*.eval
#MAP should be 
#In_expB2 Average Precision: 0.3160

TREC Blogs06 Collection

This guide provides a step-by-step example of how to use Terrier for indexing, retrieval and evaluation. We use the TREC Blogs06 test collection, along with the corresponding topics and qrels from the TREC 2006 Blog track. We assume that these are stored in the directory /local/collections/Blog06/.

Indexing

In the Terrier folder, use trec_setup.sh to generate a collection.spec for indexing the collection:

[user@machine terrier]$ ./bin/trec_setup.sh /local/collections/Blog06/
[user@machine terrier]$ find /local/collections/Blog06/ -type f \
	| grep 'permalinks-' | sort > etc/collection.spec

This creates a collection.spec file in the etc directory, containing a list of the files in the /local/collections/Blog06/ directory. At this stage, you should check etc/collection.spec to ensure that it only contains files that should be indexed, and that they are sorted (i.e. 20051206/permalinks-000.gz is the first file).
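
The check described above can be scripted. The following is a minimal sketch, not part of Terrier: check_spec is a hypothetical helper that assumes every file to be indexed has 'permalinks-' in its name, and that the list was produced with the same sort locale as the one used for checking.

```shell
# Hypothetical helper (not shipped with Terrier): sanity-check a collection.spec file.
check_spec() {
    spec="$1"
    # Any non-empty line that does not name a permalinks file is unexpected.
    if grep -v 'permalinks-' "$spec" | grep -q .; then
        echo "unexpected entries"
        return 1
    fi
    # sort -c exits non-zero if the input is not already in sorted order.
    if ! sort -c "$spec" 2>/dev/null; then
        echo "not sorted"
        return 1
    fi
    echo "ok"
}

# Example call, from the Terrier folder:
# check_spec etc/collection.spec && head -n 1 etc/collection.spec
```

If the helper prints "ok", the first line shown by head should be the 20051206/permalinks-000.gz file.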

The TREC Blogs06 collection differs from other TREC collections in that not all tags should be indexed. For this reason, you should configure the parser in TRECCollection not to process these tags. Set the following properties in your etc/terrier.properties file:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.skip=DOCHDR,DATE_XML,FEEDNO,BLOGHPNO,BLOGHPURL,PERMALINK
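
Properties such as these can be added with a text editor or by appending to the file, but appending the same property twice leaves duplicate entries. As a minimal sketch, the hypothetical helper set_property below (not part of Terrier) replaces any earlier definition of a key before appending the new one. It assumes plain key=value lines with no spaces around the equals sign.

```shell
# Hypothetical helper (not shipped with Terrier): set key=value in a properties file,
# replacing an earlier definition of the same key if there is one.
set_property() {
    file="$1"
    key="$2"
    value="$3"
    touch "$file"
    # Keep every line whose key (the text before the first '=') differs.
    awk -F= -v k="$key" '$1 != k' "$file" > "$file.tmp"
    # Append the new definition and swap the rewritten file into place.
    printf '%s=%s\n' "$key" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example calls, from the Terrier folder:
# set_property etc/terrier.properties TrecDocTags.doctag DOC
# set_property etc/terrier.properties TrecDocTags.idtag DOCNO
```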

Finally, the DOCNOs in the TREC Blogs06 collection are 30 characters long, more than the default of 20 characters in Terrier. To deal with this, set the property docno.byte.length to 30 in your terrier.properties:

[user@machine terrier]$ echo docno.byte.length=30 >> etc/terrier.properties
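
To see what DOCNO length a collection file actually needs, the longest DOCNO can be measured directly. This is a minimal sketch under stated assumptions: max_docno_length is a hypothetical helper, each DOCNO element is assumed to sit on a single line, and DOCNOs are assumed to be plain ASCII so that characters and bytes coincide.

```shell
# Hypothetical helper: read TREC-formatted text on stdin and print the
# length of the longest DOCNO found (0 if there is none).
max_docno_length() {
    # Extract the text between <DOCNO> and </DOCNO>, trimming surrounding whitespace.
    sed -n 's#.*<DOCNO>[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*</DOCNO>.*#\1#p' |
        awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Example call on the first file of the collection:
# gzip -dc /local/collections/Blog06/20051206/permalinks-000.gz | max_docno_length
```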

Now you are ready to start indexing the collection.

[user@machine terrier]$ ./bin/trec_terrier.sh -i
Setting TERRIER_HOME to /local/terrier
INFO - TRECCollection read collection specification
INFO - Processing /local/collections/Blog06/20051206/permalinks-000.gz
INFO - creating the data structures data_1
INFO - Processing /local/collections/Blog06/20051206/permalinks-001.gz
INFO - Processing /local/collections/Blog06/20051206/permalinks-002.gz
DEBUG - flushing lexicon
<snip>

Indexing takes a fair amount of time, even on a modern machine. Expect indexing time to double if block indexing is enabled.

Retrieval

Once the index is built, we can do retrieval using the index, following the steps described below.

First, tell Terrier the location of the topics and relevance assessments (qrels).

[user@machine terrier]$ echo /local/collections/Blog06/06.topics.851-900 >> etc/trec.topics.list
[user@machine terrier]$ echo /local/collections/Blog06/qrels.blog06 >> etc/trec.qrels
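
A mistyped path in either file is easy to miss until retrieval or evaluation fails. The following minimal sketch checks that every path listed in such a file is readable. check_listed_files is a hypothetical helper, and skipping blank lines and lines starting with '#' is an assumption about the file format.

```shell
# Hypothetical helper: report every path listed in a file that cannot be read.
check_listed_files() {
    list="$1"
    status=0
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r path || [ -n "$path" ]; do
        case "$path" in
            ''|'#'*) continue ;;   # assumed: blank lines and comments are ignored
        esac
        if [ ! -r "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done < "$list"
    return "$status"
}

# Example calls, from the Terrier folder:
# check_listed_files etc/trec.topics.list
# check_listed_files etc/trec.qrels
```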

Next, we should specify the retrieval weighting model that we want to use. In this case we will use the DFR model called PL2 for ranking documents.

echo uk.ac.gla.terrier.matching.models.PL2 > etc/trec.models

Now we are ready to start retrieval. We use the -c option to set the parameter of the weighting model to 1. Terrier takes each query (called a topic) from the specified topics file, runs it against the index, and saves the results to a file in the var/results folder, with a name similar to PL2c1.0_0.res.

[user@machine terrier]$ ./bin/trec_terrier.sh -r -c 1
Setting TERRIER_HOME to /local/terrier
INFO - 900 : mcdonalds
INFO - Processing query: 900
DEBUG - weighting model: PL2c1.0
DEBUG - 1: mcdonald with 23965 documents (TF is 37855).
DEBUG - Number of docs with +ve score: 23590
DEBUG - number of retrieved documents: 1000
DEBUG - No filters, just Crop: 0, length1000
DEBUG - Resultset is now 1000 long
INFO - Time to process query: 0.157
<snip>
INFO - Finished topics, executed 50 queries in 27 seconds, results written to 
	terrier/var/results/PL2c1.0_0.res
Time elapsed: 40.57 seconds.
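
A quick way to inspect a run is to count the documents retrieved for each topic. The sketch below assumes the .res file uses the usual TREC run layout, in which each line holds whitespace-separated columns and the first column is the topic number. docs_per_topic is a hypothetical helper, not a Terrier command.

```shell
# Hypothetical helper: print "topic count" for each topic in a run file.
docs_per_topic() {
    # Column 1 of each line is taken to be the topic number (assumed TREC run format).
    awk '{ count[$1]++ } END { for (t in count) print t, count[t] }' "$1" | sort -n
}

# Example call, from the Terrier folder:
# docs_per_topic var/results/PL2c1.0_0.res
```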

Evaluation

We can now evaluate the retrieval performance of the generated run using the qrels specified earlier:

[user@machine terrier]$ ./bin/trec_terrier.sh -e
Setting TERRIER_HOME to /local/terrier
INFO - Evaluating result file: /local/terrier/var/results/PL2c1.0_0.res
Average Precision: 0.2703
Time elapsed: 3.177 seconds.

Note that more evaluation measures are stored in the file var/results/PL2c1.0_0.eval.
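
When several runs have been evaluated, their Average Precision values can be listed side by side. This is a minimal sketch: summarise_map is a hypothetical helper, and it assumes that the last line of each .eval file ends with the Average Precision value.

```shell
# Hypothetical helper: list "value<TAB>run name" for every .eval file in a
# directory, highest value first.
summarise_map() {
    for f in "$1"/*.eval; do
        # Skip the unexpanded pattern when the directory has no .eval files.
        [ -e "$f" ] || continue
        # Assumed: the last field of the last line is the Average Precision.
        map=$(tail -n 1 "$f" | awk '{ print $NF }')
        printf '%s\t%s\n' "$map" "$(basename "$f" .eval)"
    done | LC_ALL=C sort -rn
}

# Example call, from the Terrier folder:
# summarise_map var/results
```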