|
[Go home]
Overview
RankLib is a library of learning to rank algorithms. Currently eight popular algorithms have been implemented:
- MART (Multiple Additive Regression Trees, a.k.a. Gradient boosted regression tree) [6]
- RankNet [1]
- RankBoost [2]
- AdaRank [3]
- Coordinate Ascent [4]
- LambdaMART [5]
- ListNet [7]
- Random Forests [8]
- With appropriate parameters for Random Forests, it can also do bagging several MART/LambdaMART rankers.
It also implements many retrieval metrics as well as provides many ways to carry out evaluation.
Download (Source codes & Binary)
- RankLib-v2.1 [Download]
- Add ListNet
- Add Random Forests
- With little manual work, it can do BagBoo/Bagging LambaMART too.
- Default value for some parameter has been changed.
- Older versions [Here]
The binary is located in the "bin" directory and it should work right out of the box. If you want to re-build it, you need to have ant installed on your machine. Open the terminal, go to the RankLib directory and type "ant" and you're good to go.
Mailing list
If you find RankLib useful and want to receive email notices when a new version / bug fix comes out, please subscribe to this Google groups mailing list.
This list is used strictly for announcement only, so you won't be seeing any discussion/spam emails.
If you are not using gmail, please email me using your
favorite email account saying you want to know about RankLib updates, I will keep you posted.
How to use
Open the terminal and type in "java -jar bin/RankLib.jar" under the RankLib directory, you will see all necessary parameters.
Parameters in the square bracket are optional with their default value shown at the end of the line.
| Usage: java -jar RankLib.jar <Params> |
| Params: |
| [+] Training (+ tuning and evaluation) |
| |
-train <file> |
Training data |
| |
-ranker <type> |
Specify which ranking algorithm to use |
| |
|
0: MART (gradient boosted regression tree) |
| |
|
1: RankNet |
| |
|
2: RankBoost |
| |
|
3: AdaRank |
| |
|
4: Coordinate Ascent |
| |
|
6: LambdaMART |
| |
|
7: ListNet |
| |
|
8: Random Forests |
| |
[ -feature <file> ] |
Feature description file: list features to be considered by the learner, each on a separate line. If not specified, all features will be used. |
| |
[ -metric2t <metric> ] |
Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10) |
| |
[ -metric2T <metric> ] |
Metric to evaluate on the test data (default to the same as specified for -metric2t) |
| |
[ -gmax <label> ] |
Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4}) |
| |
[ -test <file> ] |
Specify if you want to evaluate the trained model on this data (default=unspecified) |
| |
[ -validate <file> ] |
Specify if you want to tune your system on the validation data (default=unspecified). If specified, the final model will be the one that performs best on the validation data |
| |
[ -tvs <x \in [0..1]> ] |
Set train-validation split to be (x)(1.0-x) |
| |
[ -tts <x \in [0..1]> ] |
Set train-test split to be (x)(1.0-x). -tts will override -tvs |
| |
[ -kcv <k> ] |
Specify if you want to perform k-fold cross validation using ONLY the specified training data (default=NoCV) |
| |
[ -norm <method>] |
Normalize feature vectors (default=no-normalization). Method can be: |
| |
|
sum: normalize each feature by the sum of all its values |
| |
|
zscore: normalize each feature by its mean/standard deviation |
| |
[ -save <model> ] |
Save the learned model to the specified file (default=not-save) |
| |
[ -silent ] |
Do not print progress messages (which are printed by default) |
| [-] RankNet-specific parameters |
| |
[ -epoch <T> ] |
The number of epochs to train (default=100) |
| |
[ -layer <layer> ] |
The number of hidden layers (default=1) |
| |
[ -node <node> ] |
The number of hidden nodes per layer (default=10) |
| |
[ -lr <rate> ] |
Learning rate (default=0.00005) |
| [-] RankBoost-specific parameters |
| |
[ -round <T> ] |
The number of rounds to train (default=300) |
| |
[ -tc <k> ] |
Number of threshold candidates to search. -1 to use all feature values (default=10) |
| [-] AdaRank-specific parameters |
| |
[ -round <T> ] |
The number of rounds to train (default=500) |
| |
[ -noeq ] |
Train without enqueuing too-strong features (default=unspecified) |
| |
[ -tolerance <t> ] |
Tolerance between two consecutive rounds of learning (default=0.002) |
| |
[ -max <times> ] |
The maximum number of times can a feature be consecutively selected without changing performance (default=5) |
| [-] Coordinate Ascent-specific parameters |
| |
[ -r <k> ] |
The number of random restarts (default=2) |
| |
[ -i <iteration> ] |
The number of iterations to search in each dimension (default=25) |
| |
[ -tolerance <t> ] |
Performance tolerance between two solutions (default=0.001) |
| |
[ -reg <slack> ] |
Regularization parameter (default=no-regularization) |
| [-] {MART, LambdaMART}-specific parameters |
| |
[ -tree <t> ] |
Number of trees (default=1000) |
| |
[ -leaf <l> ] |
Number of leaves for each tree (default=10) |
| |
[ -shrinkage <factor> ] |
Shrinkage, or learning rate (default=0.1) |
| |
[ -tc <k> ] |
Number of threshold candidates for tree spliting. -1 to use all feature values (default=256) |
| |
[ -mls <n> ] |
Min leaf support -- minimum #samples each leaf has to contain (default=1) |
| |
[ -estop <e> ] |
Stop early when no improvement is observed on validaton data in e consecutive rounds (default=100) |
| [-] ListNet-specific parameters |
| |
[ -epoch <T> ] |
The number of epochs to train (default=1500) |
| |
[ -lr <rate> ] |
Learning rate (default=0.00001) |
| [-] Random Forests-specific parameters |
| |
[ -bag <b> ] |
Number of bags (default=300) |
| |
[ -srate <r> ] |
Sub-sampling rate (default=1.0) |
| |
[ -frate <r> ] |
Feature sampling rate (default=0.3) |
| |
[ -rtype <type> ] |
Ranker to bag (default=0, i.e. MART) |
| |
[ -tree <t> ] |
Number of trees (default=1) |
| |
[ -leaf <l> ] |
Number of leaves for each tree (default=100) |
| |
[ -shrinkage <factor> ] |
Shrinkage, or learning rate (default=0.1) |
| |
[ -tc <k> ] |
Number of threshold candidates for tree spliting. -1 to use all feature values (default=256) |
| |
[ -mls <n> ] |
Min leaf support -- minimum #samples each leaf has to contain (default=1) |
| [+] Testing previously saved models |
| |
-load <model> |
The model to load |
| |
-test <file> |
Test data to evaluate the model (specify either this or -rank but not both) |
| |
-rank <file> |
Rank the samples in the specified file (specify either this or -test but not both) |
| |
[ -metric2T <metric> ] |
Metric to evaluate on the test data (default=ERR@10) |
| |
[ -gmax <label> ] |
Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4}) |
| |
[ -score <file> ] |
Store ranker's score for each object being ranked (has to be used with -rank) |
| |
[ -idv ] |
Print model performance (in test metric) on individual ranked lists (has to be used with -test) |
| |
[ -norm ] |
Normalize feature vectors (similar to -norm for training/tuning) |
File format
The file format of the training and test and validation files is the same as for SVM-Rank. This is also the format used in LETOR datasets. Each of the following lines represents one training example and is of the following format:
<line> .=. <target> qid:<qid> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. <float>
<qid> .=. <positive integer>
<feature> .=. <positive integer>
<value> .=. <float>
<info> .=. <string>
Here's an example: (taken from the SVM-Rank website). Note that everything after "#" are ignored.
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D
Examples
Go to the LETOR website and download any of their datasets. For instance, let's pick MQ2008 from the LETOR 4.0 dataset. Suppose we put it under the RankLib directory. Type this into the command-line:
java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -test MQ2008/Fold1/test.txt -validate MQ2008/Fold1/vali.txt -ranker 6 -metric2t NDCG@10 -metric2T ERR@10
What we specified means we want to train a LambdaMART ranker: train on the training data and record the model that performs best on the validation data. The training metric is NDCG@10. After training is completed, evaluate the trained model on the test data in ERR@10.
Important things you should know
The parameter -validate is optional but it often leads to better models. In particular, -validate is very important for RankNet/MART/LambdaMART. Coordinate Ascent, on the other hand, works pretty well without validation data. As a result, my implementation of Coordinate Ascent completely ignores the validation data to reduce training time (this is the only exception in RankLib).
-metric2t (e.g. NDCG, ERR, etc) only applies to list-wise algorithms (AdaRank, Coordinate Ascent and LambdaMART). Point-wise and pair-wise techniques (MART, RankNet, RankBoost, Random Forests), due to their nature, always use their internal RMSE / pair-wise loss as the optimization criteria. Thus, -metric2t has no effects on them. ListNet, despite being a list-wise method, optimizes for its internal list-based metric. As a result, -metric2t has no effects on it either.
License
RankLib is available under BSD license.
References
[1] C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton and G. Hullender. Learning to rank using gradient descent. In Proc. of ICML, pages 89-96, 2005.
[2] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4: 933-969, 2003.
[3] J. Xu and H. Li. AdaRank: a boosting algorithm for information retrieval. In Proc. of SIGIR, pages 391-398, 2007.
[4] D. Metzler and W.B. Croft. Linear feature-based models for information retrieval. Information Retrieval, 10(3): 257-274, 2000.
[5] Q. Wu, C.J.C. Burges, K. Svore and J. Gao. Adapting Boosting for Information Retrieval Measures. Journal of Information Retrieval, 2007.
[6] J.H. Friedman. Greedy function approximation: A gradient boosting machine. Technical Report, IMS Reitz Lecture, Stanford, 1999; see also Annals of Statistics, 2001.
[7] Z. Cao, T. Qin, T.Y. Liu, M.F. Tsai, and H. Li. Learning to Rank: From Pairwise Approach to Listwise Approach. ICML 2007
[8] L. Breiman. Random Forests. Machine Learning 45 (1): 5–32, 2001.
Feedback
Any question/suggestion/bug, please email me at "vdang at cs.umass.edu".
|