[LTR] Introduction to the RankLib.jar Package (白红宇's personal blog)

Published: 2021-05-09 05:15:32 | Category: Technical articles

I. Introduction

RankLib.jar is a library of learning-to-rank algorithms. It currently implements the following:

MART

RankNet

RankBoost

AdaRank

Coordinate Ascent

LambdaMART

ListNet

Random Forests

Linear regression

II. Usage

Usage: java -jar RankLib.jar <Params>
Params:
  [+] Training (+ tuning and evaluation)
        -train <file>              Training data
        -ranker <type>             Specify which ranking algorithm to use
                                   0: MART (gradient boosted regression tree)
                                   1: RankNet
                                   2: RankBoost
                                   3: AdaRank
                                   4: Coordinate Ascent
                                   6: LambdaMART
                                   7: ListNet
                                   8: Random Forests
                                   9: Linear regression (L2 regularization)
        [ -feature <file> ]        Feature description file: list features to be considered by the learner, each on a separate line
                                   If not specified, all features will be used.
        [ -metric2t <metric> ]     Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10)
        [ -gmax <label> ]          Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4})
        [ -silent ]                Do not print progress messages (which are printed by default)
        [ -validate <file> ]       Specify if you want to tune your system on the validation data (default=unspecified)
                                   If specified, the final model will be the one that performs best on the validation data
        [ -tvs <x \in [0..1]> ]    If you don't have separate validation data, use this to set train-validation split to be (x)(1.0-x)
        [ -save <model> ]          Save the model learned (default=not-save)
        [ -test <file> ]           Specify if you want to evaluate the trained model on this data (default=unspecified)
        [ -tts <x \in [0..1]> ]    Set train-test split to be (x)(1.0-x). -tts will override -tvs
        [ -metric2T <metric> ]     Metric to evaluate on the test data (default to the same as specified for -metric2t)
        [ -norm <method> ]         Normalize all feature vectors (default=no-normalization). Method can be:
                                   sum: normalize each feature by the sum of all its values
                                   zscore: normalize each feature by its mean/standard deviation
                                   linear: normalize each feature by its min/max values
        [ -kcv <k> ]               Specify if you want to perform k-fold cross validation using the specified training data (default=NoCV)
                                   -tvs can be used to further reserve a portion of the training data in each fold for validation
        [ -kcvmd <dir> ]           Directory for models trained via cross-validation (default=not-save)
        [ -kcvmn <model> ]         Name for model learned in each fold. It will be prefix-ed with the fold-number (default=empty)

    [-] RankNet-specific parameters
        [ -epoch <T> ]             The number of epochs to train (default=100)
        [ -layer <layer> ]         The number of hidden layers (default=1)
        [ -node <node> ]           The number of hidden nodes per layer (default=10)
        [ -lr <rate> ]             Learning rate (default=0.00005)

    [-] RankBoost-specific parameters
        [ -round <T> ]             The number of rounds to train (default=300)
        [ -tc <k> ]                Number of threshold candidates to search. -1 to use all feature values (default=10)

    [-] AdaRank-specific parameters
        [ -round <T> ]             The number of rounds to train (default=500)
        [ -noeq ]                  Train without enqueuing too-strong features (default=unspecified)
        [ -tolerance <t> ]         Tolerance between two consecutive rounds of learning (default=0.002)
        [ -max <times> ]           The maximum number of times can a feature be consecutively selected without changing performance (default=5)

    [-] Coordinate Ascent-specific parameters
        [ -r <k> ]                 The number of random restarts (default=5)
        [ -i <iteration> ]         The number of iterations to search in each dimension (default=25)
        [ -tolerance <t> ]         Performance tolerance between two solutions (default=0.001)
        [ -reg <slack> ]           Regularization parameter (default=no-regularization)

    [-] {MART, LambdaMART}-specific parameters
        [ -tree <t> ]              Number of trees (default=1000)
        [ -leaf <l> ]              Number of leaves for each tree (default=10)
        [ -shrinkage <factor> ]    Shrinkage, or learning rate (default=0.1)
        [ -tc <k> ]                Number of threshold candidates for tree spliting. -1 to use all feature values (default=256)
        [ -mls <n> ]               Min leaf support -- minimum #samples each leaf has to contain (default=1)
        [ -estop <e> ]             Stop early when no improvement is observed on validaton data in e consecutive rounds (default=100)

    [-] ListNet-specific parameters
        [ -epoch <T> ]             The number of epochs to train (default=1500)
        [ -lr <rate> ]             Learning rate (default=0.00001)

    [-] Random Forests-specific parameters
        [ -bag <r> ]               Number of bags (default=300)
        [ -srate <r> ]             Sub-sampling rate (default=1.0)
        [ -frate <r> ]             Feature sampling rate (default=0.3)
        [ -rtype <type> ]          Ranker to bag (default=0, i.e. MART)
        [ -tree <t> ]              Number of trees in each bag (default=1)
        [ -leaf <l> ]              Number of leaves for each tree (default=100)
        [ -shrinkage <factor> ]    Shrinkage, or learning rate (default=0.1)
        [ -tc <k> ]                Number of threshold candidates for tree spliting. -1 to use all feature values (default=256)
        [ -mls <n> ]               Min leaf support -- minimum #samples each leaf has to contain (default=1)

    [-] Linear Regression-specific parameters
        [ -L2 <reg> ]              L2 regularization parameter (default=1.0E-10)

  [+] Testing previously saved models
        -load <model>              The model to load
                                   Multiple -load can be used to specify models from multiple folds (in increasing order),
                                   in which case the test/rank data will be partitioned accordingly.
        -test <file>               Test data to evaluate the model(s) (specify either this or -rank but not both)
        -rank <file>               Rank the samples in the specified file (specify either this or -test but not both)
        [ -metric2T <metric> ]     Metric to evaluate on the test data (default=ERR@10)
        [ -gmax <label> ]          Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4})
        [ -score <file> ]          Store ranker's score for each object being ranked (has to be used with -rank)
        [ -idv <file> ]            Save model performance (in test metric) on individual ranked lists (has to be used with -test)
        [ -norm ]                  Normalize feature vectors (similar to -norm for training/tuning)

1. -train <file>

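Specifies the training data file. The training data format is:

label    qid:$id    $featureid:$featurevalue    $featureid:$featurevalue ... # description

Each line represents one sample. Samples that belong to the same query share the same qid. label is the relevance of the sample to that query, and description is free text that does not take part in training.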
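To make the line format concrete, here is a minimal Python sketch that splits one training line into its parts. It is only an illustration, not RankLib's own parser: the names `Sample` and `parse_sample` are made up for this example, and it assumes integer labels and integer feature ids.

```python
from typing import Dict, NamedTuple


class Sample(NamedTuple):
    """One line of training data (hypothetical helper type)."""
    label: int
    qid: str
    features: Dict[int, float]
    description: str


def parse_sample(line: str) -> Sample:
    # Everything after the first '#' is the free-text description.
    body, _, description = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])                 # assumption: integer relevance labels
    qid = tokens[1].split(":", 1)[1]       # "qid:42" -> "42"
    features: Dict[int, float] = {}
    for token in tokens[2:]:
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)
    return Sample(label, qid, features, description.strip())
```

For example, `parse_sample("3 qid:1 1:1.0 2:0.5 # doc-a")` yields label 3, qid "1", two features, and the description "doc-a".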

2. -ranker <type>

Specifies the ranking algorithm:

MART (Multiple Additive Regression Tree), which is also known under the following names:

GBDT (Gradient Boosting Decision Tree)

GBRT (Gradient Boosting Regression Tree)

TreeNet

RankNet

RankBoost

AdaRank

Coordinate Ascent

LambdaMART

ListNet

Random Forests

Linear regression

3. -feature <file>

Specifies the feature definition file for the samples. Its format is:

feature1
feature2
...
# featureK (this feature is excluded from the analysis)
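The following Python sketch mirrors the convention described above: one feature per line, with lines starting with `#` left out. It is not RankLib's own loader; the function name `active_features` is hypothetical, and treating blank lines as ignorable is an assumption.

```python
from typing import Iterable, List


def active_features(lines: Iterable[str]) -> List[str]:
    """Return the feature entries that are not commented out."""
    kept: List[str] = []
    for raw in lines:
        entry = raw.strip()
        # Assumption: blank lines carry no meaning; '#' marks an excluded feature.
        if not entry or entry.startswith("#"):
            continue
        kept.append(entry)
    return kept
```

Calling it on the lines `1`, `2`, `# 3` returns `["1", "2"]`, so feature 3 would be left out of learning.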

4. -metric2t <metric>

Specifies the information-retrieval evaluation metric to optimize. Supported metrics are:

MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k
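As an illustration of what one of these metrics measures, below is a small Python sketch of NDCG@k using a common formulation (gain `2^label - 1`, discount `log2(rank + 1)`). RankLib's internal implementation may differ in details, and the function names are invented for this example.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k labels, in ranked order."""
    # enumerate() starts at 0, so position 1 gets discount log2(2) = 1.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels[:k]))


def ndcg_at_k(labels: Sequence[int], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (label-sorted) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(labels, k) / ideal
```

A perfectly ordered list scores 1.0, a list without any relevant document scores 0.0, and a single relevant document (label 1) placed second scores 1/log2(3), roughly 0.63.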

5. Example

java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -test MQ2008/Fold1/test.txt -validate MQ2008/Fold1/vali.txt -ranker 6 -metric2t NDCG@10 -metric2T ERR@10 -save mymodel.txt

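Explanation of the command:

Training data: MQ2008/Fold1/train.txt

Test data: MQ2008/Fold1/test.txt

Validation data: MQ2008/Fold1/vali.txt

Ranking algorithm: 6, LambdaMART

Metric optimized during training: NDCG, computed over the top 10 ranked results

Metric evaluated on the test data: ERR, computed over the top 10 ranked results

Saved model: mymodel.txt

The -validate parameter is optional, but it usually leads to a better model and is very important for RankNet, MART, and LambdaMART.

-metric2t only applies to list-wise algorithms (AdaRank, Coordinate Ascent, and LambdaMART). Point-wise and pair-wise algorithms (MART, RankNet, RankBoost) optimize their own internal RMSE / pair-wise loss instead. ListNet is a list-wise algorithm as well, but it does not use the metric given by -metric2t either.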
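If you launch training from a script, the command can be assembled from its parts. The sketch below only uses flags that appear in the usage text above; the function `build_train_command` and its parameter names are hypothetical, and the default jar path is simply the one used in the example.

```python
from typing import List, Optional


def build_train_command(train: str,
                        ranker: int,
                        jar: str = "bin/RankLib.jar",
                        validate: Optional[str] = None,
                        test: Optional[str] = None,
                        train_metric: Optional[str] = None,   # value for -metric2t
                        test_metric: Optional[str] = None,    # value for -metric2T
                        save: Optional[str] = None) -> List[str]:
    """Assemble the argument list for a RankLib training run."""
    cmd = ["java", "-jar", jar, "-train", train, "-ranker", str(ranker)]
    optional = [("-validate", validate), ("-test", test),
                ("-metric2t", train_metric), ("-metric2T", test_metric),
                ("-save", save)]
    for flag, value in optional:
        if value is not None:          # only emit flags that were requested
            cmd += [flag, value]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which of course requires Java and the jar to be available on the machine.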

6. k-fold cross validation

Sequential partitioning

java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -ranker 4 -kcv 5 -kcvmd models/ -kcvmn ca -metric2t NDCG@10 -metric2T ERR@10

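The training data is split sequentially into 5 equal parts. Part i serves as the test data of fold i, and the training data of fold i is made up of the remaining parts.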
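The idea of sequential partitioning can be sketched in a few lines of Python. This is not RankLib's code: `sequential_folds` is a made-up name, the items stand for whatever unit is being split (presumably one ranked list per query), and handing any remainder to the earliest folds is an assumption, since the text does not say how uneven sizes are handled.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sequential_folds(items: Sequence[T], k: int) -> List[Tuple[List[T], List[T]]]:
    """Split items, in order, into k (train, test) pairs."""
    if k < 2 or k > len(items):
        raise ValueError("k must be between 2 and the number of items")
    base, extra = divmod(len(items), k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # assumption: early folds absorb the remainder
        test = list(items[start:start + size])
        train = list(items[:start]) + list(items[start + size:])
        folds.append((train, test))
        start += size
    return folds
```

With ten items and k = 5, fold 1 tests on the first two items and trains on the other eight, fold 2 tests on the next two, and so on.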

Random partitioning

java -cp bin/RankLib.jar ciir.umass.edu.features.FeatureManager -input MQ2008/Fold1/train.txt -output mydata/ -shuffle

This shuffles the training data train.txt and stores the result in the mydata/ directory as train.txt.shuffled.

Generating the data for each fold

java -cp bin/RankLib.jar ciir.umass.edu.features.FeatureManager -input MQ2008/Fold1/train.txt.shuffled -output mydata/ -k 5

7. Evaluating a trained model

java -jar bin/RankLib.jar -load mymodel.txt -test MQ2008/Fold1/test.txt -metric2T ERR@10

8. Comparing models

java -jar bin/RankLib.jar -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/baseline.ndcg.txt
java -jar bin/RankLib.jar -load ca.model.txt -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/ca.ndcg.txt
java -jar bin/RankLib.jar -load lm.model.txt -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/lm.ndcg.txt

Each output file contains the NDCG@10 value for every query, plus the aggregate value over all queries, for example:

NDCG@10   170   0.0
NDCG@10   176   0.6722390270733757
NDCG@10   177   0.4772656487866462
NDCG@10   178   0.539003131276382
NDCG@10   185   0.6131471927654585
NDCG@10   189   1.0
NDCG@10   191   0.6309297535714574
NDCG@10   192   1.0
NDCG@10   194   0.2532778777010656
NDCG@10   197   1.0
NDCG@10   200   0.6131471927654585
NDCG@10   204   0.4772656487866462
NDCG@10   207   0.0
NDCG@10   209   0.123151194370365
NDCG@10   221   0.39038004999210174
NDCG@10   all   0.5193204478059303
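If you want to inspect such per-query files yourself, a small Python sketch like the one below can load them and count on how many queries one system beats another. It assumes each line has exactly three whitespace-separated columns (metric, query id, value) as in the sample above; `parse_idv` and `win_loss` are hypothetical names, and this does not reproduce any significance test.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_idv(lines: Iterable[str]) -> Tuple[Dict[str, float], Optional[float]]:
    """Return ({query id: value}, value of the 'all' row or None)."""
    per_query: Dict[str, float] = {}
    overall: Optional[float] = None
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue                      # skip blank or malformed lines
        _metric, qid, value = parts
        if qid == "all":
            overall = float(value)
        else:
            per_query[qid] = float(value)
    return per_query, overall


def win_loss(baseline: Dict[str, float], system: Dict[str, float]) -> Tuple[int, int]:
    """Count queries where the system is better / worse than the baseline."""
    wins = losses = 0
    for qid, base_value in baseline.items():
        if qid not in system:
            continue
        if system[qid] > base_value:
            wins += 1
        elif system[qid] < base_value:
            losses += 1
    return wins, losses
```

Ties count as neither a win nor a loss in this sketch.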

Then compare them:

java -cp RankLib.jar ciir.umass.edu.eval.Analyzer -all output/ -base baseline.ndcg.txt > analysis.txt

The comparison result in analysis.txt looks like this:

Overall comparison
------------------------------------------------------------------------
System                          Performance   Improvement          Win   Loss   p-value
baseline_ndcg.txt [baseline]    0.093
LM_ndcg.txt                     0.2863        +0.1933 (+207.8%)    9     1      0.03
CA_ndcg.txt                     0.5193        +0.4263 (+458.26%)   12    0      0.0

Detailed break down
------------------------------------------------------------------------
              [ < -100%)  [-100%,-75%)  [-75%,-50%)  [-50%,-25%)  [-25%,0%)  (0%,+25%]  (+25%,+50%]  (+50%,+75%]  (+75%,+100%]  ( > +100%]
LM_ndcg.txt   0           0             1            0            0          4          2            2            1             0
CA_ndcg.txt   0           0             0            0            0          1          6            2            3             0

9. Re-ranking with a trained model

java -jar RankLib.jar -load mymodel.txt -rank myResultLists.txt -score myScoreFile.txt

myScoreFile.txt only adds one column, which holds the newly computed ranking score. You still have to sort by that score yourself to obtain the new ranking order.

1   0   -7.528650760650635
1   1   2.9022061824798584
1   2   -0.700125515460968
1   3   2.376657485961914
1   4   -0.29666265845298767
1   5   -2.038628101348877
1   6   -5.267711162567139
1   7   -2.022146463394165
1   8   0.6741248369216919
...
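Since the sorting step is left to the reader, here is a minimal Python sketch of it. It assumes the three columns of the score file are the query id, the position of the document in the original list, and the score; the post only says that a score column is added, so treat that column reading as an assumption. The function name `rerank` is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rerank(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each query id to its document indices, best score first."""
    by_query = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue                       # skip blank or truncated lines such as "..."
        qid, doc_index, score = parts
        by_query[qid].append((int(doc_index), float(score)))
    return {
        qid: [doc for doc, _ in sorted(docs, key=lambda pair: pair[1], reverse=True)]
        for qid, docs in by_query.items()
    }
```

Applied to the nine sample rows above, query 1 is reordered as 1, 3, 8, 4, 2, 7, 5, 6, 0.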
