Paper ID: 1206
Title: Meta-Gradient Boosted Decision Tree Model for Weight and Target Learning

=====
Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

In the context of learning to rank for search engines, the authors present a method that uses crowdsourced labels to learn a ranking function. Assuming access to a smaller set of gold labels that serves as a "validation set", the authors propose to learn mappings from contextual features of a crowdsourced label (e.g. the index of the worker) to the weight and target of a weighted linear regression applied to the crowdsourced dataset. The procedure learns the mapping so as to optimize the performance of the regression function on the reference "validation set". Experiments are presented on a large proprietary dataset.

Clarity - Justification:

The paper is very well written overall. Yet, I am not sure I perfectly understood the experimental setup (see comments below), so I do not rate "excellent" for now.

Significance - Justification:

The main idea of learning the weights of a weighted least-squares regression trained on poorly labeled data in a way that optimizes the error on a gold standard was introduced by Ustinovskiy et al. (2015), in the context of learning from implicit feedback. This paper presents several novelties: (1) learning both weights and transformations of targets; (2) detailed experiments on crowdsourced data, including the additional features they use; (3) implementation details that also change, such as using decision trees/LambdaRank instead of linear functions/SoftRank.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I think the algorithm presented in the paper is interesting and the presentation is clear. The approach is quite original compared to usual work on learning to rank. While most of the novelty was previously introduced in Ustinovskiy et al., the novelty is fair and the experimental results seem quite good. I have several comments, though:

(1) I am not sure I perfectly understood the experimental protocol, but it seems to me that the baselines (apart from the linear baseline from Ustinovskiy et al.) do not use the gold-labeled training data. There would be simple ways to do so, such as weighting two objective functions, one computed on the gold labels and another computed on the crowdsourced data (a sketch follows these comments).

(2) I would have liked to see the performance of learning only from the gold labels; how much does the crowdsourced data help?

(3) Experiments are carried out on a single proprietary dataset. The framework could also be applied in a setting more similar to Ustinovskiy et al. (2015) (using implicit user feedback), for a more complete comparison with their algorithm.

(4) Learning with several sources of information is not really new, but there is very little background on previous methods (e.g. "Learning to Rank with Multiple Objective Functions", Svore et al., WWW'11), even though these should be applicable in the current context.
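A minimal sketch of the combined-objective baseline suggested in (1), assuming a pointwise squared loss and a linear model; the array names (X_gold, y_gold, X_crowd, y_crowd) and the mixing weight alpha are hypothetical, and alpha would be tuned on held-out gold data:

```python
import numpy as np

def combined_grad(w, X_gold, y_gold, X_crowd, y_crowd, alpha):
    """Gradient of alpha * MSE(gold) + (1 - alpha) * MSE(crowd) w.r.t. w."""
    g_gold = 2.0 * X_gold.T @ (X_gold @ w - y_gold) / len(y_gold)
    g_crowd = 2.0 * X_crowd.T @ (X_crowd @ w - y_crowd) / len(y_crowd)
    return alpha * g_gold + (1.0 - alpha) * g_crowd

def fit_mixed(X_gold, y_gold, X_crowd, y_crowd, alpha=0.5, lr=0.01, steps=1000):
    """Plain gradient descent on the mixed objective; alpha trades off
    the small gold-labeled set against the large noisy crowdsourced set."""
    w = np.zeros(X_gold.shape[1])
    for _ in range(steps):
        w -= lr * combined_grad(w, X_gold, y_gold, X_crowd, y_crowd, alpha)
    return w
```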
=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

In this paper, the authors consider how to improve the quality of labeled data with machine learning. Labeled training data are important, but one always faces a trade-off between label quality and labeling cost. The authors propose a machine learning framework that addresses this problem by reweighting and remapping labeled training data. In an existing machine learning algorithm, there are both training data (with noisy labels) and a model. To achieve a high score on the validation set, the authors train another model (a gradient boosted decision tree) to denoise the labels.

Clarity - Justification:

The authors provide the objective function, the gradients for both weight learning and label-class learning, and the way the gradients are computed. I do not see obvious mistakes in this part. The notation and the formulas are also clear.

Significance - Justification:

The authors provide a new way to address this problem with machine learning. Before this paper, the traditional methods for this problem were rule-based and heuristic.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

This paper makes a good attempt (using machine learning to improve the quality of training data for machine learning). It is a good way to think about this problem, and the proposed method also significantly outperforms the baselines. In the experimental section, the authors explain how the proposed framework can be applied to the problem of learning to rank with crowdsourced labels. In the results, the authors' approach performs better than other noise-reduction algorithms. It might be better to design a simulation experiment to better examine whether the data after reweighting and remapping are more correct.

=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors use GBDT on label features to learn a reweighting of training examples and a mapping of crowdsourced targets for a ranking problem, based on gradients derived from a linear model. Applying the linear model to these enhanced targets yields a significant improvement over several baselines.

Clarity - Justification:

This paper is very well written. While the material presented is not trivial, the paper was easy to follow, and a pleasure to read.

Significance - Justification:

The work presented is well supported by thorough references and justifications. From an application perspective, ranking is a problem of central commercial importance, and any significant gain in relevance is of high impact. More generally, an efficient method to handle crowdsourced labels is presented, and it could generalize to any application where such labels are available.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

My biggest concern about the whole approach is actually alluded to in the last sentence of the paper. While the GLS model is useful for deriving sensible gradient formulations for the weights and targets, the final ranking model itself might be overly simple, especially considering the complex document-query feature sets typically computed by industrial search engines. Hence, a reasonable approach might be to learn more complex models (e.g. another instance of GBDT, or a neural network) on the targets generated by the method presented in this paper; a sketch follows this paragraph. Usage of the weights would be less obvious in that context, but they could nevertheless be used as additional covariates. While this approach would likely be less well principled, it would nevertheless have the potential to improve the performance of the linear model.
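A minimal sketch of this stacking idea, assuming the paper's method has already produced remapped targets and learned example weights; the names (X, t_remapped, w_learned) are hypothetical, and scikit-learn's GradientBoostingRegressor stands in for "another instance of GBDT":

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_stacked_gbdt(X, t_remapped, w_learned, weights_as_covariates=True):
    """Fit a GBDT on the remapped targets produced by the paper's method.

    X: document-query feature matrix; t_remapped: remapped targets;
    w_learned: learned example weights (all hypothetical names).
    """
    if weights_as_covariates:
        # Append the learned weight as one more covariate, as suggested above.
        X_aug = np.hstack([X, w_learned.reshape(-1, 1)])
        return GradientBoostingRegressor(n_estimators=500, max_depth=4).fit(
            X_aug, t_remapped)
    # Alternative: use the learned weights to weight the training examples.
    return GradientBoostingRegressor(n_estimators=500, max_depth=4).fit(
        X, t_remapped, sample_weight=w_learned)
```

Using the weights as sample weights sidesteps the question of how to obtain a weight for an unseen query-document pair at prediction time, which the covariate variant would have to answer.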
Also, the approach discussed on line 752 somewhat addresses this issue, but it is still limited to an additional set of features that can ultimately only be combined linearly.

Some minor comments:

- Line 029: Typo, missing "experiments".
- Eq. 5 seems to assume that L is a sum over examples (L = sum_i l_i), which is very likely the case, but maybe it could be mentioned.
- Line 474: "algorithms".
- Algorithm 1, line 4: Maybe J is different in lines 5 and 6.
- Line 499: In most GBDT implementations (e.g. TreeBoost), epsilon is optimized by line search for each additional tree (or even for each tree region), i.e. epsilon_m = argmin_epsilon sum_i L(y_i, F_{m-1}(x_i) + epsilon * h_m(x_i)).

=====