Paper ID: 152
Title: Linking losses for density ratio and class-probability estimation

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents an in-depth study of the relationship between the problem of density ratio (dr) estimation and that of conditional class probability (ccp) estimation, more specifically of the relationships between losses, or excess losses, for the two problems. The starting observation rests on the simple fact that dr and ccp are in one-to-one correspondence via the function G(x) = x / (1-x) (assuming equal class priors). In this sense the two problems are equivalent, and any loss used for one problem can also be used for the other, albeit with a different link function. (The link function links the theoretical loss-minimizing score function, i.e. the Bayes-optimal scorer for loss l, to the theoretical object of interest, the dr or the ccp; through it, an estimate of this optimal score function can be transformed into an estimated dr or ccp.) The paper pushes the study further by examining the relationship between excess losses (regrets) for the score functions and corresponding "distances" between the estimated and true objects of interest. In the ccp case this has been studied in previous work, where the regret equals a certain Bregman divergence between ccps. The paper establishes that, when transferred to the dr case, the regret can be interpreted via another Bregman divergence between drs, with a transformed generator. Finally, previous work interpreting the regret in the ccp case as a weighted mixture of regrets for "elementary" cost-sensitive losses can likewise be lifted to interpret the regret in the dr case as a mixture of elementary regrets with a transformed weight function. This in turn gives rise to the design of dr losses with a prescribed weight function viewed as preferable for some reason.

Clarity - Justification: The quality of the writing and the exposition of the technical results is very good, but the interspersed discussions are perhaps too verbose, which is somewhat detrimental to the overall picture.

Significance - Justification: The contributions are interesting, in particular as concerns the "design" and interpretation of new losses for the dr case based on desirable properties of the weight function.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Pros of the paper are that it is quite readable and that the contributions are interesting, in particular as concerns the "design" and interpretation of new losses for the dr case based on desirable properties of the weight function. As possible cons, I find the overall progression of the paper verbose and somewhat confusing, and I get the impression that the contributions could have been presented in a more straightforward way.

Remark: it seems to me that in Proposition 6 the factors 1/2 should be 2 and vice versa. Or maybe I am confused?
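For concreteness, the correspondence at the heart of the paper can be written out as follows (a sketch in my own notation, with equal class priors assumed as above; the symbols eta, r, and psi are mine, not necessarily the paper's):

% eta(x): conditional class probability; r(x): density ratio.
\[
  r(x) \;=\; G(\eta(x)), \qquad G(z) = \frac{z}{1-z}, \qquad G^{-1}(t) = \frac{t}{1+t}.
\]
% A CPE loss with link \psi (so that \eta = \psi^{-1}(f^*) at the Bayes-optimal
% scorer f^*) thus induces a DRE method with the composite link G \circ \psi^{-1},
% and any ccp estimate \hat{\eta} yields the dr estimate
\[
  \hat{r}(x) \;=\; G(\hat{\eta}(x)) \;=\; \frac{\hat{\eta}(x)}{1 - \hat{\eta}(x)}.
\]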
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The work notes the relationship between density ratio estimation and class probability estimation and exploits this link to define novel density ratio estimators. In particular, estimators are designed and analysed that focus on different aspects of density ratios.

Clarity - Justification: The work is fairly readable, but as a reader I kept wondering what the actual, overall aim of the paper is. Showing the link between DRE and CPE is very nice, but it is unclear to me what more there is. In particular, I wonder why the experimental section is there. What does it really want to demonstrate? As stated in one of its subsections, "our aim is simply to confirm that other CPE losses are also viable." Why "also"? It was shown that KLIEP and LSIF both fit the framework; are they not considered viable? The ranking experiment is also beyond me: what are we to learn from the results? That LSIF is not the best, fine, but, in the context of the theory in the paper, what is the point? Also, I do not see why we need large parts of Subsections 2.2 and 5.1. The theory there is mostly a restatement of known results; even part of a proof is repeated. How do the details mentioned tie in with the rest of the paper? All in all, I think a to-the-point 4-page paper would have been enough.

Significance - Justification: Though not a major breakthrough, I think making the link between DRE and CPE is significant. As indicated already, if it were up to me, the paper would have limited itself to demonstrating this link. For much of the rest of the text, it is unclear what it really adds. This is in particular the case for the experimental section.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): There is one final thing that I feel needs to be corrected, or whose problems should at least be noted in the paper. The work gives the impression that providing a good estimator of the density ratio is enough to fix certain problems, e.g. tasks that suffer from covariate shift. This, however, is certainly not the case. In fact, one can easily construct cases where corrected estimators become worse than uncorrected ones, even when one has the true density ratio. Generally, the problem is that corrections may easily lead to an unacceptable increase in estimator variance. More can, for instance, be found in the work of Cortes et al. [1] and Swaminathan & Joachims [2].

[1] Cortes, C., Mansour, Y., & Mohri, M. (2010). Learning bounds for importance weighting. In Advances in Neural Information Processing Systems (pp. 442-450).
[2] Swaminathan, A., & Joachims, T. (2015). The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems (pp. 3213-3221).
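A minimal simulation of the kind I have in mind (my own construction, not taken from the paper: a Gaussian mean shift with the true ratio plugged in) illustrates the effect:

# Sketch: importance weighting with the TRUE density ratio can still lose to
# the uncorrected estimator in MSE, because the weights inflate the variance.
import numpy as np

rng = np.random.default_rng(0)

mu = 3.0                                            # source q = N(0,1), target p = N(mu,1)
true_ratio = lambda x: np.exp(mu * x - mu**2 / 2)   # w(x) = p(x) / q(x), known exactly
f = lambda x: x                                     # quantity of interest: E_p[f(X)] = mu

n, trials = 200, 5000
truth = mu

plain, weighted = [], []
for _ in range(trials):
    x = rng.normal(0.0, 1.0, n)                     # sample from the source q only
    plain.append(f(x).mean())                       # uncorrected: biased, low variance
    weighted.append((true_ratio(x) * f(x)).mean())  # corrected: unbiased, heavy-tailed

for name, est in (("uncorrected", plain), ("IW-corrected", weighted)):
    est = np.asarray(est)
    print(f"{name:13s} bias {est.mean() - truth:+8.3f}  "
          f"var {est.var():10.3f}  mse {((est - truth) ** 2).mean():10.3f}")

Even with the exact ratio, the weighted estimator's variance (driven by E_q[w^2] = exp(mu^2)) swamps the benefit of unbiasedness, so its MSE far exceeds even the squared bias of the naive estimator; the exact numbers vary with the seed, but not the conclusion.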
Line 066, "But this suffers...": Please try to avoid starting sentences with "But". Significance - Justification: The results can be interesting to a subset of the machine learning community. To make the paper more interesting to the wider machine learning community, in my opinion the paper needs a better motivation and explanation why these results are important, interesting, and useful. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper contains enough novel results, but the presentation of these results needs improvement in my opinion. The structure of the paper needs some changes, because sometimes I felt that paper is a collection of loosely connected parts. The authors should improve on the motivation of the paper. Why are these results important and interesting? Who will use them? How will use them? What are the implications of these results? For example, the authors show a new application for DRE losses, but it is not clear how much we will need these results in this paper. It is also not clear to me how significant the results are in Section 8. The authors mention the applications of density ratio (e.g. covariate shift problems), but it is not clear how the results of this paper can be applied in important applications. The results in Section 8 are not convincing enough. This section needs more details and better presentation. Many questions still remain open. For example, while the paper shows how different loss functions can be used for density estimation, it is not clear what happens to the rate of these estimators. =====
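For reference, the standard population-level objectives of the two DRE methods discussed above, in the forms given in the original KLIEP and LSIF papers rather than as restated in the paper under review, are:

% KLIEP (Sugiyama et al., 2008): minimize KL(p || \hat{r} q),
\[
  \min_{\hat{r}}\; -\,\mathbb{E}_{x \sim p}\big[\log \hat{r}(x)\big]
  \quad \text{subject to} \quad \mathbb{E}_{x \sim q}\big[\hat{r}(x)\big] = 1.
\]
% LSIF (Kanamori et al., 2009): least-squares fitting of the ratio under q,
\[
  \min_{\hat{r}}\; \tfrac{1}{2}\,\mathbb{E}_{x \sim q}\big[\hat{r}(x)^2\big]
  - \mathbb{E}_{x \sim p}\big[\hat{r}(x)\big],
\]
% which equals \tfrac{1}{2}\,\mathbb{E}_q\big[(\hat{r} - r)^2\big] up to a
% constant independent of \hat{r}.

=====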