Paper ID: 784
Title: Recommendations as Treatments: Debiasing Learning and Evaluation

===== Review #1 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper constructs an analogy between "exposure" to items in recommender systems and exposure to treatments in medical studies. In practice this means the authors draw upon ideas from causal inference and propensity score matching in order to derive unbiased estimators of performance. Estimating recommendation quality is difficult, since one does not observe the counterfactual scenario, i.e., how good a recommendation that was not made _would_ have been. This is where propensity score matching can help, by identifying users that would react similarly given the same treatment. The system is evaluated by introducing progressively larger amounts of selection bias into the training data and measuring the robustness of the approaches being compared.

Clarity - Justification:
The paper is clearly written.

Significance - Justification:
The application of propensity score matching ideas to the evaluation of recommender systems is nice, and as far as I am aware is new. I don't know that the discussion of medical treatment is necessary; propensity score matching is used in a variety of observational study settings, medical or otherwise. I'm not an expert on the topic, so I'm not entirely sure how immediate an application of the idea this is, but it's a thorough enough exposition and novel as far as I know.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I had a little trouble understanding the contribution precisely. Obviously (as the authors mention) missing-not-at-random models and the like are popular within recommender systems and already have the goal of correcting for the type of bias issues described in Fig. 1. Propensity score matching is perhaps less known for this type of application, but I had trouble pinning down what exactly is novel among the example in Fig. 1, the goal of bias correction, the use of propensity score matching, etc. In any case, using propensity score matching to evaluate recommender systems seems like a nice idea. It's a concept I have only passing familiarity with, so I'm not entirely able to judge to what extent its application here is "obvious" or not. The evaluation is nice. It includes both a large dataset from Yahoo and a dataset the authors collected. Although the dataset source has been obfuscated for the review copy, some more details should be given; e.g., the number of users is not stated.

===== Review #2 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper investigates using propensity scores to debias evaluation metrics (MSE, MAE, and DCG) for recommender systems, and also to train with an unbiased estimate of the expected loss. The main proposed idea for debiasing training is to weight the examples by their inverse propensities in the training objective. This makes it possible to use MNAR data as long as there is a way to estimate the propensities. The authors provide two examples of (known) techniques for estimating the propensities and include discussion and experiments on the robustness of the evaluation to inaccuracy in the propensities.

Clarity - Justification:
Most of the paper is well written. The experimental section contains many typos and seems to have been written in a rush.
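To make the weighting described in the summary above concrete, a minimal sketch of an inverse-propensity-scored error estimate; the variable names and the NumPy framing are illustrative assumptions, not code from the paper:

    import numpy as np

    def ips_error(y_true, y_pred, propensity, n_users, n_items):
        # Inverse-propensity-scored estimate of the average squared error over the
        # full user-item matrix, computed from the observed pairs only.
        # propensity[k] is the estimated probability that observed pair k was revealed;
        # dividing by it re-weights the self-selected sample toward the full matrix.
        per_pair_error = (y_true - y_pred) ** 2   # swap in np.abs(y_true - y_pred) for MAE
        return np.sum(per_pair_error / propensity) / (n_users * n_items)

    # Attaching the same 1/propensity weights to each observed training example yields
    # the debiased (IPS-weighted) training objective described in the summary above.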
Significance - Justification:
The ideas of using unbiased estimators via inverse propensity scoring and of using them for training are widely known, and claiming them as contributions of this paper is a little misleading. The paper includes many citations to existing work that uses training objectives weighted by inverse propensity scores, and the claim that this paper is the first to apply them to recommendation seems like a marginal contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper investigates using propensity scores to debias evaluation metrics (MSE, MAE, and DCG) for recommender systems, and also to train with an unbiased estimate of the expected loss. The main proposed idea for debiasing training is to weight the examples by their inverse propensities in the training objective. This makes it possible to use MNAR data as long as there is a way to estimate the propensities. The authors provide two examples of (known) techniques for estimating the propensities and include discussion and experiments on the robustness of the evaluation to inaccuracy in the propensities.

The idea of training with inverse propensities is not novel; there is prior work on using inverse propensity scores for training in various domains, including domain adaptation, learning with covariate shift, active learning, contextual bandits, etc. As for the evaluation of MF-IPS, I would like to see comparisons to other methods that have non-random missing-data models, for example CPT-v and MM from (Marlin & Zemel, 2009) on the Yahoo dataset, which the text reports to perform *better* than the proposed method (line 845).

Detailed comments:
* A lot of the experiments are done on semi-simulated data (the ML100K observational model). Little rationale is given for the propensity model chosen, and the robustness of learning shown in the results may be an artifact of that particular propensity model. The paper would be strengthened by showing that the reported results do not qualitatively change under a different propensity model.
* Plots should be more consistent. Please reconcile the different axes to be always RMSE (or always MSE).
* Plots need to be more legible. Please use different line styles (not just solid) and make sure the colors are distinguishable.
* Line 690: Figure 2 does not have results for MAE as referred to in the text.
* Line 764: Figure 3 -> Figure 4.
* Figure 4 does not show RMSE for IPS, but the caption mentions it.

===== Review #3 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a new approach to de-biasing learning and evaluation in recommender systems where the items that users rate are determined by the users themselves, or by the system in a closed loop. Both settings produce data with strong selection bias, which in turn biases learning and performance evaluation. The approach presented by the authors is based on propensity scoring techniques from causal modeling. The authors propose both propensity-weighted evaluation metrics and a framework for learning matrix factorization models under these metrics.

Clarity - Justification:
The paper is very well written and easy to follow.

Significance - Justification:
The methods shown for de-biasing evaluation are quite simple and the results look very promising. The idea of ERM under a propensity-adjusted objective is also interesting, as it is perfectly modular, allowing separate estimation of the propensity weights and the rating model parameters.
This means it should be extendable to any rating prediction model. This is a strong practical advantage of the proposed approach over the joint observation/data model approach explored by Marlin and Zemel and others. The one aspect of the work that limits its significance is that ratings for a random sample of items are needed to allow for propensity score modeling in the observational setting. This is a significant limitation, as for many existing data sets (the real MovieLens data sets, the Netflix data set, etc.) no such random sample exists.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, this paper is very strong. One area that could use more discussion is the precise relationship between this approach and the earlier joint modeling approach introduced by Marlin and Zemel. Propensity score matching derives from the Rubin causal model, which is a theory of causal inference based on the potential outcomes framework. The potential outcomes framework is a missing-data interpretation of counterfactual inference that leveraged Rubin's earlier work on missing data. Propensity score matching and methods for treating MNAR data are thus tightly linked.

For example, the propensity estimator shown in Equation 21 is very closely related to the estimator for the CPT-v mechanism based on ratings for a random sample of data as described by Marlin (the CPT-v+ mechanism/estimator). It is clear in Marlin's work that this mechanism is learned in a way that does not depend on the user or the item. In the present work, it is not completely clear how P(Yu,i) is estimated, since Yu,i is user u's rating for item i. The model for P(Yu,i) should be described more explicitly. For example, is it tied across all users, all items, or both? This clearly affects the sample complexity of estimating the corresponding parameters.

One of the main differences between the current approach and the joint modeling approach seems to be that the joint modeling approach involves an expectation over the missing ratings, due to the coupling between the response model and the data model, whereas the proposed approach is based only on re-weighting the observed data. Reconciling these differences would be quite interesting.
=====
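On the question raised in Review #3 about how P(Yu,i) enters the propensity estimate, a minimal sketch of a naive-Bayes-style propensity model tied across all users and items, so that the propensity depends only on the rating value and a small MCAR sample supplies the marginal P(Y=r); the function and variable names are illustrative assumptions, not the authors' code:

    import numpy as np

    def naive_bayes_propensities(mnar_ratings, mcar_ratings, n_users, n_items,
                                 rating_values=(1, 2, 3, 4, 5)):
        # P(O=1 | Y=r) via Bayes' rule, tied across all users and items:
        #   P(O=1 | Y=r) = P(Y=r | O=1) * P(O=1) / P(Y=r)
        # mnar_ratings: observed (self-selected) rating values.
        # mcar_ratings: rating values from a small random sample, used for P(Y=r).
        mnar = np.asarray(mnar_ratings)
        mcar = np.asarray(mcar_ratings)
        p_obs = len(mnar) / (n_users * n_items)      # P(O=1)
        propensities = {}
        for r in rating_values:
            p_r_given_obs = np.mean(mnar == r)       # P(Y=r | O=1), from MNAR data
            p_r = np.mean(mcar == r)                 # P(Y=r), from the MCAR sample
            propensities[r] = p_r_given_obs * p_obs / p_r
        return propensities

Under this kind of tying, only one parameter per rating value has to be estimated, which would explain why a small MCAR sample can suffice; whether the paper's Equation 21 is tied in exactly this way is the point the review asks to have stated explicitly.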