We would like to thank all reviewers for their valuable and helpful comments. We are happy to incorporate them into the paper.

Assigned_Reviewer_4 and Assigned_Reviewer_6: Elaborate on the difference to existing MNAR approaches in recommendation.

Our approach differs from the existing joint-likelihood imputation methods in two fundamental respects. First, from a technical perspective, our approach is discriminative while the joint-likelihood approach is generative. We can therefore expect to inherit many of the advantages of discriminative methods (e.g., efficiency, predictive performance, fewer modeling assumptions), and we avoid latent-variable models that are difficult to train. As Assigned_Reviewer_6 points out, propensity scoring estimates expectations (e.g., the risk) directly from the incomplete data, without modeling the full rating distribution (a minimal sketch of such an estimator is given below, after our responses to Assigned_Reviewer_5). Furthermore, our risk estimator is provably unbiased in the controlled-experiment setting (e.g., ad recommendation), a setting to which the joint-likelihood imputation approach does not apply. Second, from a conceptual perspective, the causal-inference view we bring to recommendation differs from the joint-likelihood imputation view of prior work in recommendation. This causal view opens several new areas of research that are currently underexplored. For example, it provides a principled framework to explore direct policy learning methods for recommendation that avoid the rating prediction step. Similarly, it motivates propensity estimation as a learning problem of crucial importance for recommendation. We will further emphasize these distinctions in Section 2 of the paper.

Assigned_Reviewer_5: "The paper includes many citations to existing work which uses training objectives weighted by inverse propensity scores and the claim that this paper is the first to apply it to recommendation seems like a marginal contribution."

We would like to refer to Assigned_Reviewer_6's "Significance Justification"; we could not say it any better. The simplicity, generality, and modularity of the propensity-scoring approach bring a novel view to recommendation problems, one that opens new areas for research and is highly practical in applications. We will further clarify our contributions in Sections 1, 2, and 7.

Assigned_Reviewer_5: "As for evaluation of MF-IPS, I would like to see comparisons to other methods that have non-random missing data models, for example CPT-v, MM from (Marlin & Zemel, 2009) on Yahoo dataset, which in the text is reported to perform *better* than the proposed method (line 845)."

We already do this, and we will give it a bigger footprint in the final version of the paper. In particular, we compare against the published results of Marlin & Zemel's methods on the Yahoo dataset, as suggested. As stated in the paper, our method beats their models in terms of MAE and MSE, with the exception of MSE for the MM-CPT-v variant. Their source code is not available, and we trust that the reported numbers are better than any reimplementation of their complex EM procedure that we could produce.

Assigned_Reviewer_5: "Plots should be more consistent. Please reconcile different axes to be always RMSE (or always MSE)."

This was deliberate: we want to distinguish between the error of the risk estimator (where we use RMSE to summarize bias and variance) and the prediction error of the learning method (where we use MSE). Comparing these two quantities is meaningless, and we did not want to suggest to the reader that such a comparison would make sense. We will clarify this further in the experiment description.
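To make the estimator referenced in our response to Assigned_Reviewer_4 and Assigned_Reviewer_6 concrete, here is a minimal sketch of an inverse-propensity-scored risk estimate. The function and array names are illustrative, and we assume a squared-error loss and strictly positive propensities:

```python
import numpy as np

def ips_risk_estimate(y_true, y_pred, observed, propensities):
    """Inverse-propensity-scored risk estimate over a (U x I) rating matrix.

    observed is a 0/1 mask of revealed entries; propensities holds the
    marginal observation probabilities P_ui (assumed > 0 everywhere).
    """
    num_users, num_items = y_true.shape
    delta = (y_true - y_pred) ** 2  # squared-error loss; use np.abs(...) for MAE
    # Each observed loss term is reweighted by 1 / P_ui, which makes the
    # estimate unbiased in expectation over the observation pattern.
    return np.sum(observed * delta / propensities) / (num_users * num_items)
```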
Assigned_Reviewer_6: "The one aspect of the work that limits its significance is that ratings for a random sample of items are needed to allow for propensity score modeling in the observational setting."

Our general framework does not require such a sample; this requirement is specific to the Naive Bayes estimator. Note that the Logistic Regression estimator does not require such a sample, but it could not be used for the Yahoo dataset since that dataset does not include meaningful features. In fact, since the matrix O provides fully observed data without missing values, there are many intriguing possibilities for designing new propensity estimation methods. For example, one could perform a fully observed Bernoulli matrix factorization of O to estimate P. More generally, we conjecture that there will be much research on improved propensity estimation techniques in the future, since Theorem 5.2 shows that better propensities further improve recommendation in the observational setting.

Assigned_Reviewer_6: Clarify estimation of P(Y_ui) in the Naive Bayes propensity model.

We tie the parameters across all u and i, so P(Y_ui) has only 5 parameters (corresponding to the relative frequencies of the five star ratings) that need to be estimated (or set through an educated guess); a sketch of this tying is given below. We will clarify this in the final version.
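To illustrate the parameter tying, here is a minimal sketch of the Naive Bayes propensity computation, assuming 1-to-5 star ratings; the function name and arguments are illustrative, not from the paper:

```python
import numpy as np

def naive_bayes_propensities(mnar_ratings, mcar_ratings, frac_observed, n_ratings=5):
    """Propensities P(O = 1 | Y = r) with parameters tied across all u and i.

    mnar_ratings  : integer ratings actually observed (self-selected, MNAR)
    mcar_ratings  : integer ratings from the small MCAR sample
    frac_observed : P(O = 1), the overall fraction of revealed matrix entries
    """
    # P(Y = r | O = 1): relative rating frequencies among the MNAR observations.
    p_y_given_o = np.bincount(mnar_ratings, minlength=n_ratings + 1)[1:].astype(float)
    p_y_given_o /= p_y_given_o.sum()

    # P(Y = r): the 5 tied parameters, estimated here from the MCAR sample
    # (alternatively, they could be set through an educated guess).
    p_y = np.bincount(mcar_ratings, minlength=n_ratings + 1)[1:].astype(float)
    p_y /= p_y.sum()

    # Bayes' rule: P(O = 1 | Y = r) = P(Y = r | O = 1) * P(O = 1) / P(Y = r).
    return p_y_given_o * frac_observed / p_y
```

Because the parameters are tied, the propensity of any observed entry depends only on its rating value: an entry (u, i) with rating r gets the propensity at index r - 1 of the returned array.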