We thank the reviewers for their comments and suggestions, and for their appreciation of the significance of our paper.

Reviewer_3: Counterfactual inference as domain adaptation: In Section 2 we show that estimating ITE involves inference on the counterfactual data, which has a different distribution from the available factual labeled data. The factual data acts as the training set, and the unlabeled counterfactual data acts as the test set. Since in observational studies the factual and counterfactual (i.e. train and test) distributions differ, this is a case of domain adaptation. The reviewer is correct that it is not self-evident why the distributions of control and treatment should be made similar. We give a heuristic motivation in Section 3, paragraph 3. As the reviewer suggests, the theoretical motivation follows the results of Mansour et al. (2009), as well as Cortes & Mohri (2014). Schölkopf et al. (2012) discuss transfer learning (Sections 2.3.1 & 3.3.1) and covariate shift (Section 2.1.1) in relation to learning the causal direction of simple graphs, a question related to ours. We will clarify this.

We believe that Lasso+Ridge is a strong baseline for high-dimensional causal inference with squared error. It is exciting that our theory-guided method is competitive with this approach. The linear case is very restrictive; it is not clear that any objective would do better in this setting.

Reviewer_4: We wish to clarify the statement and applicability of Theorem 1. The theorem gives, for any fixed representation Phi, a bound on the *relative* error of a hypothesis trained on top of Phi. It does not take into account how Phi is obtained. Hence, the theory applies even if h(Phi(x),t) is not convex in x, e.g. when Phi is a neural net. Since the bound in the theorem holds for all representations Phi, we can attempt to minimize it over Phi, somewhat analogously to how Cortes & Mohri (2014) minimize their upper bound by reweighting distributions. Since our theorem applies to hypotheses that are linear in Phi, it applies directly to the evaluated models BNN-4-0 and BLR. Only for BNN-2-2 is the theorem not directly applicable, since in that case the hypothesis space is not linear.

(*) Reviewers 4 & 5: No close neighbors in the opposite treatment group: Indeed, the reviewers are correct that a lack of close counterfactual neighbors might negatively affect our method. Note that most counterfactual inference methods would fail in this case, unless other strong assumptions are made, such as no model misspecification (e.g. the true outcome being linear). We thank the reviewers for suggesting that we evaluate performance as a function of the difference between populations, and will investigate this.

Reviewer_5: The reviewer is correct in their understanding of BNN-4-0 and BNN-2-2. We will clarify their presentation, add details of the algorithm to the appendix, and release all code. We answer the 7 detailed comments:

1 & 6. The optimization was well-behaved for the datasets evaluated in the paper. The performance of BNN-4-0 could be due to overfitting to the factual data. Only BLR was trained using alternating minimization; the neural networks were trained using SGD. Nearest neighbors are chosen and fixed in advance based on Euclidean distance in the input space (a minimal sketch follows below).
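To make the fixed nearest-neighbor step in 1 & 6 concrete, here is a minimal sketch. The function name and array layout are our own illustration, not code from the paper; it only shows that the matching is precomputed once, in the raw input space, and then held fixed during training.

import numpy as np

def counterfactual_neighbors(X, t):
    """For each unit, return the index of its nearest neighbor (by
    Euclidean distance in the raw input space) within the opposite
    treatment group. Computed once, before training, and kept fixed.
    X: (n, d) array of contexts; t: (n,) binary treatment indicators."""
    nn = np.empty(len(t), dtype=int)
    for i in range(len(t)):
        opposite = np.where(t != t[i])[0]                   # units under the other treatment
        dists = np.linalg.norm(X[opposite] - X[i], axis=1)  # Euclidean distances
        nn[i] = opposite[np.argmin(dists)]                  # fixed nearest counterfactual
    return nn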
2. We concur with the reviewer's understanding.

3. Theorem 1 does not directly yield a quantity usable by a practitioner, but it provides useful theoretical guidance for the design of causal inference algorithms. The reviewers raise an excellent question about the trustworthiness of causal inference results. This is discussed at length in Rosenbaum's "Observational Studies" book, and many of the qualitative checks proposed there could be applied to our method. Providing quantitative answers remains an important open problem for the machine learning community.

4. See (*) above.

5. We agree with claim (1): variable selection may yield too weak a representation. We only partially agree with claim (2): on News, BNN-4-0 actually performs slightly better than the standard neural network NN-4, even though BNN-4-0 (unlike NN-4) does not include nonlinear interactions between treatment and context. This fact, and the significant improvements of BNN-2-2 over NN-4, which are due solely to balancing the representations, both contradict (2).

7. The algorithm and theorem could readily be extended to other loss functions and hypothesis classes. For most cases (e.g. logistic regression), however, the discrepancy has no closed form, making optimization much more challenging. In practice one could use the linear discrepancy with nonlinear hypothesis spaces (as we did with BNN-2-2) or with other losses, without the theoretical justification of Theorem 1; a sketch of the linear discrepancy is given at the end of this response. We believe these are among the exciting open questions that our work brings to the ML community.
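As an illustration of the linear discrepancy referenced in point 7: a minimal sketch, assuming the squared-loss form of Cortes & Mohri (2014), in which the discrepancy between two samples reduces, up to constants depending on the norm bound of the linear hypothesis class, to the spectral norm of the difference of their second-moment matrices. The function name and signature are ours.

import numpy as np

def linear_discrepancy(R_p, R_q):
    """Empirical discrepancy between two samples of representations
    Phi(x), for linear hypotheses under squared loss: the spectral
    norm of the difference of the second-moment matrices (up to
    constants; cf. Cortes & Mohri, 2014).
    R_p: (m, k) array; R_q: (n, k) array."""
    M_p = R_p.T @ R_p / len(R_p)          # second-moment matrix of sample p
    M_q = R_q.T @ R_q / len(R_q)          # second-moment matrix of sample q
    return np.linalg.norm(M_p - M_q, 2)   # largest singular value

In a balancing objective, a term of this kind would be computed between the representations of the control and treated groups and added to the factual training loss.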