We sincerely thank all reviewers for their helpful comments.

- To reviewer 1

1. "The algorithm is very reminiscent of the landmarks paper (Gong et al. 2013)."
The landmark paper aims to select a subset of the training data to mimic the feature distribution of the target domain. It does not consider any possible shift in the conditional distribution P(Y|X), whereas we use feature transformation and importance reweighting to handle such a shift. In addition, we allow a location-scale transform in P(X|Y). We will include this discussion in the introduction.

2. "The paper is too dense. The authors should consider adding more explanations and move some to the supplement."
To make the method well grounded, we included the technical details and theoretical analysis, which made the paper a little dense. We will move some details to the supplement and add more intuitive explanations.

3. "For the real datasets, it would be useful to empirically verify and show the reader how the class distribution changes between domains."
The proposed method performs well for two reasons: the estimation of the class ratio, and the location-scale transformation, which allows finding more transferable features. On the real data sets, we find that the change in class distribution is usually small: the class ratio is between 0.8 and 1.3 on the visual data and between 0.9 and 1.1 on the WiFi data. Thus, the main improvement on the real data is attributed to the location-scale transform (a short illustrative sketch is given after our answers to this reviewer). One promising real application of class-ratio estimation would be domain adaptation for object detection, where the proportions of object and background regions in the test images are highly imbalanced compared with those in the training data collected by humans.

4. "Some pieces seem unsupported, such as the target preservation."
In fact, to see how useful the target information preserving (TIP) term is, we compared CTC-TIP with all the other methods, which do not consider such information (see Tables 2 and 3).

5. "Figure 4 shows a learned transformation though no discussion is offered to explain to the reader why this is interesting."
Thanks for the nice suggestion. We will illustrate how the difference between DIP and CTC is related to the location-scale transformation learned on the data.

6. "As a side point, experiments with the office-caltech dataset should use the modern deep features."
Thanks for the suggestion. To further improve the performance, one should indeed use such modern deep features as input. However, because this paper aims to present a rather general approach to domain adaptation with given features, feature learning is not our focus, and we simply used commonly used features when comparing the different domain adaptation algorithms. Nevertheless, it is very interesting to see how the performance can further benefit from deep features; we will conduct the study you suggested, which will take some time.
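To make the role of the location-scale transform concrete, below is a minimal illustrative sketch (variable and function names are ours, not the paper's; in the paper the per-class scale and shift parameters, together with the class ratio, are learned by minimizing a distribution-matching criterion, which is omitted here):

```python
import numpy as np

def location_scale_transform(X_src, y_src, scale, shift):
    """Apply a per-class location-scale transform to source features.

    X_src : (n, d) source feature matrix
    y_src : (n,) integer labels in {0, ..., C-1}
    scale : (C, d) per-class scale parameters a(y)
    shift : (C, d) per-class shift parameters b(y)
    """
    # Each source point x with label y is mapped to a(y) * x + b(y),
    # so that the transformed class-conditionals P(X | Y) can better
    # match those of the target domain.
    return scale[y_src] * X_src + shift[y_src]

# Toy usage: 2 classes, 3-dimensional features.
X_src = np.random.randn(5, 3)
y_src = np.array([0, 1, 0, 1, 1])
scale = np.ones((2, 3))    # identity scale as an initial value
shift = np.zeros((2, 3))   # zero shift as an initial value
X_new = location_scale_transform(X_src, y_src, scale, shift)
```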
- To reviewer 2

1. "The overall formulation involves a discriminative term which is not explained by Theorem 1. How effective is it? What are the results without using this term?"
Please refer to our fourth answer to reviewer 1.

2. "It would be great to show some sensitivity analyses of the results w.r.t. the hyper-parameters."
Thanks for the nice suggestion. We are conducting this study now and will try to include it in the final version.

- To reviewer 3

1. "Figure 1 is somewhat confusing, since the semantics of edges is not defined."
We were trying to model the causal generating process of the features from the label Y. However, we do not need to know the causal direction between the domain variable and the other variables. To avoid any possible confusion, we decided to add these directions so that the graph forms a causal Bayesian network. Note that the dashed line means that we would like the dependence between Y and X^{orthogonal} to be as weak as possible.

2. "P^{new} is not defined."
P^{new} is defined in equation (3). It is a new domain constructed from the source domain to match the target domain (a brief illustrative sketch is given at the end of this response).

3. "It would strengthen the paper to measure the change in P(Y) in the real datasets and relate this to the relative performance of DIP and the proposed method."
Please refer to our third answer to reviewer 1.

4. "It appears that hyperparameters are tuned on the target domain for the WiFi task."
We used a small subset of the test data to tune the hyperparameters only on the WiFi data, as done by Pan et al. 2011 and Zhang et al. 2013b, for a fair comparison.
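For completeness, here is a minimal sketch of the class-ratio reweighting idea behind the construction of P^{new} (the names are ours and this is only illustrative; equation (3) in the paper gives the precise definition). Each source sample with label y is weighted by beta(y) = P^t(y) / P^s(y), so that the reweighted (and transformed) source domain matches the target class distribution:

```python
import numpy as np

def class_ratio_weights(y_src, p_target):
    """Per-sample importance weights beta(y) = P_target(y) / P_source(y).

    y_src    : (n,) integer source labels in {0, ..., C-1}
    p_target : (C,) estimated target class prior
    """
    classes, counts = np.unique(y_src, return_counts=True)
    p_source = np.zeros_like(p_target)
    p_source[classes] = counts / y_src.size   # empirical source class prior
    beta = p_target / p_source                # class ratio between domains
    return beta[y_src]                        # one weight per source sample

# Toy usage: the estimated target prior differs slightly from the source prior.
y_src = np.array([0, 0, 0, 1, 1])
weights = class_ratio_weights(y_src, p_target=np.array([0.5, 0.5]))
```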