Paper ID: 1275
Title: Domain Adaptation with Conditional Transferable Components

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes to learn transferable components conditioned on the label Y for unsupervised domain adaptation. Like traditional domain adaptation methods, it matches the marginal distributions over the input in a transformed space. However, it interprets the marginal distribution through a causal model, which gives rise to both a novel algorithm and a novel bound for the target domain. Experiments on synthetic data, object recognition, and WiFi localization verify the effectiveness of the proposed approach, though the experimental setup could be clearer.

Clarity - Justification:
The paper is well written, and sufficient details are provided for the algorithm.

Significance - Justification:
This paper approaches domain adaptation from a new perspective. It may spur further research interest in domain adaptation.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I am overall positive about this paper. The conditional transferable components are a natural extension of the target and conditional shifts of (Zhang 2013a). However, this extension touches almost every aspect:
1. Compared to (Zhang 2013a), the domain-transferable component (a linear transformation of the features) is introduced.
2. The paper lays out theoretical foundations for the proposed algorithm via Assumption 1.
3. A bound for the target domain is provided.
4. The paper also presents extensive experiments on synthetic data, object recognition, and WiFi localization.

One point of the paper could be strengthened: the overall formulation involves a discriminative term that is not explained by Theorem 1. How effective is it? What are the results without this term? Another minor point: it would be helpful to show sensitivity analyses of the results w.r.t. the hyperparameters \lambda_S, \lambda_L, and \lambda.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an unsupervised domain adaptation approach that seeks to handle the case where both the conditional distribution P(X | Y) and the label distribution P(Y) change from the source to the target domain. The approach consists of finding a transformation T such that the conditional distributions on the source and target domains are equal: P_s(T(X) | Y) = P_t(T(X) | Y). The label distribution of the target domain is estimated using the known source labels and the available unlabeled target data. Thus, the shift in the conditional distribution used for labeling, P(Y | X), can be corrected. The method aligns the source and target distributions by finding the source data points that minimize the MMD to the target distribution under an affine transformation of the source domain; the target label distribution is then simply the empirical distribution of the chosen source examples (a minimal sketch of this matching step is given below).
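For readers unfamiliar with MMD-based matching, here is a minimal, illustrative sketch (not the authors' code) of the distribution-matching step just summarized. It assumes a Gaussian kernel and a plain linear map W in place of the paper's full affine, class-conditional formulation; all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Squared maximum mean discrepancy between two samples:
    || mean_i phi(x_i) - mean_j phi(y_j) ||^2 in the kernel's RKHS."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())

def matching_loss(W, Xs, Xt, sigma=1.0):
    """Distribution-matching loss: MMD^2 between the W-mapped source sample
    and the target sample. Minimizing this over W aligns P_s(Wx) with P_t(x);
    the paper's method additionally conditions the match on the label Y."""
    return mmd2(Xs @ W, Xt, sigma)
```

Minimizing `matching_loss` over W (e.g., by gradient descent) is the marginal-matching idea; the paper's contribution is to perform this match conditionally on Y.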
Clarity - Justification:
This paper suffers from an abundance of notation, which at times obscures the message. The actual algorithm is quite simple and closely related to prior work, though this is hard to tell at first glance. The authors should consider adding more intuitive and direct explanations around the equations, and perhaps also move some sections to the supplement.

Significance - Justification:
The algorithm is very reminiscent of the landmarks paper (Gong et al., 2013), and the authors should explicitly describe this connection as well as their contribution in that context. That being said, there is novelty here and some interesting analysis. Additionally, the authors do experimentally compare to other methods (including landmarks) and appear to offer some quantitative improvements. The target-information-preserving term is interesting but seems relatively unsupported by the experiments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper offers a simple algorithm that appears to outperform prior work experimentally. However, the paper as currently written is too dense, making the work difficult to digest. In addition, some pieces seem unsupported, such as the target-information preservation. For the real datasets, it would be useful to empirically verify and show the reader how the class distribution changes between domains, since addressing this situation is one of the paper's main claims. In general, the paper would benefit from more explanation surrounding both the equations and the tables/figures. For example, Figure 4 shows a learned transformation, yet no discussion is offered to explain why it is interesting, expected, or useful. As a side point, experiments on the Office-Caltech dataset should use modern deep features as input so that performance can be compared to current state-of-the-art adaptation methods rather than only to those from 2013 and earlier.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a solution to domain adaptation based on "conditional transferable components." The approach searches for a linear transformation of the input features that satisfies two criteria: (1) the conditional distribution of each component, P(X^c | Y), is approximately invariant from source to target, and (2) the components retain discriminatory power w.r.t. Y (a sketch of such a two-term objective is given after this review). Experiments on one synthetic and two real datasets suggest that the approach is comparable to or better than many state-of-the-art baselines.

Clarity - Justification:
- Figure 1 is somewhat confusing, since the semantics of the edges are not defined. It is not clear that the figure is worth the space unless it is clarified.
- P^{new} is not defined.

Significance - Justification:
The approach appears novel, though it is closely related to domain-invariant projection (Baktashmotlagh et al., 2013). Figure 2 does a good job of describing the differences from DIP; namely, if P(Y) changes from source to target, DIP may incorrectly identify a component as invariant. It would strengthen the paper to measure the change in P(Y) in the real datasets and relate it to the relative performance of DIP and the proposed method. Does the behavior observed in the synthetic data also arise in real data?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, the paper provides strong theoretical and empirical support for the proposed approach. The proposed method outperforms all baselines on 11/15 comparisons. One concern: it appears that hyperparameters are tuned on the target domain for the WiFi task. Are the hyperparameters of all methods tuned in this way? Is the number of hyperparameters the same for all methods? Is this done only for the WiFi task, or for all tasks? If only for WiFi, why? I would like to be sure the comparisons are fair here.

=====
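To make the two criteria in Review #3's summary concrete, below is a minimal, self-contained sketch of a two-term objective. The class reweighting `beta` is what lets the implied target prior P(Y) differ from the source prior (the case DIP-style marginal matching misses), and the variance-ratio term is only a crude stand-in for the paper's actual discriminative / target-information term; `two_term_objective`, `beta`, and the trade-off weight `lam` are hypothetical names, loosely playing the role of the \lambda hyperparameters Review #1 asks to be analyzed.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel (as in the earlier sketch)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def weighted_mmd2(Xs, Xt, w, sigma=1.0):
    """Squared MMD between a w-reweighted source sample and the target sample:
    || sum_i w_i phi(xs_i) - mean_j phi(xt_j) ||^2 in the kernel's RKHS."""
    return (w @ gaussian_kernel(Xs, Xs, sigma) @ w
            + gaussian_kernel(Xt, Xt, sigma).mean()
            - 2.0 * (w @ gaussian_kernel(Xs, Xt, sigma)).mean())

def two_term_objective(W, Xs, ys, Xt, beta, lam=1.0, sigma=1.0):
    """Illustrative objective combining the summary's two criteria:
    (1) class-reweighted MMD between the W-mapped source and the target,
        where beta[c] rescales class c so the matched target prior P(Y=c)
        can differ from the source prior;
    (2) a crude discriminative proxy (within-class / total variance),
        standing in for the paper's discriminative term."""
    Z = Xs @ W                                 # transformed source components
    w = beta[ys] / beta[ys].sum()              # one normalized weight per point
    match = weighted_mmd2(Z, Xt, w, sigma)
    within = sum(Z[ys == c].var(axis=0).sum() for c in np.unique(ys))
    disc = within / (Z.var(axis=0).sum() + 1e-12)
    return match + lam * disc                  # lam trades off the two terms
```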