Paper ID: 405
Title: A New PAC-Bayesian Perspective on Domain Adaptation

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors propose a PAC-Bayesian bound for unsupervised DA. The difference between domains enters their bound as a multiplicative factor, because they use something like the exponential of the Rényi divergence instead of the discrepancy used in the works of Ben-David et al. (NIPS 2006) and Mansour et al. (COLT 2009). As a result, the bound in this submission makes more sense than the previous PAC-Bayesian bound for DA from ICML'13, because the only non-estimable term in it does not depend on the choice of the predictor (in contrast, the ICML'13 bound contains a weird term that the authors had to ignore during the optimisation even though it depends on the choice of the predictor).

Clarity - Justification:

The paper is well written and provides clear definitions of the setup, its goal, motivation, and results.

Significance - Justification:

The paper provides some progress in the area of PAC-Bayesian analysis of domain adaptation. However, the added insights it contains are relatively incremental, and the direction of PAC-Bayesian analysis of domain adaptation does not look very promising.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

My main concern is the divergence measure used. Even though the result is originally more general, the authors eventually consider the case where it is the maximum (inverse) weight ratio (see equation 8). Because it is a constant, and is further multiplied by another constant (Corollary 8), in the experiments the authors simply choose it by cross-validation. However, the bound makes sense only as long as the weight ratio is finite, which often is not the case; and if it is infinite, the whole bound is uninformative. Therefore I do not know how much of an improvement their result gives. In addition, the empirical evaluation is rather weak: the authors do not report standard deviations, and the improvement over the baselines seems rather small (less than half a percent). I suppose the authors consider this work to be theoretical rather than of practical importance, but my point is that the empirical evaluation does not really show how much more informative their bound is.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper addresses the problem of Domain Adaptation (DA), where the learner observes labelled data from the source domain and unlabelled data from the target domain (which is assumed to be different from the source), with the final goal of minimizing the prediction risk under the target domain. The paper studies classification problems with the binary loss. The authors approach this problem from the PAC-Bayesian perspective. In particular, they derive a novel upper bound on the target risk of the stochastic Gibbs predictor (Theorem 3), involving three different terms (these terms turn out to be closely related to the quantities involved in the classical DA upper bounds). While these quantities depend on the source distribution and the input marginal of the target distribution, the authors apply the well-known PAC-Bayesian bounds to upper bound them using empirical quantities (Corollary 5), and as a result obtain a data-dependent PAC-Bayesian DA upper bound (Theorem 6).
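If I transcribe it correctly (in my own notation, and up to the exact constants), the main bound of Theorem 3 has roughly the multiplicative form

$$R_{P_T}(G_\rho) \;\le\; \frac{1}{2}\, d_{P_T^X}(\rho) \;+\; \beta_q(P_T \,\|\, P_S)\, \big[ e_{P_S}(\rho) \big]^{1 - \frac{1}{q}} \;+\; \eta_{P_T \setminus P_S},$$

where $d_{P_T^X}(\rho)$ is the expected disagreement of pairs of predictors on the target marginal, $e_{P_S}(\rho)$ is the expected joint error on the source, $\eta_{P_T \setminus P_S}$ accounts for the target mass lying outside the source support, and $\beta_q(P_T \| P_S) = \big[ \mathbf{E}_{(x,y) \sim P_S} \big( P_T(x,y) / P_S(x,y) \big)^q \big]^{1/q}$ is a Rényi-type moment of the density ratio (with $q \to \infty$ recovering the maximum weight ratio of equation (8)).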
Since bounding the risk of the Gibbs classifier was not the main goal of the paper, the authors next specialize their upper bound to the class of linear predictors (Section 7). In this particular case, (a) it is known that the Bayes majority predictor simply coincides with a properly chosen linear predictor (see lines 633-636), and (b) the risk of the Bayes predictor is upper bounded by twice the risk of the Gibbs predictor (this holds for any class of predictors; a one-line argument is recalled in my detailed comments below). These two points imply that the PAC-Bayesian upper bound on the Gibbs predictor translates into a DA upper bound for individual linear predictors (Theorem 8). Finally, the authors optimize the resulting bound with respect to the weight vector, which results in a learning algorithm called DALC. The authors also provide a number of experiments (on both synthetic and real data), where they compare the new algorithm with several previous state-of-the-art algorithms.

Clarity - Justification:

The exposition is very nice and easy to follow. The authors provide a nice overview of the previous literature. All the definitions and theoretical results are nicely discussed. However, there is a major concern: the results of Table 1 (the main experimental results supporting the theory developed in the paper) are provided without error terms. This makes the comparison of the presented numbers (of the algorithms) meaningless.

Significance - Justification:

First of all, I think some of the results proved in the paper are already known (or follow easily from existing results). These include:
(1) I think Theorem 4 follows easily from Theorem 5 of (Maurer, A., 2006, "A note on the PAC Bayesian Theorem") when combined with well-known bounds on the kl-small function.
(2) I also think that Corollary 5 already appeared as Theorem 25 in (Germain et al., 2015, "Risk Bounds for the Majority Vote: From a PAC-Bayesian Analysis to a Learning Algorithm").

Second, the general idea of the paper is more or less equivalent to that of (Germain et al., 2013), where the authors used a slightly different upper bound on the true risk of the Gibbs predictor instead of Theorem 3, but likewise combined it with the PAC-Bayesian analysis to obtain a data-dependent upper bound, which was finally minimized. This fact is openly discussed in the current paper.

Considering these two facts, the main novel contribution of the paper is Theorem 3, which is claimed to be superior to the similar inequality of (Germain et al., 2013). This is discussed on lines 412-427 and lines 580-598. The authors' main point is that the new inequality can be optimized without discarding any of its terms, whereas the bound of (Germain et al., 2013) included a parameter-dependent term which was unknown and thus had to be discarded in the optimization. Summarizing, we should expect this to result in a different and improved behaviour of DALC compared to the PBDA of (Germain et al., 2013). Unfortunately, as stated above, it is impossible to conclude from Table 1 whether the new method outperforms the old one, since the errors are provided without variances. At the same time, in Figure 2 we see that both methods behave quite similarly and attain identical errors. Overall, I think that the paper does not present a significant amount of novel insights and/or novel techniques useful in the future.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

In summary, I think the problem is well motivated and important.
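For completeness, the factor-of-two relation invoked in point (b) of my summary is the standard Markov-inequality argument: writing $W_\rho(x,y) = \mathbf{E}_{h \sim \rho}\, \mathbf{1}[h(x) \neq y]$ for the $\rho$-average disagreement with the true label, the majority vote $B_\rho$ errs only if $W_\rho \ge 1/2$ (tie-breaking aside), so

$$R(B_\rho) \;=\; \mathbf{E}_{(x,y)}\, \mathbf{1}\big[ W_\rho(x,y) \ge \tfrac{1}{2} \big] \;\le\; \mathbf{E}_{(x,y)}\, 2\, W_\rho(x,y) \;=\; 2\, R(G_\rho).$$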
The paper is nicely written and has a transparent exposition. However, I feel that the novel results are rather marginal (as detailed above).

Minor comments:
============
(1) There are multiple typos in the text. I would suggest proofreading the whole text once again.
(2) The quantity defined in (7) may be infinite. To make it well defined, the authors should impose some assumptions: for instance, that T and S are absolutely continuous w.r.t. the Lebesgue measure and that the ratio of the corresponding densities t/s belongs to $L_q$. This also applies to the inequality on line 378, where Hölder's inequality is applied.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper proposes a new take on the PAC-Bayesian generalization analysis of domain adaptation. The main bound depends on a divergence term between the source and target joint distributions, which is independent of the choice of hypothesis class. This makes the bound different from previous works. In more detail, the bound tells us that we can achieve a low risk of the Gibbs classifier (GC) on the target domain by choosing the feature space, hypothesis class, and prior such that: 1) the source and target joint distributions are close according to the divergence term; 2) predictors agree on the labelling of the source domain, in other words the GC performs well on the source; 3) predictors agree on the target marginal distribution; 4) there is substantial overlap between the supports of the source and target joint distributions. The last condition is related to the usual minimal joint error term in other DA works; however, it makes a link to the overlap of the domain supports, which makes it more interpretable. The paper also derives data-dependent bounds and proposes an algorithm motivated by them.

Clarity - Justification:

The paper is not very easy to follow. The results are scattered all around the paper and start only on page 4. It would help a lot if the results were stated up front, with the derivations to follow.

Significance - Justification:

This paper makes an incremental contribution, but it might be of interest to the DA community. The general message is the same as in previous works on DA theory: perform well on the source, the source has to be close to the target, and the labelling has to be aligned in some way.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

As mentioned in the summary, the main novelty is in the divergence term, which is now hypothesis-class-free. On the downside, it cannot be estimated without labelled data. While deriving their algorithm, the authors chose to ignore this term, fixing it as a constant, and focused on the remaining terms instead. However, it is unclear to what extent the divergence actually affects generalization. In practice, many previous works in DA introduced different parameterizations via proxy-distances to analogous divergences (such as $H\Delta H$) which depend only on unlabelled data; this allowed additional flexibility in the design of algorithms (see the sketch below). In the proposed theory one can only do this by using labelled data from both domains, which defeats the purpose of unsupervised DA.
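To make the last point concrete, here is a minimal sketch in Python (scikit-learn assumed available; the helper name and the toy data are mine, purely for illustration) of how the usual proxy A-distance to an $H\Delta H$-style divergence is estimated from unlabelled samples alone, which is precisely what the joint-distribution divergence of the present paper does not allow:

    # Sketch: estimating the proxy A-distance of Ben-David et al. (2007)
    # from unlabelled samples only. Helper name and data are illustrative.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def proxy_a_distance(Xs, Xt, seed=0):
        """Train a domain classifier to separate source from target inputs;
        the better the separation, the larger the (proxy) divergence."""
        X = np.vstack([Xs, Xt])
        d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])  # domain labels only
        X_tr, X_te, d_tr, d_te = train_test_split(
            X, d, test_size=0.5, random_state=seed, stratify=d)
        # A linear model stands in for the best hypothesis in the class;
        # ideally err would be the minimal achievable domain error.
        clf = LogisticRegression(max_iter=1000).fit(X_tr, d_tr)
        err = 1.0 - clf.score(X_te, d_te)  # domain-classification error
        return 2.0 * (1.0 - 2.0 * err)     # standard proxy A-distance

    # Toy check: a mean-shifted Gaussian target yields a positive distance.
    rng = np.random.RandomState(0)
    Xs = rng.randn(500, 2)
    Xt = rng.randn(500, 2) + np.array([1.0, 0.0])
    print(proxy_a_distance(Xs, Xt))

Note that proxy_a_distance consumes only the unlabelled inputs Xs and Xt, so it can drive algorithm design in the unsupervised setting, whereas the divergence of equations (7)-(8) would additionally require target labels.

=====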