Paper ID: 960
Title: Mixture Proportion Estimation via Kernel Embeddings of Distributions

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a new method to estimate the mixture proportion of two distributions, say H and G, given samples drawn from the mixture distribution and from H alone. The difficulty is that no sample is available directly from G. Mixture proportion estimation typically arises in learning from positive and unlabeled data (PU learning). The proposed method uses the so-called C-distance, which is based on the kernel mean embedding of a probability measure. Convergence of the empirical C-distance, and of the estimator based on it, is proven under a separability condition.

Clarity - Justification:
The motivation and the algorithm are well described. It is easy to implement the method from the given description.

Significance - Justification:
The method is a simple application of the kernel mean embedding, and it is easy to implement. Moreover, the convergence analysis indicates that the method is not directly affected by the curse of dimensionality. This is a significant point in favor of the proposed method.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The problem and the method are significant, as described in the "Significance" comment. However, I have a theoretical concern about the analysis. The analysis does not use a uniform bound. Since \lambda must be optimized over, the difference between the population and empirical C-distance should be bounded uniformly over all \lambda, i.e., one needs control of \sup_{\lambda} |\hat{d}(\lambda) - d(\lambda)|; the current analysis bounds the deviation only for an arbitrarily chosen, fixed \lambda. The tools given in the paper are therefore not sufficient to prove Theorems 12 and 13, and this flaw should be fixed to make the proofs rigorous.

The convergence rates for the value and gradient thresholding methods differ (O(m^{-1/4}) vs. O(m^{-1/2})). Is this just a technical artifact, or is it an essential difference?

In the numerical experiments, the two methods KM_1 and KM_2 show different performance. It would be helpful to explain why this difference occurs.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new algorithm for mixture proportion estimation using kernel mean embeddings. The paper presents both a practical algorithm and a thorough theoretical analysis.

Clarity - Justification:
The paper is clearly written with no language errors.

Significance - Justification:
The paper proposes a new algorithm for mixture proportion estimation using kernel mean embeddings. The paper presents both a simple, practical algorithm and a thorough theoretical analysis. Previous algorithms either lacked strong theoretical justification or were difficult to implement (e.g., the estimation of the derivative of the ROC curve in Blanchard et al. (2010)). This paper clearly improves on the state of the art.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
A weakness of mean embeddings is that hyper-parameter selection is often not possible using the same objective function. What is the motivation for the selection objective that maximizes ||\phi(F) - \phi(H)||?
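To make my question concrete, here is a rough sketch of how I read this selection rule. It is my own reconstruction, not the authors' code: I assume a Gaussian kernel and a grid search over the bandwidth, and all names here are hypothetical. The point is that the empirical ||\phi(F) - \phi(H)||^2 reduces to means of Gram matrices, so the rule is cheap to evaluate.

    import numpy as np

    def gaussian_gram(X, Y, sigma):
        # Gram matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    def embedding_distance_sq(X_f, X_h, sigma):
        # Empirical ||phi(F) - phi(H)||^2: the squared RKHS distance
        # between the two kernel mean embeddings (a biased V-statistic,
        # which suffices for ranking bandwidths).
        k_ff = gaussian_gram(X_f, X_f, sigma).mean()
        k_hh = gaussian_gram(X_h, X_h, sigma).mean()
        k_fh = gaussian_gram(X_f, X_h, sigma).mean()
        return k_ff + k_hh - 2.0 * k_fh

    def select_bandwidth(X_f, X_h, sigmas):
        # Keep the bandwidth that maximizes ||phi(F) - phi(H)||.
        return max(sigmas, key=lambda s: embedding_distance_sq(X_f, X_h, s))

    # Hypothetical usage: X_f is the mixture sample, X_h the component sample.
    # sigma = select_bandwidth(X_f, X_h, sigmas=[0.1, 0.5, 1.0, 5.0])
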
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors consider the problem of estimating the weight of a component in a mixture of two distributions. This problem, clearly and simply delimited, appears to be a key issue in a number of learning problems. The paper provides an algorithm, with proven convergence rates under some conditions on the underlying distributions, that embeds the distributions into a universal RKHS. Experiments show that this algorithm performs at least as well as state-of-the-art methods.

Clarity - Justification:
The paper is very clear, very well written, and pleasant to read. All proofs are carefully and exhaustively written in the supplementary material, and clearly summarized in the paper itself. Several figures help in understanding the main results. The comparison with other methods, the discussion, and the experiments are properly conducted.

Significance - Justification:
The issue tackled by the authors is a scientific problem that is central to a number of practical developments. The problem is not easy to solve, as it requires stating realistic assumptions, i.e., assumptions compatible with real data, that guarantee the existence of a provably correct estimation algorithm (without assumptions, the problem is ill-posed). The authors' proposals are convincing: they constitute significant improvements over previous work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I appreciate this paper for the issue it studies, for the relevance of its proposals, and for the rigor and quality of the writing.

Detailed remarks:
- Could you clarify the sentence "If one assumes ... from the mixture" in the second paragraph of the introduction? (lines 75-77)
- I think there is a mistake in the statement of Def. 9: the class F does not appear. (lines 301-309)
- Line 382: Lemma 1 -> Lemma 6.

In the supplementary material:
- Line 1040: the second [0,1] should be [0, \lambda^*].
- Line 1116: typo (,.)
- Line 1482: h should be \overline{h} in the last formula.

=====