Paper ID: 960
Title: Mixture Proportion Estimation via Kernel Embeddings of Distributions

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a new method to estimate the mixture proportion of two distributions, say H and G, given samples drawn from the mixture distribution and from H alone. The difficulty is that no sample is available directly from G. Mixture proportion estimation typically arises in learning from positive and unlabeled data (PU learning). The proposed method uses the so-called C-distance, which is based on the kernel mean embedding of a probability measure. Convergence of the empirical C-distance, and of the estimator based on it, is proven under a separability condition.

Clarity - Justification:
The motivation and the algorithm are well described. It is easy to implement the method from the given description.

Significance - Justification:
The method is a simple application of the kernel mean embedding, and it is easy to implement. Moreover, the convergence analysis indicates that the method is not directly affected by the curse of dimensionality. This is a significant point in favor of the proposed method.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The problem and the method are significant, as described in the "Significance" comment. However, I have a theoretical concern about the analysis. The analysis does not use a uniform bound. Since \lambda must be optimized over, the difference between the population and empirical C-distance should be bounded uniformly over all \lambda, i.e., one needs control of \sup_{\lambda} |\hat{d}(\lambda) - d(\lambda)|; the current analysis bounds the deviation only for an arbitrarily chosen, fixed \lambda. The tools given in the paper are therefore not sufficient to prove Theorems 12 and 13, and this flaw should be fixed to make the proofs rigorous.

The convergence rates for the value and gradient thresholding methods differ (O(m^{-1/4}) vs. O(m^{-1/2})). Is this just a technical artifact, or is it an essential difference?

In the numerical experiments, the two methods KM_1 and KM_2 show different performance. It would be helpful to explain why this difference occurs.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new algorithm for mixture proportion estimation using kernel mean embeddings. The paper presents both a practical algorithm and a thorough theoretical analysis.

Clarity - Justification:
The paper is clearly written with no language errors.

Significance - Justification:
The paper proposes a new algorithm for mixture proportion estimation using kernel mean embeddings. The paper presents both a simple, practical algorithm and a thorough theoretical analysis. Previous algorithms either lacked strong theoretical justification or were difficult to implement (e.g., the estimation of the derivative of the ROC curve in Blanchard et al. (2010)). This paper clearly improves on the state of the art.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
A weakness of mean embeddings is that hyper-parameter selection is often not possible using the same objective function. What is the motivation for the selection objective that maximizes ||\phi(F) - \phi(H)||?
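To make my question concrete, here is a rough sketch of how I read this selection rule. It is my own reconstruction, not the authors' code: I assume a Gaussian kernel and a grid search over the bandwidth, and all names here are hypothetical. The point is that the empirical ||\phi(F) - \phi(H)||^2 reduces to means of Gram matrices, so the rule is cheap to evaluate.

    import numpy as np

    def gaussian_gram(X, Y, sigma):
        # Gram matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    def embedding_distance_sq(X_f, X_h, sigma):
        # Empirical ||phi(F) - phi(H)||^2: the squared RKHS distance
        # between the two kernel mean embeddings (a biased V-statistic,
        # which suffices for ranking bandwidths).
        k_ff = gaussian_gram(X_f, X_f, sigma).mean()
        k_hh = gaussian_gram(X_h, X_h, sigma).mean()
        k_fh = gaussian_gram(X_f, X_h, sigma).mean()
        return k_ff + k_hh - 2.0 * k_fh

    def select_bandwidth(X_f, X_h, sigmas):
        # Keep the bandwidth that maximizes ||phi(F) - phi(H)||.
        return max(sigmas, key=lambda s: embedding_distance_sq(X_f, X_h, s))

    # Hypothetical usage: X_f is the mixture sample, X_h the component sample.
    # sigma = select_bandwidth(X_f, X_h, sigmas=[0.1, 0.5, 1.0, 5.0])
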
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors consider the problem of estimating the weight of a component in a mixture of two distributions. This problem, clearly and simply delimited, appears to be a key issue in a number of learning problems. The paper provides an algorithm, with proven convergence rates under some conditions on the underlying distributions, that embeds the distributions into a universal RKHS. Experiments show that this algorithm performs at least as well as state-of-the-art methods.

Clarity - Justification:
The paper is very clear, very well written, and pleasant to read. All proofs are carefully and exhaustively written in the supplementary material, and clearly summarized in the paper itself. Several figures help in understanding the main results. The comparison with other methods, the discussion, and the experiments are properly conducted.

Significance - Justification:
The issue tackled by the authors is a scientific problem that is central to a number of practical developments. The problem is not easy to solve, as it requires stating realistic assumptions, i.e., assumptions compatible with real data, that guarantee the existence of a provably correct estimation algorithm (without assumptions, the problem is ill-posed). The authors' proposals are convincing: they constitute significant improvements over previous work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I appreciate this paper for the issue it studies, for the relevance of its proposals, and for the rigor and quality of the writing.

Detailed remarks:
- Could you clarify the sentence "If one assumes ... from the mixture" in the second paragraph of the introduction? (lines 75-77)
- I think there is a mistake in the statement of Def. 9: the class F does not appear. (lines 301-309)
- Line 382: Lemma 1 -> Lemma 6.

In the supplementary material:
- Line 1040: the second [0,1] should be [0, \lambda^*].
- Line 1116: typo (,.)
- Line 1482: h should be \overline{h} in the last formula.

=====