Paper ID: 1155
Title: Estimation from Indirect Supervision with Linear Moments

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies the problem of learning a structured prediction model from indirect supervision signals. Instead of using a maximum marginal likelihood approach, the authors propose a moment-based estimator. First, the proposed algorithm solves a linear system that relates the expected observed feature vector, available through the indirect supervision, to the expected sufficient statistics. The model parameters are then obtained via the maximum likelihood principle. The authors study two settings, learning with local privacy constraints and learning from lightweight annotations; for both, they provide examples and show how to estimate the sufficient statistics. They then give an asymptotic analysis of the moment-based estimator and of the maximum marginal likelihood estimator. Experiments show that the proposed approach achieves good performance, especially in the early stages of learning.

Clarity - Justification:
The paper is easy to follow. The introduction motivates the problem well.

Significance - Justification:
Learning with indirect supervision is an important problem. To the best of my knowledge, the proposed approach is new and interesting. However, it is not clear whether the approach can handle problems in which $S(o|y)$ is complex (e.g., semantic parsing, or the cases in Steinhardt & Liang (2015) where modeling $S(o|y)$ is intractable).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, this is a nice and interesting paper; I enjoyed reading it.

Comments:
1. It is unclear to me how this work relates to previous work that uses moment-based methods for latent structured learning.
2. As I understand it, in the lightweight annotation problems the authors ignore the transition probabilities and use only node features (a zeroth-order Markov assumption). This is probably not a realistic setting for most structured problems. I wonder whether this assumption stems from a restriction of the model or is made only for ease of developing the estimator.
3. Following on from the above, it is unclear to me whether the proposed approach is general enough to handle problems in which $S(o|y)$ is very complex (e.g., semantic parsing).
4. The experimental settings and results are relatively weak. Some alternative approaches are missing from the comparisons (e.g., Steinhardt & Liang (2015)).
5. (minor) Figure 5 is too small to read.
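To check my reading of the two-step estimator summarized above, here is a minimal sketch on the simplest instance I could construct, randomised response over plain categorical labels. The toy setup and all names are mine, not the authors'.

import numpy as np

# Toy sketch (my own construction, not the authors' code).
rng = np.random.default_rng(0)
K, n, p = 4, 50_000, 0.7                  # labels, samples, retention prob.

# Known supervision channel S[o, y] = P(o | y): keep y with prob. p,
# otherwise release a uniformly random label.
S = p * np.eye(K) + (1 - p) / K * np.ones((K, K))

true_mu = np.array([0.1, 0.2, 0.3, 0.4])  # E[one-hot(y)], unknown in practice
y = rng.choice(K, size=n, p=true_mu)
keep = rng.random(n) < p
o = np.where(keep, y, rng.choice(K, size=n))   # only o is observed

# Step 1: invert the linear system E[one-hot(o)] = S @ E[one-hot(y)]
# to recover the latent moments from the observed ones.
obs_mu = np.bincount(o, minlength=K) / n
mu_hat = np.linalg.solve(S, obs_mu)

# Step 2: fit the exponential family by moment matching. For a plain
# categorical model, grad A(theta) = softmax(theta), so matching the
# recovered moments gives theta = log(mu_hat) up to an additive constant.
theta_hat = np.log(np.clip(mu_hat, 1e-12, None))
print(np.round(mu_hat, 3))                # should be close to true_mu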
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper deals with learning not from standard supervision in the form of labeled data, but from indirect supervision in the form of: 1) differentially private data, and 2) lightweight annotations such as label proportions. The key idea is to be given the (indirect) supervision function and to use it to estimate the sufficient statistics of the exponential family model.

Clarity - Justification:
The paper is in general clearly written. However, I am struggling to understand the context of structured prediction in this manuscript. Lines 118-129 clearly define the structured prediction setting, where the structure lies in the labels y. In learning with lightweight annotations (Section 4), the independence assumption (Assumption 2) seems to make the problem unstructured. In learning with local privacy (Section 7), the pairwise features are defined in terms of the x's and not the y's.

Significance - Justification:
The motivation is good: how far will this paradigm take us, in which, given an exponential family model, one first solves a linear system of equations using knowledge of the (indirect) supervision function to estimate the sufficient statistics of the exponential family, and then solves a convex optimisation problem to obtain the parameters of the model? However, the claims of usefulness for structured prediction problems are not well founded, due to: 1) the independence assumption, and 2) the absence of proper pairwise features between neighbouring labels.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The asymptotic analysis in Section 5 covers the case of randomised response, but Section 3.1 declared this mechanism unusable for structured prediction problems with exponentially sized label spaces. The analysis for structured randomised response is not provided.

Experiments in Section 7:
1) Local privacy: since the structure is on the x's rather than on the y's, will the Laplace mechanism deliver classic randomised response or structured randomised response (see the sketch at the end of this review)? Is it possible to include baseline comparisons with other differentially private methods?
2) Lightweight annotations: without edge/pairwise potential functions, we can regard this problem as multi-class classification with known label proportions. Can a direct multi-class, one-versus-rest, or one-versus-one strategy be compared against Quadrianto et al. (2008)? "(Almost) No Label No Cry" (Patrini et al., NIPS 2014) seems to address the conditional independence assumption issue (lines 840-841).

Summary: Interesting paper, but the contribution is minimal/unclear at this stage. The claim of structured prediction needs to be clarified, and baseline comparisons with related methods have to be provided.
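To make the Laplace-mechanism question above concrete, here is how I read the moment recovery in the local-privacy setting; the toy setup and names are mine, not the authors'. The released vector is the per-example statistic plus zero-mean noise, so the observed-to-latent "linear system" is just the identity and the latent moments follow from a plain empirical average.

import numpy as np

# Toy illustration (mine): under the Laplace mechanism each example
# releases its feature vector plus zero-mean Laplace noise, so
# E[released] = E[f(x, y)] and no channel inversion is needed.
rng = np.random.default_rng(1)
n, d = 20_000, 3
eps, sens = 1.0, 1.0                    # privacy budget, L1 sensitivity (assumed)

phi = rng.random((n, d))                # stand-in for per-example statistics f(x, y)
released = phi + rng.laplace(scale=sens / eps, size=(n, d))

mu_hat = released.mean(axis=0)          # unbiased for E[f(x, y)]
print(np.round(mu_hat, 3))
print(np.round(phi.mean(axis=0), 3))    # compare with the noise-free mean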
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a method for learning a predictor p(y^i|x^i) with latent y^i, where the indirect supervision o^i comes through a *known* stochastic function of a linear combination of the sufficient statistics of p(y^i|x^i). This allows the expectations of the sufficient statistics to be backsolved as a linear system, which in turn permits the model parameters of p(y^i|x^i) to be estimated directly through (globally optimal) convex optimization. Despite the limitation that the supervision must be linear, the authors show that a differential privacy model and a count-based aggregation of tags in a sequential model both adhere to the linear supervision criterion. Theoretical results and an empirical comparison to marginal likelihood (non-convex) optimization are presented.

Clarity - Justification:
Very well-written.

Significance - Justification:
I see this paper as a useful special case of moment-based estimators that does not rely on general tensor factorization methods. That said, I am not convinced that the limited class of models or the advantages of this method will see it used widely in practice (more in the detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is well-written (though dense at points) and makes a variety of contributions, as outlined in the summary. It is a solid paper that should be seriously considered for acceptance at ICML.

My primary concern with the paper relates to its significance, and hence I have the following questions (see also the sketch after question (4)):
(1) Is anyone really going to use the differential privacy model, the aggregated tag model, or another variant of the linear supervision framework? The tagging model seems quite contrived to me, but perhaps the privacy problem has more immediate application?
(2) Do the results in this paper follow directly from previous tensor factorization/moment work when the supervision is linear? (Certainly the theoretical and empirical analyses are nonetheless useful, but it seems the key contribution is just a special case of previous work. Is this the paper that should have preceded the tensor factorization work, as a simpler case? I can see the didactic value of having this work as a stepping stone to the tensor work.)
(3) The theoretical analysis reveals conditions under which the moments approach (MOM) can be seen as a single-iteration view of EM. Presumably then, under noise and aliasing, we should expect MOM to be much better than one iteration of EM?
(4) While an argument can be made for one-pass algorithms, multiple passes over "Big Data" are not unrealistic thanks to modern distributed and GPU architectures, especially if they yield superior results. Is there any justification for MOM over MARG beyond the mild computational benefits, if MARG empirically outperforms MOM given enough passes?
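A footnote to question (1): the reason the aggregated-tag model fits the linear framework at all is that the expected tag-count vector is a fixed linear function of the per-position indicator statistics. A toy check of that identity follows; the setup and names are mine, and I sample positions independently purely for brevity (linearity of expectation does not require it).

import numpy as np

# Toy check (mine, not the authors'): the tag-count observation
# o_t = sum_i 1[y_i = t] has expectation sum_i P(y_i = t), i.e. it is
# linear in the node-marginal sufficient statistics.
rng = np.random.default_rng(2)
L, K, n = 6, 3, 200_000                  # sequence length, tag set size, samples

marginals = rng.dirichlet(np.ones(K), size=L)   # P(y_i = t) for each position i
y = np.stack([rng.choice(K, size=n, p=marginals[i]) for i in range(L)], axis=1)

counts = np.stack([(y == t).sum(axis=1) for t in range(K)], axis=1)
print(np.round(counts.mean(axis=0), 3))         # empirical E[o]
print(np.round(marginals.sum(axis=0), 3))       # sum_i P(y_i = t)

=====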