We thank the reviewers for their thoughtful feedback and comments. The consensus seems to be that our two-step moments-based approach (MomML) is well-motivated, interesting, and clear, but the primary concern is the usefulness and applicability of the method to structured prediction, which we address below.

Reviewer 1 notes that in the lightweight annotations setting our model p(y|x) is fully factorized and questions whether we are actually doing structured prediction. However, since we are learning from indirect supervision o, we need to perform inference in p(y|x,o), and o couples all the y variables. Approximate inference techniques would therefore typically be needed, which is why we still consider this problem to be within the realm of structured prediction. We show that MomML obviates the need for this; furthermore, it eliminates the need for non-convex optimization.

Reviewer 1 notes that the lightweight annotations setting is similar to learning from label proportions. There is a crucial difference: the bags in our setting would correspond to contiguous subsequences of positions. While the model factorizes as a product of p(y_i | x_i), the marginal distribution p(x_1, ..., x_L) is arbitrary, and indeed nearby words in a sentence are highly correlated. Learning from label proportions requires either the independence assumption of Quadrianto et al. 2008 or the distinguishability-among-bags assumption of Patrini et al. 2014, neither of which applies in our lightweight annotations setting.

Reviewer 1 also notes that in the local privacy experiments (Section 7), "the pairwise features are defined in terms of x's and not y's." The graph is fully connected, but y is not a structured object here (even handling this case in the context of local privacy is novel to the best of our knowledge). Our technique generalizes to arbitrarily structured graphical models on x's and y's.

Reviewer 1 asks us to compare the Laplace mechanism with classical randomized response. We use a Laplace mechanism coupled with structured randomized response (choosing a single clique) to construct our estimator. Baseline randomized response and standard differential privacy mechanisms (in the local privacy setting, where individual data points are obfuscated) have error growing linearly in the number of problem parameters, which without structure is exponential, so their error is extraordinarily high; cf. http://arxiv.org/abs/1305.6000.

Reviewer 1 writes that we provided asymptotic properties only for classical, and not structured, randomized response. In fact, we give an asymptotic analysis for general observation functions, which includes structured randomized response (Proposition 2). We gave classical randomized response simply as an example to provide intuition (Lemma 1).

Reviewer 2 asks whether our MomML technique handles more complex settings such as Steinhardt & Liang (2015). The count annotations setting is exactly their "unordered translation" setting, and we show that we can perform estimation in this model without the non-convex optimization or approximate inference that Steinhardt & Liang (2015) resort to.

Reviewers 2 and 3 ask about the relationship of MomML to other moment-based estimators, particularly tensor methods. MomML involves a different set of moment equations from typical tensor methods, which use third-order moments, and our results do not follow directly from previous tensor methods. The MomML estimator is much simpler: it involves solving only a linear system and requires no tensor factorization.
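To make concrete the claim that a moments-based estimator can reduce to solving a linear system, here is a minimal sketch in the spirit of the classical randomized response example (Lemma 1). The flip probability, randomization matrix, and label distribution below are illustrative assumptions, not the paper's actual model or estimator.

```python
import numpy as np

# Minimal sketch: classical randomized response as a moments-based estimator.
# True label y is in {0, 1}; with probability eps the reported label is replaced
# by a uniformly random one. The observed distribution q satisfies q = M p,
# where p is the true label distribution, so a plug-in estimate of p is obtained
# by solving a linear system -- no iterative optimization, no tensor factorization.

eps = 0.4                                        # flip probability (illustrative)
M = np.array([[1 - eps / 2, eps / 2],
              [eps / 2, 1 - eps / 2]])           # known randomization matrix M[o, y] = P(o | y)

rng = np.random.default_rng(0)
p_true = np.array([0.7, 0.3])                    # ground truth, unknown to the estimator
n = 100_000
y = rng.choice(2, size=n, p=p_true)
flip = rng.random(n) < eps
o = np.where(flip, rng.integers(2, size=n), y)   # privatized observations

q_hat = np.bincount(o, minlength=2) / n          # empirical moments of the observations
p_hat = np.linalg.solve(M, q_hat)                # solve the moment equations q = M p
print(p_hat)                                     # close to [0.7, 0.3]
```

The general pattern, empirical moments on one side and a known linear map on the other, is the sense in which no tensor factorization or non-convex optimization is needed.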
Reviewer 3 found the count annotations setting for the tagging model to be "contrived". We would like to point out that count annotations are a natural extension of the popular multiple instance learning setting to structured prediction, in which one observes aggregate quantities over multiple variables (see the short illustrative sketch below). Similar ideas appear in posterior regularization and have been exploited quite extensively in practice.

Reviewer 3 notes that MargML (approximately solved by SGD) outperforms MomML given enough passes over the dataset and asks what the benefits of MomML are. First, unlike MargML, MomML has strong theoretical guarantees, which provide assurance that any remaining error is due to modeling / estimation error rather than local optima. Second, the gap in favor of MomML over MargML would widen as inference becomes more difficult (say, with larger window sizes), since MomML requires no inference whereas MargML requires (approximate) inference.

Reviewer 3 asks whether MomML would be better than one iteration of EM. If the observation function S is deterministic, then MomML is equivalent to one iteration of EM; otherwise, we do not know of any equivalence.
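For concreteness, here is a tiny illustrative example of a count annotation (the sentence and tags are made up, not taken from our experiments): the per-position tags are latent, and the learner observes only their aggregate counts over the sentence, much as a bag label aggregates over instances in multiple instance learning.

```python
from collections import Counter

# Illustrative count annotation: the per-position tags y_1..y_L are latent;
# the annotation is only the aggregate tag counts over the sentence.

x = ["the", "cat", "sat", "on", "the", "mat"]      # observed sentence
y = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]  # latent per-position tags (not observed)

o = Counter(y)   # observed supervision: tag counts only
print(o)         # Counter({'DET': 2, 'NOUN': 2, 'VERB': 1, 'ADP': 1})
```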