We thank the reviewers for their feedback and suggestions. We address the reviewers' questions below.

R1: Generalization bound may not be very informative:

- "if the integrality margin is less than zero, ... then the bound is vacuous": Indeed, if many instances are fractional (negative margin), then our bound will be loose. But this actually makes sense: we cannot expect test instances to be tight when many training instances are not. On the other hand, if the relaxation is tight with a margin for many instances, then our bound will be meaningful. As the reviewer points out, the margin can sometimes be small; in that case, more training samples are needed to obtain a tight bound. We remark that a margin requirement also appears in previous related bounds (e.g., McAllester 2007, Weiss & Taskar 2010).

- "The paper gives two examples of models for which the integrality margin has any hope of being positive": Our analysis provides sufficient conditions under which the relaxation is _guaranteed_ to be tight with a margin. We believe that Prop. 4.3 is likely to hold for many supermodular models used in practice. As mentioned in the review, our experiments show that the margin can be positive even when these conditions are not satisfied.

1) Prop. 4.4: "We can get the best fractional score by subtracting an infinitesimal weight… Am I missing something?" There appears to be a misunderstanding here. In our definition of F* (L451), we optimize over the fractional _vertices_ of the local polytope (L443), not over all fractional points in its convex hull. The vertices are determined by the geometry of the polytope and do not depend on the potentials, and it is known that the vertices of the local polytope in this setting are half-integral (L1196). Therefore, the best fractional vertex has value at least beta/2 worse than the optimum. Also note that when edges do not appear in the model at all (as opposed to being present with potential zero), the polytope has no fractional vertices, and by convention we set F* = -inf (L453), so the proposition still holds. (A brief notational sketch is appended after our response to R1 below.)

2) We agree this is a subtle point, which we discuss in L620-628 and L852-858. The LP relaxations for models satisfying the assumptions of Prop. 4.3 and 4.4 _are_ known to be tight (though not necessarily with a margin). However, these are _conditional_ models, where for each data point the potentials are derived from the features and weights as shown in L220-230. Thus, there is no guarantee that the models for test data points satisfy the assumptions (e.g., supermodularity). The point of this section is to show that if the potentials resulting from learning happen to satisfy these assumptions on the training data, then they are tight with a margin, which allows us to bound the tightness at test time. We will clarify this point and improve the presentation in Section 4.2.1.

3) Regarding the small empirical margin: see our reply to the first question above.

4) We thank the reviewer for suggesting the structured ramp loss, which is indeed an interesting alternative approach. We will need to consider this point more carefully. For now, we point out that our analysis can be used to obtain bounds at various levels of integrality, similar to the suggested L1 measure, as we discuss in L570-579. Further, it seems to us that the parameter sigma in McAllester & Keshet's Theorem 4 plays a role similar to our margin, and it is not immediately obvious that one can obtain a tighter bound using their analysis.

5) Learning supermodular scores using linear constraints is studied in Taskar et al. (2004). We will add a reference.

6) From Shalev-Shwartz et al. we only use the bound on ||w*||, which appears at the end of Theorem 1.
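For concreteness, here is a brief sketch of the quantities in R1(1), in our own illustrative notation (the precise definitions are in L443-L453): letting V_I and V_F denote the integral and fractional vertices of the local polytope (both fixed by the polytope geometry, independent of the potentials \theta),

    I^*(\theta) = \max_{\mu \in V_I} \theta^\top \mu,
    F^*(\theta) = \max_{\mu \in V_F} \theta^\top \mu   (with F^*(\theta) = -\infty when V_F = \emptyset),

and the integrality margin is, up to the paper's exact normalization, I^*(\theta) - F^*(\theta). Because F^* maximizes over the finitely many half-integral fractional vertices rather than over all fractional points of the polytope, it is not driven arbitrarily close to I^* by an infinitesimal change of the weights; this is the content of our reply in R1(1).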
R3: Thank you for the positive feedback.

R4:

1) We agree that when the relaxed loss is zero (i.e., the data is algorithmically separable), the relaxation is tight for training instances. As we discuss in Section 6, this was observed by Kulesza & Pereira. Our analysis adds two theoretical insights to this result. First, we explain the tightness of _test_ instances through our generalization bound, which is not implied by the above observation. Second, we discuss extensively cases where the relaxation may be tight even without algorithmic separability (see L348-415).

2) F1 score: the reason for reporting F1 here is label imbalance. Specifically, the number of 1's is rather small, so the all-zero prediction already has high accuracy. F1 is then a more informative measure of performance and is commonly used for multi-label classification problems (a toy numerical illustration is appended at the end of this response). Our focus here is the computational complexity of inference rather than the performance of the predictor. We will clarify this point.

3) Experiments on only two datasets: in addition to the two multi-label tasks (Yeast and Scene), we also showed results for an image segmentation task (end of Section 5). We emphasize that although this setting is quite different from the multi-label setting, we observe very similar behavior.
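To make the label-imbalance point in R4(2) concrete, here is a small self-contained toy computation (sketch only; the label rate and sizes below are hypothetical and are not statistics of Yeast or Scene): with sparse positive labels, the trivial all-zero predictor attains high per-label accuracy yet zero F1, which is why we report F1.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical multi-label data: 1000 examples, 14 binary labels, ~5% positives.
    y_true = (rng.random((1000, 14)) < 0.05).astype(int)
    y_pred = np.zeros_like(y_true)                 # trivial all-zero predictor

    accuracy = (y_true == y_pred).mean()           # high (~0.95) despite predicting nothing
    tp = ((y_true == 1) & (y_pred == 1)).sum()
    fp = ((y_true == 0) & (y_pred == 1)).sum()
    fn = ((y_true == 1) & (y_pred == 0)).sum()
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0      # micro-F1 = 0 here: no true positives
    print(f"accuracy={accuracy:.3f}  micro-F1={f1:.3f}")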