Paper ID: 818
Title: Train and Test Tightness of LP Relaxations in Structured Prediction

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Linear program (LP) relaxation is a common approach to approximate inference in structured prediction tasks, which can often be posed as integer linear programs (ILPs). This paper investigates the observed phenomenon that solutions to the relaxed LP are often solutions to the ILP as well.

Clarity - Justification:
The paper is clearly written overall. However, I have doubts about whether the explanations presented are the clearest/simplest available, and about the measures used for evaluation.

Significance - Justification:
Structured prediction is an important area of research, and relaxations that make inference practical are of significant importance for machine learning research. The paper's investigation of LP relaxations is connected with, and motivated by, a large body of related work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I'm confused about why the integrality question for LP relaxations is considered a mystery in this paper. From the maximum-margin perspective, seeking a "margin" that is competitive against only integer-valued solutions will induce learned parameters that make the training data's y's better (by a structured margin) than the other integer solutions. Training under the LP relaxation will instead induce parameters that make the training data's y's better than all continuous-valued solutions. The latter guarantees integer solutions in the (relaxed) LP when the hinge loss is zero, while the former can achieve a hinge loss of zero without integer solutions in the LP. How closely is the integrality level related to the number of support vectors, the margin width, etc.?

The Hamming distance is used as the task loss in the experiments, while the F1 score is reported. SSVMs for F1 exist, so the mismatch between the task loss and the evaluated measure warrants discussion. Why is this useful to report?

The experiments are limited to two datasets, Yeast and Scene, with what appear to be similar task losses. It's difficult to appreciate the generality of the explored phenomenon when only two real datasets are investigated.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
One technique for approximate MAP inference in general MRFs is to replace the marginal polytope with the local marginal polytope. The resulting optimization is an LP (vs. an ILP for exact inference), which can yield non-integral solutions. The gap between the score of the LP solution and that of the optimal integral solution is known as the integrality gap. In practice, we often find that the LP solution is "tight", meaning that the integrality gap is zero. This paper offers a possible explanation of why LP-relaxed inference is often tight, by upper-bounding the integrality gap by the hinge loss. Thus, minimizing the hinge loss during training, with LP relaxation for inference, effectively minimizes the integrality gap. Moreover, the paper shows that the integrality gap at training time generalizes to the population (i.e., in expectation over draws of examples). To verify the results, two experiments are conducted, which demonstrate that minimizing the hinge loss results in a low integrality gap. A side result is that training with exact inference, but using LP-relaxed inference at test time, does not minimize the integrality gap.
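To make the setup concrete, here is a minimal sketch (my own illustration; the graph and potentials are not from the paper) of the local-polytope LP, using scipy.optimize.linprog on the classic frustrated three-node binary cycle with repulsive edge potentials. This is a standard case where the relaxation is NOT tight: the LP attains value 3.0 at all-1/2 fractional marginals, while the best integral labeling can disagree on only two of the three edges and scores 2.0, so the integrality gap is 1.0.

    import numpy as np
    from scipy.optimize import linprog

    n, edges = 3, [(0, 1), (1, 2), (0, 2)]                 # frustrated cycle
    def nv(i, x): return 2 * i + x                         # index of node marginal mu_i(x)
    def ev(e, xi, xj): return 2 * n + 4 * e + 2 * xi + xj  # index of edge marginal mu_e(xi, xj)
    dim = 2 * n + 4 * len(edges)

    # Repulsive potentials: each edge scores +1 when its endpoints disagree.
    # linprog minimizes, so the objective is negated.
    c = np.zeros(dim)
    for e in range(len(edges)):
        c[ev(e, 0, 1)] = c[ev(e, 1, 0)] = -1.0

    A, b = [], []
    for i in range(n):                                     # normalization: mu_i(0) + mu_i(1) = 1
        row = np.zeros(dim)
        row[nv(i, 0)] = row[nv(i, 1)] = 1.0
        A.append(row); b.append(1.0)
    for e, (i, j) in enumerate(edges):                     # local consistency constraints
        for x in (0, 1):
            row = np.zeros(dim)                            # sum_xj mu_e(x, xj) = mu_i(x)
            row[ev(e, x, 0)] = row[ev(e, x, 1)] = 1.0
            row[nv(i, x)] = -1.0
            A.append(row); b.append(0.0)
            row = np.zeros(dim)                            # sum_xi mu_e(xi, x) = mu_j(x)
            row[ev(e, 0, x)] = row[ev(e, 1, x)] = 1.0
            row[nv(j, x)] = -1.0
            A.append(row); b.append(0.0)

    res = linprog(c, A_eq=np.array(A), b_eq=np.array(b), bounds=(0, 1))
    print("LP value:", -res.fun)             # 3.0; the best integral value is 2.0
    print("node marginals:", res.x[:2 * n])  # all 0.5: a fractional vertex

On a tree-structured model, or a binary pairwise supermodular one, the same LP is known to return an integral vertex; the paper's question is why learned models so often land in that tight regime even on loopy graphs.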
My only concern is that the generalization bound may not be very informative, due to the choice of loss function. The integrality ramp loss depends on the "integrality margin" (i.e., the difference between the best integral score, $I^*$, and the best fractional score, $F^*$) being sufficiently positive. If the integrality margin is less than zero, the ramp loss is 1; if this happens frequently enough, the bound is vacuous. The paper gives two examples of model classes for which the integrality margin can be positive, as well as empirical evidence that positive margins do occur. Unfortunately, I found the theoretical explanation a bit confusing, and the empirical results not all that convincing. I have some questions (see the detailed comments) that I would like the authors to respond to during the response period. I also have a suggestion for an alternative analysis that might avoid the need for the $\gamma$-tightness assumption, and could possibly yield a more informative LP-tightness generalization bound.

== UPDATE ==
The author response has answered my doubts about the integrality margin. Apologies for the confusion; I did not pick up on the fact that only the _vertices_ of $M^F$ (i.e., the optima of loss-augmented inference) are used in the integrality loss. Now it all makes sense. I do still think that the structured ramp loss analysis would be more informative and more elegant, but I suppose this could be another paper, or at least an extension for a journal version.

Clarity - Justification:
The results are presented clearly, with intuition and explanation. The writing is excellent. All around, I believe this to be a strong submission. The only part of the presentation that was lacking was Sec 4.2.1, together with the accompanying proofs in the supplementary material, which I found a bit hard to follow.

Significance - Justification:
The idea behind this paper is intriguing and relevant to many facets of structured prediction. I believe it would have only moderate impact, since there aren't really any new suggestions for practitioners (other than to train with the hinge loss). Nonetheless, I think the paper contains significant theoretical insight.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1) Regarding Prop 4.4: Suppose the edge potentials are zero and all node potentials are equal to $\beta$. This model satisfies the conditions of the proposition. (Right?) Thus, the best integral score is $n \beta$. (Still right?) We can get the best fractional score by subtracting an infinitesimal weight, $\epsilon$, from each $\eta_i$, making each over-complete node assignment $(\epsilon, 1 - \epsilon)$. (Still right?) Then the fractional score, $n \beta (1 - \epsilon)$, can be arbitrarily close to the integral score as we shrink $\epsilon$. Am I missing something?

2) Let's assume that I am missing some key detail, and that the results of Sec 4.2.1 are both correct. Are the models that satisfy the conditions of Propositions 4.3 and 4.4 interesting situations in which the fractional solution isn't trivially tight? That is, are these settings in which the LP is not already guaranteed to be tight?

3) In both experiments, the integrality margin for most examples is < 0.1. If you take $\gamma = 0.1$, most examples will have nonzero loss, and the bound is multiplied by 10. If you decrease $\gamma$, the empirical loss might improve, but the bound becomes looser.

4) I suggest analyzing a different loss function, based on the *structured* ramp loss (see, e.g., Keshet & McAllester, NIPS'11).
First, consider a modification of the structured hinge loss in which the ground truth is the integral MAP state, $y_I = \argmax_{\mu \in M^I} \theta^T \mu$:

    hinge(\theta, y_I) = \max_{\mu \in M_L} [ \|\mu - y_I\|_1 + \theta^T \mu ] - \theta^T y_I

This loss wants the margin, $\theta^T (y_I - \mu)$, to scale proportionally to the L1 distance between $y_I$ and $\mu$ (i.e., the L1 distortion). Now consider a modification of the structured ramp loss:

    ramp(\theta, y_I) = \max_{\mu \in M_L} [ \|\mu - y_I\|_1 + \theta^T \mu ] - \max_{\mu' \in M_L} \theta^T \mu'

Since $M_L$ includes the integral solutions, the LP solution's score, $\max_{\mu' \in M_L} \theta^T \mu'$, is at least the score of $y_I$; thus, $ramp(\theta, y_I) \leq hinge(\theta, y_I)$. Moreover, it is easy to show that $ramp(\theta, y_I)$ upper-bounds the L1 distance from $y_I$ to the LP solution (a one-line derivation is sketched after this list); i.e.,

    \| y_I - \argmax_{\mu \in M_L} \theta^T \mu \|_1 \leq ramp(\theta, y_I)

Finally, the ramp loss is upper-bounded by the maximum L1 distortion, which is proportional to the number of variables. (If you want, you can normalize the L1 distortion by dividing by $n$.) Therefore, by applying the generalization analysis to the structured ramp loss, you get that the expected L1 distortion of the LP solution is upper-bounded by the empirical structured hinge loss, plus the generalization error term (which, of course, vanishes with infinite training data). This approach has two benefits: 1) you don't need the $\gamma$-tightness assumption for some constant $\gamma > 0$, since the margin just has to scale proportionally to the L1 distortion; 2) you get a bound on the expected L1 distortion of the LP solution, which is arguably more meaningful than the 0-1 loss of the integrality gap.

5) The claim that "one could learn supermodular models by enforcing linear inequalities" could use further explanation.

6) Are you sure that Shalev-Shwartz et al. (2011) is the correct citation for the comment in lines 545-549? I wasn't aware that this paper contained any generalization results; all I see are regret bounds for the learning objective. Maybe I'm missing something.
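To back up the "easy to show" step in point 4, here is the one-line derivation promised above (my own sketch, in the notation of that point). Write $\mu^* = \argmax_{\mu' \in M_L} \theta^T \mu'$ for the LP solution. Then

    ramp(\theta, y_I) = \max_{\mu \in M_L} [ \|\mu - y_I\|_1 + \theta^T \mu ] - \theta^T \mu^*
                      \geq \|\mu^* - y_I\|_1 + \theta^T \mu^* - \theta^T \mu^*
                      = \|\mu^* - y_I\|_1 ,

where the inequality follows by evaluating the max at the particular point $\mu = \mu^*$.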
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper provides theoretical results addressing the puzzling fact that test-set LP relaxations of learned structured models are tight more often than not, when intuitively one might expect otherwise. More specifically, models trained using LP relaxations for inference tend to be tight on the test set. The paper shows two related things: (1) that training with LP relaxations often drives the integrality gap down (excepting some pathological cases, such as the one shown in the paper), and (2) that the integrality gap can, in many circumstances, be expected to generalize to the test set in much the same way that accuracy does.

Clarity - Justification:
The paper is very easy to follow. Results are derived intuitively and with proofs, counter-examples are explored, and the work builds on previous results. The experimental results explore the space sufficiently well.

Significance - Justification:
This paper is the first to provide a theoretical explanation of this often-observed phenomenon. One can imagine future work trying to directly use the fact (first proved, AFAIK, in this paper) that integrality of solutions generalizes to the test set in interesting, nontrivial ways.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is clear, concise, and seems correct. It explains a puzzling observation made many times in the literature, and it leaves open avenues for future work.

=====