Paper ID: 445
Title: Structured Prediction Energy Networks

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose to learn energy functions for structured prediction applications by risk minimization applied to a bilevel optimization problem.

Clarity - Justification: See below.

Significance - Justification: See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Review summary. A well-written paper with a straightforward method. The experiments are limited in that they address only the simplest structured prediction problem, multi-label classification.

Review. The paper is well written and addresses an important problem: how to combine deep learning methods with structured prediction in an expressive and efficient way. To this end, the authors propose to use the energy-based learning framework and directly minimize the loss of argmin predictions of a parametrized energy function. No technical difficulties are involved, but no guarantees are provided either.

Two related works are absent from the discussion: (Chen et al., arXiv:1401.4107) and (Kunisch and Pock, 2012) already apply energy-based learning to structured prediction tasks in the generality proposed by the authors. Another related work that should be discussed in the context of Section 1, because it also predicts directly by gradient descent on learned quadratic energy functions, is (Jancsary et al., ICML 2013).

Section 3 is poorly written: the definition of the SPEN is an unguided sequence of energy functions defined on the fly. Are the energies summed? Why this particular decomposition? A data-flow figure or a computation graph could help give this section some structure.

The experiments are the weak part of the paper: although the authors claim full generality and applicability of their SPEN approach to structured prediction problems, only the simplest possible structured prediction setting, multi-label classification, is addressed. One troubling aspect is that all experiments involve loss functions that decompose linearly over the individual prediction variables. Consequently, the Bayes-optimal predictor for each task has a decomposable structure, i.e. it can be written as y_i = f_i(x) for all i, so that the predicted variables become conditionally independent given x (a short derivation follows at the end of this review). This is a fundamental point: as a model, a structured prediction model can only add value for non-decomposable loss functions; for decomposable loss functions, all a structured prediction model can add is an inductive bias. The authors note this in passing for one experiment in the caption of Table 3, but do not seem to realize the larger consequences. As a result, the method only achieves performance comparable to other common approaches, without offering any significant advantage.

Minor details.
Line 249 and (6) in line 303: f is used ambiguously. In (6) the meaning of f is left unspecified (do you mean F?), whereas in lines 249/279 it is a constant denoting dimensionality.
Line 263: the function is $g$, or in semi-lambda notation $g(\cdot)$, but not $g()$.
Line 388, (8).
Line 601: "jack-knifing" is ambiguous; do you perform leave-one-out estimates on the training set, or is the train-test split used to this end?
Line 603: "selected".
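To spell the point out, take the Hamming loss as a representative decomposable loss (this particular choice is mine for illustration; the losses used in the experiments are of the same decomposable form):

    \Delta(y, \hat{y}) = \sum_i \mathbf{1}[y_i \neq \hat{y}_i]
    \quad\Rightarrow\quad
    \mathbb{E}\big[\Delta(Y, \hat{y}) \mid x\big] = \sum_i P(Y_i \neq \hat{y}_i \mid x),

so the expected loss is minimized coordinate-wise by \hat{y}_i = \arg\max_v P(Y_i = v \mid x) =: f_i(x), with no interaction between the coordinates of \hat{y}.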
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a deep structured model. Prediction is based on energy minimization, where the energy function is parameterized by a neural network. The proposed method performs energy minimization using gradient-based optimization on a convex relaxation of the prediction problem. This differs from most previous approaches, which either use an inference algorithm directly for prediction, or unroll part of the computation graph of some inference algorithm and then use back-propagation to learn the model and inference parameters jointly. A specific such model is presented for the multi-label classification problem, and experimental results are shown on several benchmarks.

Clarity - Justification: The presentation is clear and I could follow the paper without much effort.

Significance - Justification: The idea of deep structured models using energy minimization via gradient descent is not novel; it was proposed by LeCun et al. (2006), and the paper is honest about this fact. I get the feeling that the proposed approach is overly specific, in the sense that it is a special case of the general framework of LeCun et al. (2006). It would be desirable to show that the approach is general by instantiating it in other settings, such as computer vision, instead of focusing only on multi-label classification with a specific architecture.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Comments:
* A notable drawback of the proposed method is that prediction at test time is not a fast feed-forward pass but a more expensive gradient-based optimization (a sketch of this follows at the end of this review). This again raises the question of how the gradient-based approach would fare in settings other than multi-label classification, where the prediction task is potentially harder.
* One interesting question that I think does not get enough attention in the paper is the effect of the choice of neural architecture. For example, can prediction accuracy be improved by adding more layers in (3) or (5)? Currently the network has only 2 layers, so calling it “deep” is not entirely justified.
* Another relevant citation for back-propagation through a limited number of inference steps, related to the mean-field baseline presented here: Mean Field Networks, Li & Zemel, ICML Workshop on Learning Tractable Probabilistic Models (2014).

Minor comments:
Line 195 - “a various domains”
Equation (8) - I think the [.]_+ operator can be removed (assuming, as usual, that \Delta(y, y) = 0)
Line 769 - “We also because”
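To make the cost concern concrete, here is a minimal sketch of per-example test-time prediction under this kind of scheme (plain NumPy with made-up shapes; the two-layer energy is only a stand-in for the paper's architecture, and a projected gradient step is used here instead of the authors' entropic mirror descent):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softplus(z):
        return np.log1p(np.exp(z))

    def energy_and_grad(y_bar, b, W1, W2):
        # Toy two-layer "global" energy on relaxed labels y_bar in [0,1]^L,
        # plus a linear "local" term -b.y_bar whose scores b come from the input features.
        h1 = W1 @ y_bar
        h2 = W2 @ softplus(h1)
        value = -b @ y_bar + np.sum(softplus(h2))
        # Hand-written back-propagation through the small energy network.
        grad = -b + W1.T @ (sigmoid(h1) * (W2.T @ sigmoid(h2)))
        return value, grad

    def predict(b, W1, W2, steps=50, lr=0.1):
        # Each test example pays for `steps` forward/backward passes through the
        # energy, rather than a single feed-forward pass as in a plain classifier.
        y_bar = np.full(b.shape[0], 0.5)               # start in the middle of [0,1]^L
        for _ in range(steps):
            _, g = energy_and_grad(y_bar, b, W1, W2)
            y_bar = np.clip(y_bar - lr * g, 0.0, 1.0)  # projected gradient step
        return (y_bar > 0.5).astype(int)               # round to a discrete labeling

    # Example call with arbitrary dimensions: 16 labels, hidden width 8.
    rng = np.random.default_rng(0)
    y_hat = predict(b=rng.normal(size=16), W1=rng.normal(size=(8, 16)), W2=rng.normal(size=(1, 8)))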
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This submission generalizes existing deep learning and structured prediction frameworks into what the authors refer to as `Structured Prediction Energy Networks' (SPENs). SPENs are energy-based models, non-convex in both the data x and the discrete/binary output space y. To deal with SPENs, the authors first relax the discrete output space. Inference then amounts to repeated back-propagation operations that provide gradient updates. Learning involves a double-loop algorithm that first performs loss-augmented inference and then takes a gradient step w.r.t. the parameters of the model. In the case of a discrete task-loss function, a differentiable surrogate is necessary.

Clarity - Justification: The paper is well written and easy to follow. At times the explanations are rather lengthy, and the space could be spent on more interesting topics.

Significance - Justification: Despite the computational complexity and the need for careful initialization, SPENs are attractive because possible correlations between pairs of variables don’t have to be modeled by the user; the proposed procedure learns them automatically. Moreover, the procedure easily handles large label spaces, which can be troublesome for, e.g., standard graphical models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Details:
1. When interpreting the relaxed output-space variables as probabilities, entropic mirror descent seems reasonable. Nonetheless, I am wondering whether the authors compared it to classic projected gradient descent (a sketch of both updates follows this review).
2. How did the authors measure convergence of the inference optimization?
3. An interested reader might want to read a little more about the way discrete task-losses are employed in the presented framework. Could the authors comment?

Minor comments:
- it should be \min_{\bar{y}} in Eq. 2
=====
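To make question 1 concrete, these are the two inner-loop updates I have in mind for the relaxed variables \bar{y} \in [0,1]^L (a minimal NumPy sketch; the step-size schedule and the exact mirror map used by the authors may differ):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logit(p, eps=1e-8):
        p = np.clip(p, eps, 1.0 - eps)
        return np.log(p) - np.log(1.0 - p)

    def entropic_mirror_descent_step(y_bar, grad, lr):
        # Mirror descent with the binary-entropy mirror map: the gradient step is
        # taken in logit space, so the iterate stays strictly inside (0,1)^L
        # without an explicit projection.
        return sigmoid(logit(y_bar) - lr * grad)

    def projected_gradient_step(y_bar, grad, lr):
        # Euclidean gradient step followed by projection onto the box [0,1]^L.
        return np.clip(y_bar - lr * grad, 0.0, 1.0)

Here `grad` is the gradient of the (loss-augmented) energy at the current iterate, obtained by back-propagation. A comparison of the two updates on the reported benchmarks would be informative.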