#Hardness/Simplicity of Multi-Label Classification#
We disagree with the reviewers’ perspective that multi-label classification (MLC) is ‘simple’ structured prediction. MLC is more complicated than something like a grid-structured problem, since it requires both structure and parameter learning. With grids, the topology of the variables provides a strong inductive bias about their interactions; no such structure is available for MLC. Trading off model expressivity vs. parsimony is also more difficult for MLC than for grids, since MLC does not allow for parameter tying: for grids, tying means the number of parameters to estimate typically does not increase with the number of variables.

#Breadth of the Experiments#
Many structured prediction papers present experiments that consider only a single application domain and problem structure. Since automatic structure learning is a key capability of SPENs, we chose to focus our experiments on MLC. This is analogous to how many deep learning papers have focused on individual domains where automatic feature learning is important, e.g., speech recognition. MLC alone is a problem of broad interest and applicability to the ICML community.

**Reviewer 1**

#Mirror Descent vs. Projected Gradient#
An advantage of mirror descent is that its iterates never reach the boundary of the simplex. This is necessary when training with the log loss as our differentiable surrogate, since it diverges at the boundary. At test time, we could switch to projected gradient descent; in our experiments, the speed of the two methods is comparable. The theoretical advantage of entropic mirror descent’s convergence rate over projected gradient is not particularly relevant here, because we optimize over a cross product of very low-dimensional simplices (for MLC, each is 1-dimensional). A minimal sketch of the update appears below.

#Measuring Convergence#
We tracked relative changes in the optimization objective and absolute changes in the iterates’ values, and terminated if either of these fell below a small threshold.

#Discrete Task Losses#
We use F1 at test time. Here, we round each coordinate of \bar{y} from [0,1] to {0,1} using a threshold tuned on dev data (see the sketch below). Such thresholding is general and can be used for all discrete test-time losses.

**Reviewer 2**

#Related Work#
Thanks for the pointers. We will certainly add a discussion of these 3 papers to the camera-ready. The primary advantage of our technique over the first two references is that it makes it much easier for users to quickly explore different architectures, since SPENs do not require new architecture-specific derivations. Jancsary et al. is similar to SPENs, but does not generalize naturally to modeling higher-order interactions among variables.

#Sec. 3#
Thanks for the suggestion to improve the organization. Yes, the terms are summed. We chose this architecture because it parallels standard structures for CRFs, which contain both potentials on multi-variable cliques and unary potentials on individual variables.

#Decomposable Loss Functions#
As you note, our primary motivation for using a structured energy function is to provide additional inductive bias, which is clearly important even for problems with decomposable loss functions. A wide variety of important problems fall into this category. Also, our technique works for non-decomposable loss functions that have differentiable surrogates.
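Returning to Reviewer 1’s questions about inference: below is a minimal sketch of the entropic mirror descent update and the convergence check we describe above. It is an illustration, not our exact implementation; the function names, step size, and the generic `energy`/`grad_energy` callables (in practice the gradient comes from backpropagation through the energy network) are assumptions.

```python
import numpy as np

def predict_entropic_md(energy, grad_energy, y0, lr=0.1, max_iters=100, tol=1e-4):
    """Entropic mirror descent over [0,1]^L, viewing each coordinate as a
    1-dimensional simplex (y_i, 1 - y_i). Hypothetical sketch only.

    energy:      y -> scalar E(y)   (from the energy network)
    grad_energy: y -> dE/dy         (in practice, via backprop)
    y0:          initial iterate strictly inside (0,1)^L, e.g. all 0.5
    """
    y = np.asarray(y0, dtype=float)
    prev_obj = energy(y)
    for _ in range(max_iters):
        g = grad_energy(y)
        # Exponentiated-gradient update on each coordinate's simplex; it is
        # equivalent to y <- sigmoid(logit(y) - lr * g), so iterates stay
        # strictly inside (0,1) and a log-loss surrogate never diverges.
        p = y * np.exp(-lr * g)
        y_next = p / (p + (1.0 - y))
        obj = energy(y_next)
        # Terminate on a small absolute change in the iterate OR a small
        # relative change in the objective.
        if (np.max(np.abs(y_next - y)) < tol
                or abs(obj - prev_obj) < tol * (abs(prev_obj) + 1e-12)):
            return y_next
        y, prev_obj = y_next, obj
    return y
```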
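Similarly, for the thresholding described under #Discrete Task Losses#, here is a small sketch of tuning the rounding threshold on dev data. The candidate grid and the use of scikit-learn’s example-averaged F1 are illustrative assumptions, not necessarily our exact setup.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_bar_dev, y_true_dev, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the rounding threshold that maximizes F1 on dev data.
    The grid and the 'samples' (example-averaged) F1 are illustrative choices."""
    best_tau, best_f1 = 0.5, -1.0
    for tau in grid:
        f1 = f1_score(y_true_dev, (y_bar_dev >= tau).astype(int), average="samples")
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau

# At test time: y_pred = (y_bar_test >= tau).astype(int)
```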
#Jackknife Sampling#
We sampled a new train-test split for tuning by pooling the train and test files and sampling without replacement to produce a split of the same size as the original. We used the wrong terminology for this, and will fix it (a small sketch of the procedure appears at the end of this response).

**Reviewer 3**

#Prior Work#
We were eager to acknowledge the influential framework of the LeCun et al. (2006) survey/tutorial. Unfortunately, doing so conveyed an inaccurate impression about the novelty of our contribution. The idea of using gradient-based methods for structured prediction is mentioned only in a single sentence on page 5. Instead, LeCun advocates parametrizing the potentials of factor graphs, dynamic time warping, etc. using deep architectures, and then applying special-case inference methods such as BP. Our energy network is similar to LeCun’s ‘implicit regression’ or siamese networks; however, applying this to structured prediction, or performing implicit regression via gradient descent, is novel.

#Iterative Computation at Test Time#
Many popular structured prediction techniques perform iterative optimization of a non-convex energy, including BP and mean-field. The iterative nature of SPENs is manageable both conceptually and practically; our timing numbers support this.

#Deepness#
Adding more layers to the energy is trivial in terms of code. We did not make our energy functions deeper simply because we had limited training data, and an overly expressive energy function resulted in substantial overfitting.

#Li & Zemel#
We will add a citation. Thanks for the pointer.
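Finally, for clarity on the resampling procedure described under #Jackknife Sampling#, a minimal sketch (array-based pooling and numpy are assumptions; our actual scripts may operate on the data files directly):

```python
import numpy as np

def resample_split(train_X, train_Y, test_X, test_Y, seed=0):
    """Pool the original train and test sets, then draw a new train/test split
    without replacement, with the same sizes as the original. Hypothetical sketch."""
    X = np.concatenate([train_X, test_X])
    Y = np.concatenate([train_Y, test_Y])
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(X))
    n_train = len(train_X)
    tr, te = perm[:n_train], perm[n_train:]
    return X[tr], Y[tr], X[te], Y[te]
```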