We thank the reviewers for their comments and address their questions and concerns below.

To R1 w.r.t. novelty of Theorem 1:
- We would like to clarify that we do not use standard backpropagation: the task loss functions we are interested in are non-smooth and non-decomposable, and thus not amenable to gradient-based methods.
- McAllester et al.'s original theorem cannot be applied to neural networks because their proof depends strongly on the linearity of the scoring function. Lemmas 1 and 2 in our paper are the key elements for proving the non-linear case.
- The proof is technically dense, but the basic idea is to exploit the definitions of gradients and expectations, which is why our proof and McAllester's look similar at a superficial level.
- We will follow your suggestion and discuss the theorem more carefully in the main text; a minimal sketch of the resulting update rule follows this list.
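To make the update concrete, here is a minimal sketch written for a linear score w . phi(x, y) for readability; the point of Theorem 1 is that the same finite-difference rule remains valid when phi is replaced by the gradient of a non-linear (deep) scoring function. All names (phi, task_loss, labels, eps) are illustrative, not our actual implementation:

```python
import numpy as np

def direct_loss_step(w, phi, x, y_true, task_loss, labels,
                     eps=0.1, lr=0.01, positive=True):
    """One direct-loss-minimization step for a linear score w . phi(x, y).

    positive=True perturbs inference towards high-loss outputs (the
    'positive' update); positive=False perturbs away from them (the
    'negative' update).
    """
    sign = 1.0 if positive else -1.0
    # Standard inference: highest-scoring output under the current model.
    y_w = max(labels, key=lambda y: w @ phi(x, y))
    # Loss-perturbed inference: score plus/minus eps times the task loss.
    y_dir = max(labels, key=lambda y: w @ phi(x, y)
                + sign * eps * task_loss(y, y_true))
    # Finite-difference estimate of the gradient of the expected task loss.
    grad = sign * (phi(x, y_dir) - phi(x, y_w)) / eps
    return w - lr * grad

# Toy usage: 3 classes, one-hot joint feature map, 0-1 task loss.
labels = [0, 1, 2]
phi = lambda x, y: np.concatenate([x if y == k else np.zeros_like(x)
                                   for k in labels])
task_loss = lambda y, y_true: float(y != y_true)
w = direct_loss_step(np.zeros(6), phi, np.array([1.0, -1.0]), 0,
                     task_loss, labels)
```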
To R1 w.r.t. why hinge loss is sensitive to noise: The hinge loss of each outlier is much larger than the cost measured by the 0-1 loss, which is at most 1; a mislabeled point with a score margin of -9, for instance, incurs a hinge loss of 10. This is why the gap between the hinge surrogate and the target loss is large in noisy settings; the short numeric sketch at the end of this letter makes the comparison concrete.

To R1 w.r.t. an old training algorithm applied to neural networks: We respectfully disagree: enabling an approach that could previously be used only for linear models to work with neural networks is, we believe, a contribution. Direct loss minimization has not yet had a significant impact in terms of broad ML applications, but we believe our work gives it the potential to do so.

To R1 w.r.t. what's new compared to McAllester's experiments:
- We generalized the theory and applied it to neural networks, whereas the original work could only be employed for simple linear models.
- We found that the positive update performs better than the negative one, contrary to statements in the original paper, and we include empirical explanations for this phenomenon in the supplementary material.
- The original paper applied direct loss minimization to a phoneme alignment problem on the TIMIT corpus and compared against a few published benchmark results. Our experiments are considerably more extensive, involving an analysis of the method on synthetic data and two different real applications with extensive comparisons to related approaches.

To R1 w.r.t. significance of the dynamic programming algorithm: Without this algorithm we could not have examined the negative update of direct loss minimization, and the work would have been incomplete. Moreover, we think the flexibility of this new algorithm is a contribution in its own right. We could reduce the space dedicated to it if the reviewers and AC feel that this would benefit the paper.

To R2 w.r.t. source code: We already have an integration with Caffe and will release all our source code upon acceptance. We were inspired by computer vision applications, where we believe this work can have a large impact.

To R2 w.r.t. normalized initialization: Our results are fair, as all methods use the same initialization. Thanks for the suggestion; we expect normalized initialization to affect all methods in a similar manner.

To R2 w.r.t. Fig. 3: We meant the number of iterations over the training set.

To R2 w.r.t. citation: Thanks! We will correct it.

To R2 w.r.t. the action classification task: We used ground-truth bounding boxes for this task and will state this explicitly in the paper.

To R2 w.r.t. an intuitive explanation: Please see our response to R1.

To R3 w.r.t. the synthetic experiment: The reviewer's intuition is correct: the data-generating process indeed makes the ranking task more difficult.
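As promised in our response to R1 on noise sensitivity, the following self-contained snippet (with made-up scores, purely for illustration) shows how a single outlier dominates the hinge surrogate while contributing at most 1 to the 0-1 loss:

```python
def hinge(y, score):
    # Binary hinge loss with label y in {-1, +1}.
    return max(0.0, 1.0 - y * score)

def zero_one(y, score):
    # Task (0-1) loss: 1 on a misclassification, 0 otherwise.
    return float(y * score <= 0)

# One correctly classified point and one mislabeled outlier far on the
# wrong side of the margin (scores are hypothetical).
for name, score in [("clean point", 2.0), ("outlier", -9.0)]:
    print(name, "hinge:", hinge(+1, score), "0-1:", zero_one(+1, score))
# clean point  hinge: 0.0   0-1: 0.0
# outlier      hinge: 10.0  0-1: 1.0  <- the surrogate overweights it 10x
```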