Paper ID: 997
Title: Training Deep Neural Networks via Direct Loss Minimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This work shows that the direct loss minimization approach of McAllester et al. (2010) can also be applied to neural networks. The authors present a dynamic program for applying loss-augmented inference to average precision that allows for both "positive" and "negative" direct updates. They evaluate the direct method on synthetic data and two VOC datasets.

Clarity - Justification:
The dynamic programming section is a bit hard to follow. Also, it's not clear how much of the proof of Theorem 1 is novel versus simply a rote application of the McAllester work with F in place of phi.

Significance - Justification:
This paper is about average precision. The significance of the dynamic program isn't really great, because its runtime is the same as prior work, and it turns out that in practice the "positive" update is the useful one (which was already possible).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I'm on the fence about this paper. As I see it, the strengths are (in order of impact):
1. Showing that McAllester's result generalizes to non-linear optimization.
2. Showing that the direct approach works well in the presence of noise, on real and synthetic data.
3. Showing that the positive update is empirically better than the negative one.
4. The dynamic program for AP.

The corresponding weaknesses are:
1. I'm not sure how novel this is -- the proof is very technical, but superficially it looks very similar to the one in McAllester, with a lot more notation. I don't really consider myself qualified to judge the correctness of the proof, but given that we're talking about gradients, the fact that the gradient is the same regardless of whether the scoring function is linear seems pretty obvious. E.g.,
via the chain rule, you'd imagine that you could backpropagate through a linear layer into the rest of a deep network; do you really need a new proof for this case? I would really appreciate some clarification from the authors on why Theorem 1 is important and what was required to prove it. I'm willing to be convinced that this is something non-obvious, and that it will convince people to try direct loss minimization for neural networks who wouldn't otherwise. If it really is important, then a significant bulk of the paper should be devoted to making this point.

2. The experiments are nice, but there's a lack of discussion. Intuitively, it makes sense that the method would be robust to noise, but can you provide some examples, theory, or ablative experiments to tease apart exactly why the direct method works where the hinge version does not? Also, aside from Theorem 1, there's nothing specific or unique to neural networks in this work; to be honest, this result is really just a straightforward application of existing techniques to neural networks. It's not a "new training algorithm"; it's an old training algorithm applied to neural networks. It works better than Yue et al., but is this a new state of the art? We already knew from McAllester that direct loss minimization works. What specifically do the experiments show that's new? If the accuracies are a new state of the art, there should be comparisons to more baselines from other papers on this dataset. If this paper were mainly empirical, one would expect significantly more experiments, baselines, etc. Are there other noisy domains in which NNs are used where this would help?

3. This is a nice point, but not enough for an entire paper without additional contributions. It also diminishes the impact of the dynamic program.

4. The dynamic program takes up a lot of space, but its impact is pretty small.
As pointed out in the paper, we can already optimize AP indirectly using the techniques of Yue et al., so the space devoted to it seems disproportionate.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces an extension of a method for optimizing non-continuous or non-decomposable losses to general scoring functions, such as those produced by a deep neural network. The method is applied to a ranking loss, for which an inference procedure and gradient are derived. The newly proposed direct loss minimization method is applied to two image recognition problems (action classification and object detection), compared to standard methods based on surrogate losses, and shown to yield consistently higher performance. Derivations of the direct loss gradient for general scoring functions are provided in the appendix.

Clarity - Justification:
The paper is well structured, and its contributions are clearly stated.

Significance - Justification:
The expression for the gradient of a general loss function applied to general scoring functions (provided the derivation is correct) is a highly significant contribution. Currently, most supervised learning problems make use of surrogate losses, which are suboptimal and must be selected by hand. Direct loss minimization avoids these problems altogether.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The proof of Theorem 1 is complex, and I wasn't able to verify it in reasonable time. The authors first test their method on a synthetic ranking dataset, where the ranking between examples is given by the output score of a random neural network and another neural network is trained to match this ranking.
For this synthetic experiment, I am wondering whether the neural-network-based data generation process causes some parts of the ranking to be considerably more difficult to predict, due to some ranges of output scores of the initial network being more densely sampled. In the next experiments, the authors consider large image recognition datasets, where direct loss gradients are backpropagated into large pretrained convolutional networks. The experiments cover a rich set of minimization methods (perceptron-based, hinge loss, direct loss) and losses (0/1, AP, cross-entropy). The training procedure and parameter selection for each trained model are clearly described. The authors also vary the percentage of label noise and demonstrate the particular efficiency of direct loss minimization on problems with a certain amount of label noise, which is an interesting insight.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper shows how to use non-decomposable loss functions for deep neural network training. Standard learning of these networks is mostly done with a surrogate loss, which can be represented as a sum of individual loss terms, one per training example. I think the paper is great work and might indeed have important impact, especially for the vision community. The authors should release their source code and provide an integration into standard deep learning toolboxes, such as Caffe.

Clarity - Justification:
The method and the ideas behind the optimization are well described. However, the description of the experimental results could be improved.

Significance - Justification:
Being able to learn with non-decomposable loss functions has wide applications, especially for vision tasks. The approach works particularly well in the case of label noise, which is an interesting observation and an important scenario.

Detailed comments.
(Explain the basis for your ratings while providing constructive feedback.):
- In lines L297-L303 the authors list important baselines for directly minimizing average precision. It would be interesting to see a comparison to these approaches. However, I know this might be difficult to establish, since the other methods do not allow for joint deep neural network training.
- L491-L492: "Normalized initialization" is usually used to initialize deep neural networks, since it has been shown to have a significant impact on optimization performance. The authors do not perform such a normalization, and I wonder what effects this might have.
- Fig. 3: "number of iterations on the test set"?
- L759: The authors should cite Krizhevsky et al. rather than Russakovsky.
- It is unclear to me whether the authors use the ground-truth bounding boxes for their action classification task (Sect. 4.2). This should be clarified.
- Are there any qualitative results (detection or classification) that reveal some intuition about the robustness of the approach with respect to label noise?

=====
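For context on the metric these reviews repeatedly reference: average precision over a ranked list of binary relevance labels is the mean, over the relevant items, of the precision at each relevant item's rank. A minimal generic sketch (not the authors' code, which optimizes AP via a dynamic program rather than merely evaluating it):

```python
def average_precision(ranked_labels):
    """AP of a ranked list of binary relevance labels (1 = relevant):
    mean over relevant items of precision at each such item's rank."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0
```

Because AP depends on the entire ranking at once, it does not decompose into per-example terms, which is exactly why a surrogate loss or a direct method like the paper's is needed.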