Paper ID: 257
Title: Large-Margin Softmax Loss for Convolutional Neural Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
For training convolutional neural networks (CNNs), this paper proposes replacing the commonly used softmax loss (cross-entropy loss with softmax) with a generalized large-margin softmax loss, called L-Softmax, which encourages intra-class compactness and inter-class separability in the learned features. The L-Softmax loss is based on introducing an angular margin to the angle between a sample and the weight vector of its target class.

Clarity - Justification:
Except for the incorrect citation of some figures/tables, the paper is generally easy to follow. The proposed method is quite easy to understand.

Significance - Justification:
Although the notion of an angular margin has been proposed before (e.g., for the maximum vector-angular margin classifier by Hu et al. in Neural Networks, 2012), the formulation here is different and appears to be novel. From the experiments, the L-Softmax loss seems to be quite effective and is easy to implement. Although it is proposed for CNNs, it is in fact not limited to deep learning models; for example, it could also be used for conventional neural network and logistic regression models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Is it really necessary to restrict m in Eq. (3) to an integer greater than 1? Could it be any real number greater than 1? Is the current restriction imposed only to make it easier to design the \psi function in Eq. (5)? (My reading of the construction is sketched below.)
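For concreteness, my understanding of the construction is the following (a sketch only; the notation in the paper may differ slightly): the softmax loss is rewritten via W_j^T x_i = \|W_j\| \|x_i\| \cos(\theta_j), and Eq. (3) replaces \cos(\theta_{y_i}) for the target class by the margin function \psi of Eq. (5):

L_i = -\log \frac{\exp(\|W_{y_i}\| \|x_i\| \psi(\theta_{y_i}))}{\exp(\|W_{y_i}\| \|x_i\| \psi(\theta_{y_i})) + \sum_{j \neq y_i} \exp(\|W_j\| \|x_i\| \cos(\theta_j))}

\psi(\theta) = (-1)^k \cos(m\theta) - 2k, \quad \theta \in [k\pi/m, (k+1)\pi/m], \quad k \in \{0, 1, \dots, m-1\}

With an integer m, this piecewise definition keeps \psi continuous and monotonically decreasing on [0, \pi]. If a non-integer m > 1 were allowed, a different monotone surrogate would presumably be needed; it would be worth clarifying whether that is the only obstacle.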
Last paragraph of Section 3.3 (second and third cases in Figure 4): the analysis is not very clear.
#375: should be \theta_1 < \theta_2
#377: should be m \theta_1 < \theta_2
#651 & #653: which figure are you referring to?
#697: should be Table 3
Since the proposed method can also be applied to other models, the authors are encouraged to add some experiments with conventional models such as logistic regression.
Some language errors:
#139: “current softmax loss do not …”
#146: “decision rule of at testing …”
#207: “Our experimental validates …”
#224: “loss input …”
#276, #285, etc.: “more rigor”
#366: “Geometry Interpretation”
#368: “a angle margin …”
#524: “generalize softmax loss”
#760, #766, etc.: “filter number” (should be “number of filters”)
and more.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a large-margin variant of the softmax loss function. The proposed generalized softmax loss changes the decision boundary of the ground-truth region adaptively during training so that the margin between the ground-truth region and those of the negative classes becomes large. The proposed method is evaluated on several visual recognition benchmarks, showing consistent improvement over the baseline.

Clarity - Justification:
The paper is well written. The method, experimental evaluation protocol, and results are mostly clearly presented.

Significance - Justification:
The softmax loss is arguably the most frequently used loss function for training deep neural networks, and the proposed generalized softmax formulation shows consistent improvement over the standard softmax loss across a variety of datasets and network architectures. A comparison to existing regularization methods (e.g., dropout, L2 regularization) could strengthen the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The generalized softmax loss is proposed in this paper. The paper provides a good interpretation of the proposal in comparison to the baseline softmax loss, supported by strong empirical results. Specifically, the proposed loss function shows consistent performance improvements on several visual recognition benchmarks with (slightly) different network architectures. I only have a few questions and comments:
1. The gradient derivation in lines 465-466 is not precise: a summation over the f_j's is needed in the gradient of L_i w.r.t. x_i or W_{y_i} (there is also a typo).
2. How did you perform cross-validation on MNIST, CIFAR-10, and CIFAR-100?
3. Does L2 weight regularization on the softmax weights affect the performance of the L-Softmax loss?
4. Does the performance break down when m becomes larger than 4? It would be good to comment on this.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper claims that the cross-entropy loss causes over-fitting, since it does not encourage intra-class compactness and inter-class separability. A new loss function is proposed as an improvement over the original cross-entropy loss. The authors justify the improvement with thorough analysis, geometric interpretation, and visualization. Sufficient experiments were conducted to verify their claims.

Clarity - Justification:
- The authors claim the new loss is for CNNs, but in the paper I failed to see any connection between this new loss and CNNs. It looks to me that this new loss is quite general, applicable to any classifier with softmax outputs, such as an MLP or softmax regression, rather than specialized for CNNs.
- I failed to see the relevance of Figure 1 to the claims of this paper.
- Typo in lines 398-399: "||W1|| > ||W2|| and ||W1|| > ||W2||".
- "more rigor requirement" should be changed to "stronger requirement".

Significance - Justification:
The improvement compared to the original softmax loss is marginal.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- The study of loss functions for neural networks is both theoretically interesting and valuable to practitioners. I refer the authors to the following papers for more relevant work:
  Objective functions for training new hidden units in constructive neural networks (Kwok et al., 1997)
  Is L2 a Good Loss Function for Neural Networks for Image Processing? (Zhao et al., 2015)
- A technical problem lies in Eqs. (5) and (6): what if \theta falls outside the range defined by (5) and (6)? Does the result still hold?
- The emphasis on CNNs in the paper is problematic. Although CNNs with softmax loss are indeed common, CNNs have no theoretical connection with the new loss function. It looks to me that this new loss function can be directly applied to any other classifier with softmax outputs, and readers would be interested in seeing how well it performs with classifiers other than CNNs (a rough sketch of what I have in mind is given at the end of this review).
- The following work, which addresses a similar issue, is neither compared against nor mentioned:
  Deep Learning using Linear Support Vector Machines (Tang, 2013)
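To make the previous point concrete, here is a rough sketch of what I have in mind (my own code, not taken from the paper; the numpy implementation, the absence of a bias term, and my reading of the \psi in Eq. (5) are all assumptions on my part): the margin can be plugged directly into plain softmax regression.

import numpy as np

def psi(theta, m):
    """Piecewise margin function as I read Eq. (5):
    psi(theta) = (-1)^k cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1)  # k in {0, ..., m-1}
    return ((-1.0) ** k) * np.cos(m * theta) - 2.0 * k

def large_margin_softmax_loss(W, X, y, m=2, eps=1e-8):
    """Mean large-margin softmax loss for a plain linear classifier (softmax
    regression without a bias term), with the angular margin applied only to
    the target-class logit.  W: (num_classes, dim), X: (n, dim), y: (n,) labels."""
    logits = X @ W.T                                   # f_j = ||W_j|| ||x_i|| cos(theta_j)
    w_norm = np.linalg.norm(W, axis=1)                 # ||W_j||
    x_norm = np.linalg.norm(X, axis=1)                 # ||x_i||
    rows = np.arange(X.shape[0])
    scale = w_norm[y] * x_norm + eps
    cos_t = np.clip(logits[rows, y] / scale, -1.0, 1.0)
    theta = np.arccos(cos_t)                           # angle to the target-class weights
    logits_m = logits.copy()
    logits_m[rows, y] = scale * psi(theta, m)          # replace cos(theta_y) by psi(theta_y)
    logits_m -= logits_m.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits_m - np.log(np.exp(logits_m).sum(axis=1, keepdims=True))
    return -log_prob[rows, y].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 5))
    X = rng.standard_normal((8, 5))
    y = rng.integers(0, 3, size=8)
    print("softmax loss (m=1):  ", large_margin_softmax_loss(W, X, y, m=1))
    print("L-Softmax loss (m=4):", large_margin_softmax_loss(W, X, y, m=4))

Even a small experiment of this kind, compared against standard softmax regression (m = 1 recovers it, up to numerical details), would substantiate the generality claim.
=====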