The authors would like to sincerely thank all the reviewers for their valuable suggestions. We are also deeply grateful that many typos were pointed out; we apologize for them and will thoroughly proofread the final paper. We noticed the concerns from Rev #5. There appear to be some significant misunderstandings, and we hope our response clarifies them so that the reviewer can reconsider the conclusions drawn from these misinterpretations.

Rev #3

Comparison to existing regularizations: Yes, we will supplement additional experiments. In fact, we did try an L2 + dropout baseline, whose result is not as good as ours.

Q1: The mistake has been corrected.

Q2: The learning rate and iteration number are determined by a train/val split of the training set. When training for LFW verification, we use one training sample per person in WebFace for validation. For all other datasets, we adopt a 4:1 train/val split.

Q3: All experiments are performed with a weight decay of 0.0005 for the L2 regularization, which is the default value in many works and in Caffe. Experiments show that the weight decay only slightly affects performance; more importantly, we found that it influences Softmax and L-Softmax in almost the same way.

Q4: The performance does not break down; it keeps improving with larger m, but the gain becomes marginal. As mentioned in lines 328/329 and Eq. (7), back-propagation requires cos(m\theta_j) to be decomposed into a polynomial of cos(\theta_j), since cos(\theta_j) = (w_j^T x) / (||w_j|| * ||x||) is an explicit function of w_j and x. A larger m directly increases the complexity of back-propagation, and m = 4 is a good tradeoff (see the short expansion sketch at the end of this response).

Rev #4

As mentioned above, the reason is related to back-propagation. With integer m we can decompose cos(m\theta_j) into a polynomial of cos(\theta_j) (Eq. (7)), which facilitates back-propagation (again, see the sketch at the end of this response). We will also improve the analysis in Sec. 3.3.

L-Softmax for other models: L-Softmax can indeed be used in conventional models with feature learning. The reviewer may also refer to our response to Rev #5.

Rev #5

The authors would like to sincerely thank the reviewer for carefully looking into this work. Although we disagree with some of the comments, we still very much appreciate receiving different opinions to improve the paper.

Range of \theta: There is no technical problem here. Eqs. (5) and (6) already define the cost for \theta within [0, pi], and the angle between two vectors cannot be greater than pi.

Significance: We would greatly appreciate it if the reviewer could elaborate on why the contribution is considered marginal. In their significance justifications, Rev #3 and Rev #4 summarized what we also believe. We do not think the improvement is trivial, considering its potential impact on deep learning research.

L-Softmax for other models and connections between L-Softmax & CNN: The authors are aware of this generality and fully understand the reviewer's concern. More importantly, however, the proposed loss is not over-claimed in this paper. From the very beginning, this paper is clearly targeted as a deep learning work, and we have been careful not to over-claim our contributions by including conventional models (although the loss can be used with conventional models). Considering the state-of-the-art performance of deep CNNs in large-scale recognition, we feel that many readers will be very interested in pure deep learning works, and this work shares a similar style with many previous DL papers at ICML and NIPS. This brings us to the issue of "Why CNN?". With all due respect, the authors cannot agree that the "emphasis on CNN is problematic".
The key reason why L-Softmax works considerably better than softmax is feature learning, rather than classification. Deep CNNs are not just about classification; they are also about feature learning. Requiring the learned features to have a large inter-class angular margin is very difficult, which is exactly why one needs the strong representation ability of a deep CNN. We have made this very clear in the paper, and we have tried to emphasize and justify this claim with visualizations of the learned features (see Figs. 2 & 5). We agree with the reviewer's concern, but it may be unfair to deny the contribution of this work on that basis alone.

Comparison to Tang 2013: We are aware of this work. In fact, we already have baselines with exactly the same CNN architecture except for a linear SVM loss on top. This can be easily implemented in Caffe, whose hinge loss layer implements exactly this objective (a minimal sketch is given below). Beyond the fact that that work differs considerably from ours, we did not include it because it does not outperform softmax, even though we tuned the baseline very carefully to make it as strong as possible. The baseline yields error rates of 0.47%, 9.91%, 6.96%, and 32.9% on MNIST, CIFAR10, CIFAR10+, and CIFAR100, respectively. We are happy to include the citation and results in the final paper.
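For concreteness, below is a minimal NumPy sketch of the one-vs-rest squared hinge (L2-SVM) loss in the spirit of Tang 2013, which is roughly what Caffe's HingeLoss layer computes with its L2 norm option; the function name and implementation here are ours for illustration only, not the exact baseline code.

```python
import numpy as np

def l2_svm_loss(scores, labels):
    """One-vs-rest squared hinge (L2-SVM) loss over a mini-batch.

    scores: (N, K) array of class scores from the last fully connected layer.
    labels: (N,) array of integer class labels in [0, K).
    Returns the mean loss over the batch.
    """
    n, k = scores.shape
    # Encode targets as +1 for the ground-truth class and -1 for all others.
    targets = -np.ones((n, k))
    targets[np.arange(n), labels] = 1.0
    # Squared hinge: penalize any class whose signed margin falls below 1.
    margins = np.maximum(0.0, 1.0 - targets * scores)
    return np.mean(np.sum(margins ** 2, axis=1))
```

In the baseline, this loss simply replaces the softmax cross-entropy on top of the otherwise unchanged CNN architecture.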
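Regarding the back-propagation point raised in Q4 (Rev #3) and by Rev #4, the complexity argument can be made concrete with the standard multiple-angle identities; these are textbook identities meant only to illustrate the form of the Eq. (7) decomposition, not a restatement of it.

```latex
\cos(2\theta_j) = 2\cos^2\theta_j - 1, \qquad
\cos(3\theta_j) = 4\cos^3\theta_j - 3\cos\theta_j, \qquad
\cos(4\theta_j) = 8\cos^4\theta_j - 8\cos^2\theta_j + 1,
\quad \text{where } \cos\theta_j = \frac{\mathbf{w}_j^{\top}\mathbf{x}}{\lVert \mathbf{w}_j \rVert \, \lVert \mathbf{x} \rVert}.
```

Each increment of m adds higher-order terms in cos(\theta_j), and hence in (w_j^T x) / (||w_j|| * ||x||), whose gradients must all be propagated, which is why m = 4 is a reasonable tradeoff between the margin and the back-propagation cost.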