Paper ID: 129
Title: Multi-Bias Non-linear Activation in Deep Neural Networks

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a simple twist by adding a "multi-bias" non-linear activation (MBA) layer. The idea is simply to replicate each feature map and learn a set of K biases, one for each replicated copy (a toy sketch of this construction is included at the end of this review). This works well on CIFAR-10/-100 and MNIST. The authors also analyze and compare/contrast the method with similar approaches.

Clarity - Justification:
The paper is well written, with a clear story, and is well justified.

Significance - Justification:
It is surprising that such a simple trick works so well, but because it is so simple and easy to try, if it does turn out to be effective it will become a popular technique.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- I am surprised that the authors claim the technique is "low cost in both the number of parameters and computation." It is certainly low cost in terms of parameters, but given that each bias creates a new feature map, doesn't having K=4 increase the number of convolution applications by 4x for the layer above?
- It would be nice to compare the computational cost of MBA vs. APL more thoroughly.
- Is the MNIST result the best the authors obtained with MBA?
- I believe the fractional max-pooling paper (http://arxiv.org/abs/1412.6071) gets slightly better results on CIFAR-10, but it is likely harder to implement than the MBA method.

All in all, the results seem compelling and the analysis seems solid.
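To make the construction described in the summary concrete, here is a minimal sketch (my own illustration under the simplest reading of the paper, not the authors' code) of an MBA-style layer in PyTorch: each of the C input maps is replicated K times, every copy receives its own learned bias, and a ReLU follows, so the layer outputs K*C maps. The module name, the default K=4, and the zero initialization of the biases are assumptions.

import torch
import torch.nn as nn

class MultiBiasActivation(nn.Module):
    def __init__(self, num_channels, k=4):
        super().__init__()
        self.k = k
        # One learned bias per (replica, channel) pair: only K*C extra parameters.
        self.bias = nn.Parameter(torch.zeros(k, num_channels))

    def forward(self, x):  # x: (N, C, H, W)
        n, c, h, w = x.shape
        x = x.unsqueeze(1).expand(n, self.k, c, h, w)       # replicate each map K times
        x = x + self.bias.view(1, self.k, c, 1, 1)          # shift each copy by its own bias
        return torch.relu(x).reshape(n, self.k * c, h, w)   # K*C output feature maps

Whatever convolution follows now sees k * num_channels input maps, which is exactly where the "4x more convolution applications" question above comes from.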
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a new activation function for deep convolutional neural networks, which adds multiple biases to every feature map before a regular ReLU. The added biases are treated as parameters and learned through back-propagation. The paper also shows promising results on CIFAR-10 and CIFAR-100.

Clarity - Justification:
Overall, the paper is easy to follow. However, more explanation of the notation is needed. For example, the variables N, K, and M on page 3 and w' on page 4 are used without any definition, which is confusing. Please also make the font size larger in the figures, such as Figures 4 and 5.

Significance - Justification:
From my perspective, the technical contribution of this paper is an incremental advance over the adaptive piecewise linear (APL) activation function (Agostinelli et al., 2015). The proposed MBA differs in that it learns a different piecewise linear activation function for every feature map, while APL learns the same activation function.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The proposed MBA is well motivated in that a uniform bias for a single feature map might not be sufficient to extract all patterns. A standard CNN with more feature maps is capable of addressing this issue, but it requires more convolutional kernels. Adding more biases to generate multiple "shifted" feature maps, as in MBA, is a cheap way to alleviate this problem. However, the experimental part of this paper leaves something to be desired. Here are my concerns:
- Results on CIFAR-10 and CIFAR-100 look impressive, but they are not enough. Results on more datasets, such as SVHN and ImageNet, are needed to evaluate the performance of the proposed MBA.
- It seems to me that the results presented in Table 2 are copied from other papers. Please clarify this; if it is true, the paper should explicitly tell the reader. Also, this is not a fair comparison because the baseline methods use different experimental settings. It is hard to conclude from Table 2 that the good performance is due to the proposed MBA.
- An important detail the paper fails to mention is hyper-parameter selection and early stopping. In addition, no validation set is mentioned. If the proposed method selects hyper-parameter values and the early-stopping epoch based on results on the test set, then it is no wonder the proposed MBA beats other state-of-the-art methods. I wish the authors could clarify this issue in the feedback.
- One disadvantage of MBA is that more convolution kernels are introduced in the subsequent layers, since there are more feature maps (a back-of-the-envelope illustration of this overhead is sketched at the end of this document). Table 1 is not a fair comparison from my perspective, as the proposed MBA has many more parameters than APL and the vanilla conv net. A better comparison would make APL deeper or wider so that its number of parameters is roughly the same as the proposed MBA's. Also, the shallow architecture for the vanilla conv net (error = 2.1%) in Table 1 performs much worse than I expected; a shallow vanilla feed-forward neural network can achieve around 1.4% with careful tuning, let alone a conv net. Another interesting baseline is a vanilla conv net with the same number of feature maps as MBA. Of course, this conv net would have more parameters, but its expressive power should not be worse than MBA's in theory, because MBA can be regarded as a special case of it obtained by setting some convolution kernels to 0.
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a new structure for the deep learning framework. The structure allows multiple rectified linear units (with different biases) to be applied to one output of the lower layer (a convolutional layer in this case). Experiments were conducted to show that a deep neural network with this module achieves state-of-the-art performance on the CIFAR datasets.

Clarity - Justification:
The paper is overall well written and easy to follow.

Significance - Justification:
I am not totally convinced by the paper's motivation, at least for the presented example. On the one hand, the convolution outputs for mouths and eyes will have different biases. On the other hand, the mouths of different people will also have different biases, which undercuts the argument that different biases help to separate mouths from eyes. However, given the good results achieved, I would think there might be something that works which I don't understand (since I don't understand most things). So I think it is still noteworthy, like the other tricks we are trying out. According to this page, other people have reported better results on CIFAR-10: http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#43494641522d3130

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Over the past few years, many tricks have been proposed to improve the performance of deep neural networks. These tricks sometimes work and sometimes don't. Some of them have good motivations; some of them work magically. Usually it is very hard to single out the effect of the proposed method, since there are too many parameters and hyperparameters in deep neural networks. We usually just try different things to see which one works. I think this work is a potentially useful addition to our toolkit.
=====
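As a rough, self-contained illustration of the overhead raised in Review #2 (and the "4x" question in Review #1), the snippet below compares the parameter count of a plain conv -> conv stack with a conv -> MBA -> conv stack. The layer widths (128 maps, K=4, 3x3 kernels) are hypothetical and chosen only to show the scaling; they are not taken from the paper.

# Rough count (hypothetical layer sizes) of how K replicated maps inflate
# the convolution that follows an MBA layer.
def conv_params(c_in, c_out, ksize=3):
    return c_in * c_out * ksize * ksize + c_out   # weights + per-output biases

C, K, NEXT = 128, 4, 128                # assumed widths and MBA replication factor

plain = conv_params(C, NEXT)            # conv -> conv
after_mba = conv_params(K * C, NEXT)    # conv -> MBA (K*C maps) -> conv
mba_own = K * C                         # parameters added by the MBA layer itself

print(plain, after_mba, mba_own)        # 147584 589952 512

The MBA parameters themselves are negligible, but the weights (and multiply-adds) of the following convolution grow by roughly a factor of K, which is why comparisons against APL or a vanilla network at equal depth and width alone may not be apples to apples.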