Paper ID: 1016
Title: Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors study replacing the rectifier nonlinearity in deep vision architectures with a nonlinearity that keeps both the positive and negative parts of each unit, (R(x), R(-x)). The experiments appear careful and show that this improves performance, mostly when the new nonlinearity is used in the lower layers. The nonlinearity is motivated by the observation that filters in rectifier architectures come in approximate pairs pointing in opposite directions; with the new nonlinearity this pairing disappears. The paper also gives insights into the operation of deep convolutional nets by showing the filter pairing, studying the reconstruction properties of the nonlinearities, and demonstrating improved invariance properties.

Clarity - Justification:
The paper is nicely written.

Significance - Justification:
It is good to see insights into the operation of deep networks. The new nonlinearity might become useful in new architectures.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- It would be convenient to explain in the Figure 2 caption what a pairing filter is.
- I am not sure how to judge the quality of the Figure 6 reconstructions. What should I compare them to?
- The reconstructions are produced using a linear map? Why?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The main contributions are:
- The paper shows that preserving the negative responses of ReLU units is helpful for CNNs (at least in the lower layers).
- This is motivated by the observation that the lower layers of CNNs end up learning the negative of each filter.
- A new activation function (CReLU) is defined, which maps x to (max(0, x), max(0, -x)).
- Experimental results on CIFAR-10/100 and ImageNet show that replacing ReLU with CReLU improves generalization (even when the number of units is halved to preserve the number of parameters).

Clarity - Justification:
The paper is well written and easy to follow. The motivation for designing CReLUs is very compelling. The experiments are well designed and the results are convincing.

Significance - Justification:
This paper proposes a simple modification to ReLU units that consistently gives better results. From a strictly modeling perspective, the contribution is incremental. However, its simplicity makes it likely that many neural net practitioners will use it and find it helpful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The main strength of this paper comes from the experimental results showing that using CReLUs in the lower layers of conv nets gives consistent improvements over ReLUs. For example, adding CReLU to the VGG net applied to CIFAR-100 reduces the error rate from 29.28 to 26.22. The paper also provides a detailed discussion of why CReLUs are helpful (Section 4), analyzing them from the point of view of regularization, increased invariance, and information preservation.

One way to further improve the analysis would be to see how well the absolute value non-linearity performs. As the paper mentions, this preserves the modulus information but removes the phase (lines 198-200). This comparison would help tease apart whether the network benefits from retaining only the modulus, or whether the phase is also important. Intuitively it seems that the phase would be important, but some redundancy in the feature space might be sufficient to compensate for it. Another way to test this would be to see how the network uses the two components of the CReLU: are the outgoing weights from them related in some way? If they are similar (normalized inner product close to 1), then the network is really just interested in the modulus.
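To make this check concrete, here is a minimal sketch (PyTorch-style; the names CReLU, Abs and phase_weight_similarity are my own illustrative choices, not taken from the paper, and the slicing assumes the positive part occupies the first half of the channels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CReLU(nn.Module):
    """Concatenated ReLU: maps x to [max(0, x), max(0, -x)] along the channel axis."""
    def forward(self, x):
        return torch.cat([F.relu(x), F.relu(-x)], dim=1)

class Abs(nn.Module):
    """Absolute-value rectification |x|: keeps the modulus but discards the phase
    (the baseline comparison suggested above)."""
    def forward(self, x):
        return x.abs()

def phase_weight_similarity(next_conv: nn.Conv2d, num_pairs: int):
    """Cosine similarity between the outgoing weights of each positive/negative
    channel pair feeding into `next_conv`, whose input is assumed to be a CReLU
    output with channels [0, num_pairs) = positive part, [num_pairs, 2*num_pairs) = negative part."""
    w = next_conv.weight.detach()                           # (C_out, 2 * num_pairs, k, k)
    pos = w[:, :num_pairs].permute(1, 0, 2, 3).flatten(1)   # (num_pairs, C_out * k * k)
    neg = w[:, num_pairs:].permute(1, 0, 2, 3).flatten(1)
    return F.cosine_similarity(pos, neg, dim=1)             # one similarity per filter pair

# Toy usage: a conv layer sitting on top of a CReLU with 16 filter pairs (32 input channels).
conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
print(phase_weight_similarity(conv2, num_pairs=16).mean().item())
```

If these similarities concentrate near +1 after training, the next layer is effectively summing the two components and only using the modulus; values spread away from +1 would indicate that the phase is also being used.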
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a modification of the popular half-rectification in CNNs that keeps the positive and negative phases in separate outputs. This ensures that each layer is linearly invertible, and it responds to the empirical observation that ReLU networks tend to group filters into negatively correlated pairs. Numerical experiments confirm the advantage offered by CReLU (concatenated ReLUs) on mid- and large-scale image classification.

Clarity - Justification:
The paper is clearly written, with sufficient discussion of prior work, model analysis, and numerical experiments. The notation is lightweight and the figures are informative.

Significance - Justification:
This paper presents an interesting alternative to ReLU that, in a sense, simplifies the analysis of CNNs while slightly improving their performance. The paper presents a theoretical analysis of the properties of the resulting network and justifies the advantage of the approach through several experiments. Since this approach is orthogonal to other recent CNN optimization techniques (such as residual learning), it would be interesting to see whether it can help advance the state of the art.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Here are some detailed comments:
- Although the baselines considered in the paper are important, it would be informative to also compare with the modulus (or full-rectification) nonlinearity, which can be seen as a CReLU followed by an averaging of the two phase components.
- Perhaps related to the previous comment: how much averaging does the network do between the two phase components? This can easily be measured, for example, by computing the correlation between the two slices of the next layer corresponding to the positive and negative phases. It would be interesting to see whether the network chooses to keep the two contributions separate (which would be strange, since these two responses are very sensitive to local shifts, so one would imagine there is not a lot of useful information in the variability within these two-dimensional subspaces).
- Perhaps I did not understand the notation, but Theorem 2.2 together with Table 6 seems to give a vacuous result. One can safely say that any reasonable inverse x' satisfies the bound with constant 1 (it suffices to set x' = 0), so I am a bit puzzled by the constants reported in Table 6. What is the value of the bound in that case?
- In a sense, this reflects the fact that recovery could possibly be improved by using a nonlinear decoder instead of the linear one. Although it is true that each convolution + CReLU can be inverted with a linear map, the composition of these layers together with pooling means that there is structure in the image of each map. Could the authors comment on the pertinence of using a linear decoder?
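Regarding the last comment, a small sketch of the exact one-layer linear inversion may help frame the question (illustrative code, not from the paper; a fully-connected layer stands in for the convolution, which is also just a linear map):

```python
import torch

def crelu(z):
    # Concatenate positive and negative parts: [max(0, z), max(0, -z)]
    return torch.cat([torch.relu(z), torch.relu(-z)], dim=-1)

torch.manual_seed(0)
n_in, n_out = 16, 64                 # overcomplete layer: W has full column rank a.s.
W = torch.randn(n_out, n_in)
x = torch.randn(5, n_in)

y = crelu(x @ W.T)                   # one linear + CReLU layer
z = y[:, :n_out] - y[:, n_out:]      # relu(z) - relu(-z) recovers z = W x exactly
x_rec = z @ torch.linalg.pinv(W).T   # linear decoder (least-squares inverse of W)

print(torch.allclose(x, x_rec, atol=1e-4))  # True for a single layer
```

For one layer the linear decoder is exact because relu(z) - relu(-z) = z. Once several such layers are composed with pooling, however, the feature maps occupy a structured subset of the ambient space, which is precisely why a nonlinear (or learned) decoder might reconstruct better than the linear one.

=====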