Thank you for the valuable comments and suggestions. We will address all of your comments and revise the paper to provide more insight.
R2, R6: comparison with Absolute Value Rectifier (AVR)
Thank you for your suggestions. We experimented with the AVR nonlinearity on the CIFAR-10, CIFAR-100, and ImageNet benchmarks. For CIFAR-10/100, we evaluated the baseline architecture described in Sec. 3.1 with ReLU replaced by AVR; for ImageNet, we evaluated the All-CNN-B architecture with ReLU replaced by AVR in the conv1-4 layers. Below, we provide error rates (single model) for ReLU, AVR, and CReLU, in that order:
CIFAR-10: 9.17, 8.32, 8.37
CIFAR-100: 36.30, 35.00, 33.68
ImageNet (top-1 error rate at epoch 14): 47.08, 46.74, 44.87
In our experiments, AVR achieved lower error rates than ReLU on both CIFAR-10 and CIFAR-100. Compared to CReLU, AVR matched the performance on CIFAR-10 but showed a higher error rate on CIFAR-100. On ImageNet, the models had not yet converged given the limited time (convergence typically takes about 45 epochs), but the preliminary results so far show a clear advantage of CReLU over AVR. These results suggest that (1) preserving both the modulus and the positive/negative phase information can be beneficial in a complementary way and (2) CReLU consistently achieves similar or better classification results compared to ReLU or AVR. We will include more control experiments and analysis in the final version.
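For reference, the relationship between the three nonlinearities compared above can be sketched in a few lines of NumPy. This is only an illustration on a toy vector, not the code used in our experiments: AVR keeps the modulus and discards the sign, while CReLU keeps the positive and negative phases as separate channels, so the AVR output equals the sum of the two CReLU halves and the pre-activation equals their difference.

```python
import numpy as np

def relu(z):
    # Keeps the positive phase only.
    return np.maximum(z, 0.0)

def avr(z):
    # Absolute value rectifier: keeps the modulus, discards the sign.
    return np.abs(z)

def crelu(z, axis=-1):
    # Concatenated ReLU: positive and negative phases as separate channels.
    return np.concatenate([relu(z), relu(-z)], axis=axis)

z = np.array([-2.0, 0.5, 3.0])   # toy pre-activations (illustrative values)
pos, neg = np.split(crelu(z), 2)

# AVR is recoverable from CReLU (sum of halves), and so is the sign
# information (difference of halves), but not the other way around.
modulus_from_crelu = pos + neg
preactivation_from_crelu = pos - neg
```

Because the modulus can be read off from the CReLU channels but the phase cannot be read off from AVR, a layer on top of CReLU has access to strictly more information than one on top of AVR.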
R2, R6: correlation of outgoing weights
We measured the correlation of the outgoing weights between the positive- and negative-phase pairs of filters in the CReLU model. For comparison, we also computed the correlation of the outgoing weights between non-corresponding pairs of filters. We report the mean and the standard deviation of both. The correlations of the outgoing weights for the conv1-conv7 layers of the CReLU (conv1-7) model on ImageNet are reported below (please see Table S9 for the network architecture):
conv1 pair: 0.372 (± 0.220) non-pair: 0.165 (± 0.154)
conv2 pair: 0.180 (± 0.149) non-pair: 0.157 (± 0.137)
conv3 pair: 0.462 (± 0.249) non-pair: 0.120 (± 0.120)
conv4 pair: 0.175 (± 0.146) non-pair: 0.119 (± 0.100)
conv5 pair: 0.206 (± 0.136) non-pair: 0.105 (± 0.093)
conv6 pair: 0.256 (± 0.124) non-pair: 0.086 (± 0.080)
conv7 pair: 0.131 (± 0.122) non-pair: 0.080 (± 0.070)
On average, the outgoing weights of corresponding pairs are more correlated than those of non-corresponding pairs. However, the correlations are far below 1 for all layers. This suggests that the network with CReLU units does not rely on the modulus information alone.
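The measurement above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact evaluation script: the function name `phase_correlations` is hypothetical, the next layer's weights are assumed to have shape (out_channels, 2*C, kh, kw) with input channels [0, C) holding the positive phases and [C, 2C) the matching negative phases, and absolute Pearson correlation is assumed as the similarity measure.

```python
import numpy as np

def phase_correlations(w_out):
    """Correlation of outgoing weights for paired vs. non-paired phases.

    w_out: next-layer weights, shape (out_channels, 2*C, kh, kw).
    Assumed layout: channel i is the positive phase of filter i and
    channel C + i is its negative phase.
    """
    two_c = w_out.shape[1]
    c = two_c // 2
    # One row per incoming channel: every weight that leaves that channel.
    flat = np.moveaxis(w_out, 1, 0).reshape(two_c, -1)
    corr = np.abs(np.corrcoef(flat))           # (2C, 2C), assumed absolute
    idx = np.arange(c)
    pair = corr[idx, idx + c]                  # phase i vs. its own partner
    cross = corr[:c, c:]                       # positive i vs. negative j
    non_pair = cross[~np.eye(c, dtype=bool)]   # keep only j != i
    return (pair.mean(), pair.std()), (non_pair.mean(), non_pair.std())
```

A pair correlation close to 1 would indicate that the next layer treats the two phases interchangeably, which amounts to using only the modulus; values well below 1 indicate that the phases are weighted differently.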
R2, R5: clarification on linear decoding
In the case of max-pooling (Theorem 2.2), we assume that W_{x} is given, i.e., that the switch units for max-pooling are known, so that the theoretical analysis remains tractable. We agree that nonlinear algorithms (e.g., [1]) could potentially provide qualitatively better reconstructions, but it would be very challenging to derive a mathematical characterization of their reconstruction properties comparable to the one available for linear decoding.
R2: clarification on Theorem 2.2
We apologize for the confusion about the bounds provided in Table 6. We agree that the theoretical bounds are informative only if the activated weight matrix has nearly orthonormal columns. As a result, the upper bound may not provide meaningful insight into the reconstruction properties.
Instead, we report the empirical reconstruction ratio ||x-x'||/||x|| for the conv2 and conv5 layers of the CReLU (+half) models trained on CIFAR-10 and CIFAR-100, since max-pooling follows those layers (please see Table S2 for the network architecture). The ratio is averaged over 2000 test examples and reported together with its standard deviation. For comparison, we run the same experiment with random Gaussian convolution filters. We observe that random filters recover only about 1% of the original input, whereas the learned filters span more of the input domain.
CIFAR-10
Learned (conv2) 0.92±0.009 (conv5) 0.96±0.001
Random (conv2) 0.99±0.002 (conv5) 0.99±0.002
CIFAR-100
Learned (conv2) 0.93±0.009 (conv5) 0.96±0.004
Random (conv2) 0.99±0.002 (conv5) 0.99±0.002
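For clarity, the reported statistic can be computed as in the sketch below. The function name `reconstruction_ratio` and the array layout (one example per row after flattening) are illustrative assumptions; the reconstructions x' themselves would come from the decoding procedure described in the paper, which is not reproduced here.

```python
import numpy as np

def reconstruction_ratio(x, x_rec):
    """Mean and standard deviation of ||x - x'|| / ||x|| over a batch.

    x, x_rec: arrays with the same shape, first axis indexing examples.
    """
    x = x.reshape(len(x), -1)
    x_rec = x_rec.reshape(len(x_rec), -1)
    ratio = np.linalg.norm(x - x_rec, axis=1) / np.linalg.norm(x, axis=1)
    return ratio.mean(), ratio.std()
```

A ratio of 0 corresponds to perfect reconstruction and a ratio of 1 to recovering nothing, so lower values are better.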
R5: quality of reconstruction
We proved in Proposition 2.1 that an input lying in the range of the learned convolution filters can be recovered from its CReLU activation. In other words, the proposed reconstruction algorithm can qualitatively measure how well the learned convolution filters represent the input domain. The goal of Figure 6 is to assess how much information the convolution filters trained for recognition on ImageNet retain for reconstruction, not to compete with other work that aims at high-quality reconstruction from supervised CNNs through additional learning of a decoder [1] or through expensive optimization [2].
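The intuition behind the recovery statement can be illustrated with a toy dense (non-convolutional) analogue. This is a sketch under simplifying assumptions and not the reconstruction algorithm from the paper: `W` is a hypothetical random filter matrix with one filter per row, there is no bias or pooling, and decoding uses the pseudo-inverse. The difference of the two CReLU halves gives back W x exactly, so a linear decoder returns the projection of x onto the span of the filters.

```python
import numpy as np

def crelu(z):
    # Positive and negative phases, concatenated.
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)])

def linear_decode(W, a):
    # Difference of the two halves restores the linear response W @ x;
    # the pseudo-inverse then projects back onto the span of the filters.
    pos, neg = np.split(a, 2)
    return np.linalg.pinv(W) @ (pos - neg)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 10))          # 3 toy filters on a 10-dim input
x_in = W.T @ rng.normal(size=3)       # lies in the span of the filters
x_out = rng.normal(size=10)           # generic input, mostly outside it

rec_in = linear_decode(W, crelu(W @ x_in))     # recovered exactly
rec_out = linear_decode(W, crelu(W @ x_out))   # only the projection survives
```

In this toy setting the reconstruction error is exactly the part of the input outside the span of the filters, which is why reconstruction quality can serve as a probe of how much of the input domain the filters cover.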
[1] Dosovitskiy et al., Inverting visual representations with convolutional networks, CVPR, 2016
[2] Mahendran et al., Understanding deep image representations by inverting them, CVPR, 2015