We would like to thank all the reviewers for their productive feedback.
Assigned Reviewer (AR) 1:
1. ...seeing SGD and Rmsprop baselines in Fig 3 & 4.
We have anonymously uploaded the graphs for Figures 3 and 4 at http://anony.ws/image/JYfs and http://anony.ws/image/JYfY, respectively, with RMSProp and SGD results alongside SGD-Momentum. Surprisingly, we found that RMSProp performed worse than SGD-Momentum for both BN and NormProp. On the other hand, RMSProp generally performed better than SGD, but SGD with batch size 1 performed very similarly to SGD-Momentum.
2. ...visualize the eigenvalues of \Sigma in a deep network...
We found that there were indeed a few peaks in the eigenvalue spectrum of the hidden layers' inputs, as you suspected. However, we believe strictly uncorrelated dimensions are not absolutely necessary for NormProp. As long as the bound in Proposition 1 is roughly tight, the learning process seems to compensate for the approximation.
3. ...after every training iteration...refer to epoch?
After every iteration.
AR2:
1. It isn't at all clear...inputs to later layers...will have proper covariance for analysis to hold.
Proposition 1 shows that if the input to a layer has roughly uncorrelated dimensions, then the output will have a similar covariance property, provided the weight matrix is incoherent and the weight vectors have unit length. The Gaussian assumption on the pre-activation (also made by BN) simply paves the way for our parametric approach, which calculates the mean and standard deviation values for all layers assuming the previous layer's covariance is roughly normalized. Note that having roughly uncorrelated dimensions does not require the Gaussian assumption. This is why it is valid to apply the same parametric normalization trick after ReLU, where the output follows a Rectified Gaussian distribution.
2. ...how you can argue...BN's robustness to dataset normalization...
The statement refers to the scenario in which the data is already normalized. In this case, BN would still need to perform hidden-layer normalization, whereas NP simply propagates it. We do appreciate BN, which inspired our work in the first place. However, we understand that our tone was aggressive, and we have softened it accordingly.
3. Weight decay doesn't...bound...lengths...
Setting the regularization coefficient to infinity implies that all weights obtained from optimization are zero. A smaller weight decay coefficient acts as a weaker regularizer but still prevents the weight lengths from exploding, since weight decay acts as a penalty on the objective we minimize.
4. ...simpler implementation initialize scaling factor to 1/1.21...
That is a good point and sounds more intuitive.
5. In section 4.4...reduce learning rate by half...error starts saturating...section 5... reduce...every 10 epochs.
The Section 5 experiments were done for convergence analysis. The final classification experiments reduce the learning rate after every 25 epochs, by which point the training error indeed saturates.
6. Figure 2...plots should have the same y axis limits.
The ranges differ because, after ReLU, the output for BN is non-negative, while NormProp normalizes the ReLU output to have zero mean.
AR3:
1. The paper proposes an original normalization method...remain rather limited in terms of convergence speed...
In contrast to BN, our method addresses Internal Covariate Shift parametrically; thus it is fundamentally different. Also, NormProp removes the need to calculate the mean and standard deviation of higher layers and is hence faster in terms of time per epoch, while enjoying a convergence speed similar to BN's. Specifically, on CIFAR-10, one epoch of training using NormProp takes 84 sec compared with 96 sec for BN on our single-GPU machine. We will include these numbers in the final version (if accepted).
AR 1,2,3:
We would also like to mention that the extension to other activation functions is straightforward and only requires computing the post-activation mean and standard deviation with a standard Normal distribution as input. This can be done either analytically or by simulation. The rest of the NormProp implementation details remain the same as shown for ReLU.
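As a rough illustration of the "analytically or by simulation" point, the sketch below estimates the post-activation mean and standard deviation by Monte Carlo sampling from a standard Normal input and compares the ReLU estimate with the known closed-form values for a rectified standard Gaussian. The function name `post_activation_stats`, the sample size, and the seed are illustrative assumptions, not part of our implementation.

```python
import numpy as np


def relu(x):
    """Rectified linear unit."""
    return np.maximum(x, 0.0)


def post_activation_stats(activation, num_samples=1_000_000, seed=0):
    """Monte Carlo estimate of the mean and std of activation(z), z ~ N(0, 1).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(num_samples)
    a = activation(z)
    return a.mean(), a.std()


# Closed-form values for ReLU applied to a standard Normal input:
#   E[max(0, z)]   = sqrt(1 / (2*pi))
#   Var[max(0, z)] = (1/2) * (1 - 1/pi)
relu_mean = np.sqrt(1.0 / (2.0 * np.pi))
relu_std = np.sqrt(0.5 * (1.0 - 1.0 / np.pi))

# Simulation for ReLU (should agree with the closed form up to sampling noise).
sim_mean, sim_std = post_activation_stats(relu)

# The same simulation applies unchanged to another activation, e.g. tanh.
tanh_mean, tanh_std = post_activation_stats(np.tanh)

print(f"ReLU analytic : mean={relu_mean:.4f}, std={relu_std:.4f}")
print(f"ReLU simulated: mean={sim_mean:.4f}, std={sim_std:.4f}")
print(f"tanh simulated: mean={tanh_mean:.4f}, std={tanh_std:.4f}")
```

Under these assumptions, the simulated constants would simply replace the ReLU-specific mean and standard deviation in the post-activation normalization step.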