Paper ID: 544
Title: Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents Normalization Propagation, an alternative to batch normalization, which has recently been popularized to address internal covariate shift when training deep neural networks. NormProp aims to overcome some of Batch Normalization's drawbacks. The method's behavior is investigated on the CIFAR and SVHN datasets and shows moderate improvements.

Clarity - Justification:
The paper is well written and structured.

Significance - Justification:
The paper proposes an original normalization method that brings some benefit with respect to Batch Normalization. Yet although the topic is of great importance today in the deep learning community, the improvements brought by this new method appear to remain rather limited in terms of convergence speed and final accuracy.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper presents Normalization Propagation, an alternative to batch normalization, which has recently been popularized to address internal covariate shift when training deep neural networks. NormProp aims to overcome some of Batch Normalization's drawbacks. The method's behavior is investigated on the CIFAR and SVHN datasets and shows moderate improvements.

The basic idea of the method is to rely on simple assumptions about the distribution of each neuron's pre-activation in every hidden layer to yield a relevant normalization process without computing statistics on mini-batches, and thus without requiring extra computation. The method is described for ReLU activation functions, but hints are given to extend the results to other nonlinear activation functions (though I wonder how easy this actually is).

The main result of the paper, Proposition 1, provides the basic tool for performing normalization so that the pre-activation of a hidden layer in the neural net is close to a canonical distribution with zero mean and unit variance. It says that, provided the weight matrix rows are normalized to unit length, the distance from the covariance matrix of the pre-activations to the identity matrix is upper bounded by a term which depends on the coherence of the weight matrix. Based on this result, the authors provide a normalization scheme for achieving normalization after the ReLU activation. The process may be repeated for successive hidden layers, making normalization efficient and independent of any biased batch statistics. Yet, as stated by the authors, one has no real control over the coherence; one must rely on the usual observation that good representations are usually incoherent and expect that the coherence will be low, which, fortunately, seems to be verified in practice as shown in the experimental section.
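To make the scheme concrete: as I understand it, the per-layer computation (ReLU case, with the trainable scale and bias omitted) amounts to the sketch below, where the two constants are simply the mean and standard deviation of a rectified standard Gaussian.

```python
import numpy as np

# Moments of a rectified standard Gaussian: if z ~ N(0, 1) then
# E[max(0, z)] = 1/sqrt(2*pi) and Var[max(0, z)] = 1/2 - 1/(2*pi).
RELU_MEAN = 1.0 / np.sqrt(2.0 * np.pi)
RELU_STD = np.sqrt(0.5 - 1.0 / (2.0 * np.pi))

def normprop_layer(x, W):
    """One hidden layer under the scheme as I read it (ReLU, no scale/bias).

    x: (batch, d_in) inputs assumed roughly zero-mean, unit-variance per dimension.
    W: (d_out, d_in) weight matrix; its rows are rescaled to unit norm.
    """
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows
    u = x @ W_hat.T                    # pre-activation, approximately N(0, 1) per unit
    h = np.maximum(0.0, u)             # ReLU
    return (h - RELU_MEAN) / RELU_STD  # shift/scale the output back to ~(0, 1)
```

Stacking such layers is what propagates the normalization through the network without any batch statistics.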
The authors next review weight initialization strategies in light of the previous analysis and detail which ones are good candidates for their NormProp strategy. Experimental results show how NormProp allows slightly faster convergence towards normalized activations in all hidden layers compared to Batch Normalization. Besides, it is shown to yield accurate classifiers which may outperform batch normalization on a few image classification tasks derived from the CIFAR and SVHN datasets.

The paper is well written and structured. It proposes an original normalization method that brings some benefit with respect to Batch Normalization. Yet although the topic is of great importance today in the deep learning community, the improvements brought by this new method appear to remain rather limited in terms of convergence speed and final accuracy.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces normalization propagation, a method for approximately normalizing the activations of a neural network in closed form. The method is closely related to batch normalization, but does not require estimation of data-dependent statistics, instead relying on normalization of the weight vector incoming to each hidden unit. The method is theoretically justified by a bound on the covariance of Wx, which involves the weight norms and a coherence term between the rows of the matrix W. This bound becomes tight when we assume zero coherence and unit-norm weights. NormProp further accounts for the ReLU non-linearity by introducing a shift and scaling term, which in expectation preserves the zero-mean and unit-variance property. The method is evaluated on CIFAR-10, CIFAR-100 and SVHN. Furthermore, the benefits of the method over batchnorm are highlighted by showing more effective centering of pre-activation units, a more stable convergence curve, and finally an invariance to batch size.

Clarity - Justification:
(see detailed comments)

Significance - Justification:
(see detailed comments)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I enjoyed reading the paper, especially the analysis performed in Figures 1-4. It is particularly striking that NormProp can achieve almost the same convergence speed (in terms of epochs) using batch size 1. For completeness, I would be interested in seeing SGD and RMSProp baselines in Figures 3 and 4.

One area for improvement, however, is clarity. In particular, various sections of the paper are written such that weight normalization can be interpreted either as a hard constraint via projection onto the unit sphere (line 270, line 591) or as a scaling term in the activation function (line 285). This confusion is resolved by Eq. 9, but this appears much later in the paper.

I also found the results of Proposition 1 to be overstated. Contrary to what is written in the paper, Prop 1 does not say that "the covariance of the pre-activation is approximately canonical". The fact that the covariance can be upper-bounded is meaningless in and of itself: it is the tightness of the bound which will make \Sigma more or less canonical. The paper also brushes aside possible coherence in the filters, coming to the conclusion that "the above bound ensures the dimensions of u are roughly uncorrelated". This statement is unfounded, and I would urge the authors to visualize the eigenvalues of \Sigma in a deep network: the eigenspectrum can be quite peaked, with very few (e.g. 10-20) dominant principal components in a 500-hidden-unit layer. Just because "good representation [...] are roughly incoherent" doesn't mean that SGD will find these uncorrelated representations. Being a first-order optimization method, SGD is blind to possible correlations in the gradient, which translate into a non-zero coherence term. If SGD did find such incoherent representations, KFAC/FANG/PRONG would not yield any advantage over a diagonal natural gradient method.
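As a quick illustration of this point (my own check, not from the paper), one can compare the eigenspectrum of W W^T (which equals Cov(Wx) for uncorrelated, unit-variance inputs) for a random row-normalized matrix versus one whose rows share a strong common component:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, d_in = 500, 500

def row_normalize(W):
    return W / np.linalg.norm(W, axis=1, keepdims=True)

# (a) random row-normalized weights: rows are nearly incoherent
W_random = row_normalize(rng.standard_normal((n_units, d_in)))

# (b) row-normalized weights with strongly correlated rows (a crude stand-in
#     for the correlated filters SGD may actually find)
shared = rng.standard_normal(d_in)
W_correlated = row_normalize(0.3 * rng.standard_normal((n_units, d_in)) + shared)

for name, W in [("random", W_random), ("correlated", W_correlated)]:
    # With uncorrelated, unit-variance inputs, Cov(Wx) = W W^T.
    eigvals = np.sort(np.linalg.eigvalsh(W @ W.T))[::-1]
    print(f"{name:>10}: top eigenvalue {eigvals[0]:7.1f}, "
          f"fraction of variance in top 10 = {eigvals[:10].sum() / eigvals.sum():.2f}")
```

The first case spreads the variance over many directions, while the second concentrates almost all of it in a handful of directions even though every row has unit norm in both cases, which is precisely the regime where the bound says very little.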
The paper would also benefit IMO from describing the method in the more general context of natural gradient descent and of recent papers on data-[in]dependent centering, normalization and whitening. Finally, the paper would also benefit from experiments in settings where batchnorm has been difficult (or impossible) to get working, e.g. recurrent networks or pure online learning.

Other:
* line 151: In (Desjardins et al., 2015) it was shown that an SGD step in the reparametrized space is equivalent to an approximate natural gradient step in the standard model.
* line 428: $\Sigma = \sigma W W^T$ does not follow from Prop 1 (which is a bound), but rather from the identity Cov(Wx) = W Cov(x) W^T, with Cov(x) assumed to be the identity (spelled out below).
* line 591: "after every training iteration". Does this refer to an epoch?
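For reference, spelling out the identity in question (my own note, with x assumed zero-mean):

```latex
\operatorname{Cov}(Wx)
  = \mathbb{E}\!\left[(Wx)(Wx)^\top\right]
  = W\,\mathbb{E}\!\left[x x^\top\right] W^\top
  = W\,\operatorname{Cov}(x)\,W^\top
```

With Cov(x) assumed to be the identity this reduces to W W^T; a scaled identity covariance simply multiplies the result by the corresponding scale factor.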
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a novel approach to solving the problem of internal covariate shift in deep neural network training. Internal covariate shift, a concept introduced by Ioffe & Szegedy (2015), is the problem that the higher layers in a deep neural network can be slow to train because they are effectively trying to learn a function f(Y|X), but during training the distribution of the input, P(X), changes as the parameters of the preceding layers change. Ioffe & Szegedy (2015) introduced a modification to neural network training, called batch normalization, that forces the inputs to each layer to have zero mean and unit variance, where the mean and variance are estimated within a minibatch of training samples. This paper proposes an alternative approach to solving the problem of internal covariate shift in which the inputs to the network are normalized to have zero mean and unit variance (a common choice in many settings), the rows of the weight matrices are constrained to be of unit length, and the outputs of the ReLU nonlinearities are shifted and scaled to have zero mean and unit variance. As in batch normalization, additional trainable scaling and bias variables are introduced to allow the network to learn general functions. The paper also proposes weight initialization strategies that should work well with the proposed training algorithm, discusses the application of the proposed method to networks with convolutional layers or nonlinearities other than ReLUs, and describes experiments intended to show the effectiveness of the algorithm.

Clarity - Justification:
The authors appear to have altered the paper template to shoehorn more text into the available space. The ICML template has about 13 lines of text for every two inches of vertical space, but this paper fits 14 lines of text into the same two inches. The text is visibly more crowded than in the original ICML template and is definitely more difficult to read because of the crowding. In general, the writing style in the paper is overly verbose, and the authors should have focused their efforts on editing their text for conciseness instead of changing the template. For example, the abstract is a bit long, and the discussion of different approaches to weight initialization could be much shorter, since the authors can cite the relevant papers.

Significance - Justification:
The algorithm described in this paper is a nice combination of ideas that is easy for other researchers to implement and test. I could easily see it supplanting batch normalization if other researchers can replicate the results in this paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I've already highlighted the positive aspects of this paper in my justification for the significance rating. Unfortunately, the paper also has some substantial negative aspects in addition to the issues with overly verbose writing and the alteration of the paper template mentioned above.

First, the analysis supporting the proposed algorithm can only be applied to the input layer of a network. The canonical error bound (Proposition 1) presumes that the input features are zero mean and have a scaled identity covariance matrix. It isn't at all clear that the inputs to later layers, which will be vectors of random variables having a scaled and shifted rectified Gaussian distribution, will have the proper covariance for the analysis to hold. To be fair, though, the original batch normalization paper simply *assumed* that the input to each layer would be roughly Gaussian distributed and made no attempt to prove this, so perhaps the authors of this paper deserve some credit for trying to provide more analytic support.

Second, some of the arguments in the paper are hard to follow or are not entirely sound.

"...BN leads to a better convergence irrespective of whether the given dataset has been normalized (mean and std) or not. This is because of explicit normalization at every layer entailed by BN. While this might seem as an attractive feature of BN, it also reflects the incapability of BN to exploit any normalization already done over the dataset." I simply don't see how you can argue that batch normalization's robustness to dataset normalization is a disadvantage.

"...the gradient computation step can lead to unbounded weight length, while weight-decay prevents this during the update step although it doesn't practically achieve a desired value of weight length." This argument is a bit hard to follow. Weight decay (L2 regularization) doesn't enforce a bound on the lengths of the rows of the weight matrices, so I don't see how you can say it prevents an unbounded weight length. One could argue that steepest descent isn't appropriate if you are enforcing an L2-norm constraint on the weights, and that the real solution is to do proper gradient descent on the manifold defined by the L2-norm constraint, but that is really off-topic for this paper...

"Batch Data Normalization also serves as a regularization since each data sample gets a different representation each time depending on the mini-batch it comes with." This argument for batch data normalization is inconsistent with your previous argument for it based on having streaming data.

In Equation 9, doesn't the presence of a scaling factor (\gamma_{i}) that differs from 1.0 reduce the benefit of the "Jacobian factor" scaling? Would a simpler implementation initialize the scaling factors (gammas) to 1.0 / 1.21 and then just let the learning adjust them?

In Section 4.4 you say you "[reduce] the learning rate by half whenever the training error starts saturating," but in Section 5 you say you "reduce [the learning rate] by half every 10 epochs." These two statements seem inconsistent; please clarify which schedule was actually used.

In the experiments described in Section 5.2, you should also track the input distribution variance as well as the mean if you want to claim that "the input statistics to hidden layers are roughly preserved." It would be interesting to see how the input mean and variance to an entire population of units in each hidden layer change with training, rather than tracking individual units.
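Concretely, a hypothetical helper along these lines (mine, not the authors' code) would be enough to log such population statistics once per epoch:

```python
import numpy as np

def population_stats(hidden_acts):
    """Mean and variance pooled over all units and all samples of one layer.

    hidden_acts: array of shape (num_samples, num_units), e.g. the inputs to a
    hidden layer collected on a fixed held-out batch after each training epoch.
    """
    flat = np.asarray(hidden_acts).ravel()
    return float(flat.mean()), float(flat.var())

# hypothetical usage, once per epoch and per layer:
# stats_history[layer_idx].append(population_stats(layer_inputs))
```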
In Figure 2, the first two plots should have the same y-axis limits. Figure 4 is not very clear -- it might be better to choose one mini-batch size as the baseline and show the deltas for the other batch sizes.

There are some consistent errors in English usage that should be corrected because they are quite distracting:
* "it's" -> "its" ("it's" is the contraction of "it is," while "its" is the possessive form of "it")
* "specially" -> "especially"
* "std" -> "standard deviation"

=====