Paper ID: 281
Title: A Kronecker-factored approximate Fisher matrix for convolution layers

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
Under some independence assumptions, the Fisher matrix of a neural network layer can be factored as the Kronecker product of two small matrices, which can make second-order optimization of neural networks tractable (see the sketch at the end of this review). This work extends the idea to the convolutional layers of neural networks. The factorization rests on three assumptions:
1. The activations are independent of the derivatives.
2. The activations and derivatives are spatially homogeneous.
3. The activations and derivatives are spatially uncorrelated.

Clarity - Justification:
Background and related work are well presented under the page constraint. The paper is clearly written and presents the main idea well. The additional technical details are included in the supplementary material.

Significance - Justification:
While second-order optimization methods are extensively studied, most neural networks (which are hugely successful these days) use only first-order methods. This work, along with a few previous works, provides important insights into practically applying second-order methods to the optimization of neural networks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think this paper is well written. Here are some comments and suggestions:

Natural gradient vs. first-order methods: I am wondering whether it is conclusive that second-order methods are "better" than first-order methods for such large-scale non-convex problems. While the authors have done a good job of covering related work in Section 2.1 under the page constraint, I think it would be a good idea to add more background on this aspect. In the experiments section, it is worthwhile to conduct a small-scale experiment on optimization with the natural gradient without KFC. Ideally, adding two additional curves to Figure 1 and Figure 2 would make it clearer whether the drop in test error is due to the approximation or to the natural gradient method itself. (I have not been able to look into the supplementary material, but it is good to know that the assumptions made in Section 3 have been verified and further discussed.)

Experiments: In Figure 1 (CIFAR) and Figure 2 (CIFAR), the test error of KFC-pre is worse than that of SGD. This is interesting and requires some further discussion.

Further comments: Taking this problem further, I think one interesting question to answer is: can one design a neural network (with a generalized convolution operation) whose natural gradient in the convolutional layers is exactly of the form proposed in the paper? Being able to explicitly control the factorized form would also provide a trade-off between cost and model capacity.
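To make the factorization described in the summary above concrete, the following is a minimal NumPy sketch of a Kronecker-factored Fisher block for a single fully connected layer (the original K-FAC setting). It is an editorial illustration under the reviewers' description, not the authors' implementation; all variable names and sizes are assumptions chosen for readability.

    import numpy as np

    # Minimal sketch: Kronecker-factored Fisher block for one fully connected
    # layer.  With gradient DW = Ds a^T, vec(DW) = a (x) Ds (column-major vec),
    # so F = E[(a a^T) (x) (Ds Ds^T)] ~= E[a a^T] (x) E[Ds Ds^T] under the
    # assumption that activations and derivatives are independent.
    rng = np.random.default_rng(0)
    n_in, n_out, batch = 5, 4, 1000
    a = rng.standard_normal((batch, n_in))        # layer inputs
    ds = rng.standard_normal((batch, n_out))      # pre-activation derivatives

    Omega = a.T @ a / batch                       # E[a a^T],   n_in  x n_in
    Gamma = ds.T @ ds / batch                     # E[Ds Ds^T], n_out x n_out
    F = np.kron(Omega, Gamma)                     # (n_in*n_out)^2 entries; never
                                                  # formed explicitly in practice

    # The payoff: (Omega (x) Gamma)^{-1} vec(G) = vec(Gamma^{-1} G Omega^{-1}),
    # so only the two small factors need to be inverted.
    G = rng.standard_normal((n_out, n_in))        # gradient dL/dW
    step_small = np.linalg.solve(Gamma, G) @ np.linalg.inv(Omega)
    step_full = np.linalg.solve(F, G.reshape(-1, order="F"))
    assert np.allclose(step_small.reshape(-1, order="F"), step_full)

In the convolutional case treated by the paper, the role of a is played by extracted patches, with the additional spatial assumptions listed in the summary.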
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper is an extension of the Kronecker-Factored Approximate Curvature (K-FAC) optimizer to convolutional neural networks. K-FAC was created for fully connected networks and is not directly applicable to convolution, notably because of the weight sharing. The paper reviews the K-FAC method quite well and proposes three approximations that allow the extension to the convolutional setting. The method is validated with classification on CIFAR and SVHN.

Clarity - Justification:
The paper reads well, but the notation is too heavy on indices and special symbols. The notation could be simplified, for example by doing away with the bar sign that denotes concatenation with the bias vector.

Significance - Justification:
The paper describes an interesting new direction for optimization methods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It is not clear whether the baseline SGD in the experiments uses classical momentum or Nesterov momentum. Nesterov momentum converges faster on the datasets considered, and it is what was used in the cited (Kingma & Ba, 2015) paper to justify not comparing with RMSProp.

Line 609 says that the Kronecker factors are not computed on the GPU; it is not clear from the paper or the long appendix how they are computed. If they are computed on the CPU, then it seems the solve time and transfer time would be a significant overhead for the method.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces Kronecker Factors for Convolution (KFC), an extension of K-FAC to convolutional layers. The basic idea behind K-FAC is to model the per-layer block diagonal of the Fisher as the Kronecker product of smaller, tractable matrices: the covariance of unit activations and the covariance of pre-activation derivatives. KFC extends this idea to convolutional layers, with the extra assumption that the covariance matrix over activations is homogeneous over spatial locations, i.e. the covariance is parametrized as $C(i, j, \delta)$, where $i, j$ are feature-map indices and $\delta$ is the spatial offset between two activations. For $M \times M$ convolutional filters, this is thus equivalent to whitening each $M \times M$ patch independently, in contrast to batch normalization or PRONG, which only normalize or whiten, respectively, along the feature dimension. Experiments are carried out on CIFAR-10 and SVHN.
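To make the patch view in this summary concrete, the following is a minimal NumPy sketch of the activation-side factor such a method would build; it is an editorial illustration under the reviewer's description, not the authors' code, and the sizes, variable names, and plain-Python im2col are assumptions.

    import numpy as np

    # Activation-side Kronecker factor for a convolution layer: covariance of
    # extracted k x k patches, averaged over spatial locations (the spatial
    # homogeneity assumption), with a homogeneous coordinate for the bias.
    rng = np.random.default_rng(0)
    batch, c_in, H, W, k = 8, 3, 10, 10, 3        # illustrative sizes only
    x = rng.standard_normal((batch, c_in, H, W))

    def extract_patches(x, k):
        """im2col: one row per (example, location); c_in*k*k columns."""
        b, c, h, w = x.shape
        rows = []
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                rows.append(x[:, :, i:i + k, j:j + k].reshape(b, -1))
        return np.concatenate(rows, axis=0)

    patches = extract_patches(x, k)                             # (batch*T, c_in*k*k)
    patches = np.hstack([patches, np.ones((len(patches), 1))])  # bias coordinate

    # One small (c_in*k*k + 1)-square factor shared by all spatial locations.
    Omega = patches.T @ patches / len(patches)

    # Contrast: batch normalization / PRONG act only along the feature
    # dimension, i.e. on a c_in x c_in covariance, ignoring offsets inside
    # the patch.
    feats = x.transpose(0, 2, 3, 1).reshape(-1, c_in)
    Sigma = feats.T @ feats / len(feats)
    print(Omega.shape, Sigma.shape)                             # (28, 28) (3, 3)

Whitening with Omega de-correlates the whole patch (features and spatial offsets jointly), whereas Sigma only covers the c_in feature dimension, which is the contrast the reviewer draws with batch normalization and PRONG.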
Clarity - Justification:
The authors are very thorough in their derivations, and the notation is clear. My only concern is the constant back and forth required between the main text and the appendix.

Significance - Justification:
This is a strong technical paper, which tackles the important topic of scaling second-order optimization methods to large (convolutional) neural networks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
My main issue with the paper as it stands is the constant back and forth required between the main paper and the appendix. I believe the paper would benefit greatly from a reorganization, with the following sections being moved to the appendix:
* Section 2.1: this section provides a more general background on second-order methods, but is not strictly required for understanding the contributions of this paper.
* Lines 286-329: as a background section on K-FAC, it is not necessary to go into the details of how to adapt the Kronecker factorization to include damping. A simple reference to the K-FAC paper would suffice, or a link to the appendix.
* The same goes for the equations at lines 522-536.
* Adagrad, RMSProp, and Adam results should be included in the appendix for completeness.

This would then allow the following to be included in the main text:
* Experiments D.1, which study how the modeling assumptions hold in practice. Very important to convince the reader and give an intuitive understanding.
* Experiments D.2: the comparison to batch normalization (the most widely used method at the moment).
* The proof of Theorem 1 (the main result of the paper).
* The notation at line 570, without which Theorem 2 is unintelligible.

Could the authors also expand on why SUD was necessary? I appreciate from D.1 that this assumption may be more adequate for gradients than for activations; however, it is not clear to me whether it was "necessary" from an algorithmic perspective, or whether it simply worked well enough not to bother modeling spatial correlations in the gradients. I would also strongly encourage the authors to include in Figures 1-2 KFC results with SUD applied to both activations and gradients (the approximation made by PRONG).

Other:
* Equation (4) is missing the $\mathcal{D}$: it should read $\mathrm{vec}(\mathcal{D}W_l)$, not $\mathrm{vec}(W_l)$.
* Line 259: again missing $\mathcal{D}$; it should be $\tau_l = \mathcal{E}[\mathcal{D}s_l \mathcal{D}s_l^T]$.
* Lines 497-498: isn't this simply $T_1 T_2$, given that the image is assumed to be padded by $R$ (lines 348-350)?
* The numbers at line 441 do not add up. There are 128 × 48 × 8 × 8 = 153,728 weights, and thus 23.6 billion entries in the Fisher (not 60.5 billion); see the arithmetic check after this review.
* Fix the discrepancy between lines 536-542, which use weights and biases, and line 441, which uses homogeneous coordinates.
=====
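A quick arithmetic check of the 23.6 billion figure cited in the line-441 bullet above, taking the stated parameter count of 153,728 at face value (the layer dimensions themselves are not re-derived here):

\[
153{,}728^{2} = 23{,}632{,}297{,}984 \approx 2.36 \times 10^{10} \approx 23.6 \text{ billion entries},
\]

whereas $60.5$ billion entries would correspond to roughly $\sqrt{6.05 \times 10^{10}} \approx 246{,}000$ parameters.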