We thank the reviewers for their careful and enthusiastic feedback. We especially thank R1 for valuable suggestions on how to fit all of the material within the page limit while maintaining readability.

R1: why was the SUD assumption necessary?
As R1 suspects, this assumption was not necessary to achieve tractability. We have also explored an alternative version of the algorithm which significantly weakens the SUD assumption, so that it can capture spatial correlations in the gradients. Unfortunately, this more general version is also much more complicated to describe and implement, and we did not observe any situations where it improved upon the algorithm presented in the paper. Since the paper is already rather notation-heavy, we decided it was best to stick with the simpler version.

R1: include a comparison against a version which assumes spatially uncorrelated activations
We will include this comparison in the final version.

R2: is the drop in test error due to the approximation or to natural gradient itself?
R2: why does SGD achieve a slightly lower test error than KFC-pre?
The small differences in test error are mostly an artifact of metaparameter tuning. In particular, the algorithmic metaparameters (e.g. learning rate, damping) were chosen to achieve the best optimization performance on the *training* set. For SGD, these also happened to perform best on the test set, so our setup was conservative and favorable to SGD. For KFC-pre, however, we could obtain much lower test error fairly quickly by reducing the mini-batch size (at the expense of higher training error). We expect that tuning regularization metaparameters (e.g. dropout) jointly with the algorithmic ones would make it possible to achieve better test error than one would get with SGD.

R2: can you design a network architecture where IAD, SUD, etc. hold exactly?
In a sense, conv nets are already “designed” to achieve SUD: as shown by the empirical analysis in Appendix D.1 and Figure 3, max-pooling layers have the effect of decorrelating the derivatives. IAD, on the other hand, seems difficult to guarantee; if the activations were truly independent of the derivatives, it is hard to see how any learning could take place.

R3: try to simplify the notation (e.g. drop the bar notation)
Unfortunately, messy notation is probably inevitable whenever we need to write out conv net computations in full. We have tried hard to choose clean and intuitive notation, and to move non-essential material to the appendix. In the case of the bar notation, we feel it is necessary because we often need to refer to the activations themselves without the homogeneous coordinate appended.

R3: did you use classical or Nesterov momentum?
Our experiments used classical momentum, both for SGD and for KFC-pre. Nesterov momentum can be used with either method, and we will include experiments with it in the final version.
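For reference, the two update rules differ only in where the gradient is evaluated; writing \mu for the momentum decay, \epsilon for the learning rate, and h for the objective:

    classical:  v_{k+1} = \mu v_k - \epsilon \nabla h(\theta_k),            \theta_{k+1} = \theta_k + v_{k+1}
    Nesterov:   v_{k+1} = \mu v_k - \epsilon \nabla h(\theta_k + \mu v_k),  \theta_{k+1} = \theta_k + v_{k+1}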
R3: were inverses computed on the CPU, and did this incur significant overhead?
Yes, we computed the inverses on the CPU, and this incurred some overhead. However, we only recomputed the inverses periodically, which kept the overhead manageable (see Appendix B.3). In principle, one could make the overhead nearly free by computing the inverses asynchronously, while the GPU is still performing the other computations, but we did not exploit this in our experiments. Note also that most of our results are given in terms of wall-clock time, so the overhead of computing the inverses is already factored into our results.
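To make the asynchronous scheme concrete, here is a minimal Python sketch; the names (invert_factors, AsyncInverter) are hypothetical, and this is not the code used in our experiments. The idea is simply to refresh the damped factor inverses in a background CPU thread, so that the main training loop never blocks and always uses the most recent (possibly slightly stale) inverses.

    # Illustrative sketch of asynchronous inverse computation (not our implementation).
    import threading
    import numpy as np

    def invert_factors(factors, damping):
        # Invert each damped Kronecker factor on the CPU.
        return {name: np.linalg.inv(A + damping * np.eye(A.shape[0]))
                for name, A in factors.items()}

    class AsyncInverter:
        # Holds the most recent factor inverses; refreshes them in a
        # background thread so the main (GPU) loop never waits on the CPU.
        def __init__(self, damping):
            self.damping = damping
            self.inverses = None      # possibly slightly stale
            self._thread = None

        def maybe_refresh(self, factors):
            # Start a new inversion only if the previous one has finished.
            if self._thread is None or not self._thread.is_alive():
                self._thread = threading.Thread(
                    target=self._invert, args=(dict(factors),), daemon=True)
                self._thread.start()

        def _invert(self, factors):
            self.inverses = invert_factors(factors, self.damping)

In the training loop, one would call maybe_refresh(current_factors) every few hundred updates and read inverses when applying the preconditioner.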