Paper ID: 279
Title: Network Morphism

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a new method for "morphing neural networks", i.e. adding neurons and/or layers to a trained parent network to create a bigger child network. The proposed method preserves the parent network's function so that performance is not affected. The bigger child network can then be trained to take advantage of the new neurons, but requires much less training since it can leverage all the training of the parent network. Non-linear activation functions are nicely taken into account with a new class of activation functions which allow for the identity as well as for any other non-linear activation function. The method makes it possible to change the architecture of an already well-performing model to get a better model, which is interesting for exploring different network architectures quickly. The experiments show that the proposed method can improve performance on very big networks and on a difficult problem (ImageNet).

Clarity - Justification:
The paper is very clear, with a few English mistakes that do not affect comprehension.

Significance - Justification:
The paper makes a significant and novel contribution by allowing the structure of a network to change in several ways after it has been trained. Although the performance gains obtained after such changes are modest, improving a huge model quickly on ImageNet (a hard problem) is impressive.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Section 3.4.1: It is not clear why the random permutation of c_l is needed. What would be the problem with grouping the new weights together?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a set of methods for increasing the capacity of a network while maintaining its trained performance. The methods address adding linear layers (including convolutional layers), adding non-linear activations, increasing layer width or kernel size, and adding subnets. The methods are compared closely to IdMorph, from the Net2Net paper, and shown to give better results in CIFAR and ImageNet experiments.

Clarity - Justification:
The motivation, approach, and results are mostly clear.

Significance - Justification:
This is a valuable tool for deep learning researchers and practitioners. It may allow for significant new developments in training large networks, and there is also obvious relevance for continual learning and transfer learning research.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
NetMorph seems to be a powerful approach and the empirical results are strong. Although there is a clear overlap with the recently published Net2Net paper, there are significant improvements in this algorithm over IdMorph. IdMorph left a lot of zero values, which can be difficult to train; NetMorph specifically focuses on reducing the number of zero elements. NetMorph also proposes a general family of parametric activation functions such that a nonlinear activation can be continuously transformed between a linear identity transform and the desired final nonlinear activation. Thus an inserted nonlinear activation can be initialized as a linear identity and then adapted over the course of learning to its final shape.

The authors could have provided a better example of the morphing process and its effects. A simple concrete case that would exemplify the algorithm would be very helpful.
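For instance, even a sketch along the following lines (my own illustration with made-up layer sizes, not taken from the paper; the paper's actual algorithm operates on convolutional kernels via a deconvolution-style solve, whereas I use a plain least-squares factorization of a fully connected layer) would convey the key property: the parent layer is split into two layers whose composition reproduces the parent mapping exactly, and, unlike an IdMorph-style initialization, almost none of the new weights start at zero.

    import numpy as np

    rng = np.random.default_rng(0)

    # A trained "parent" layer: y = W @ x, with W of shape (out=8, in=5).
    W = rng.standard_normal((8, 5))

    # NetMorph-style split (sketch): factor W into two layers with a wider
    # intermediate representation so that W2 @ W1 = W. W1 is drawn randomly
    # and W2 is obtained by least squares, so the composition reproduces the
    # parent exactly while almost no new weight is zero.
    hidden = 10
    W1 = rng.standard_normal((hidden, 5))
    W2 = np.linalg.lstsq(W1.T, W.T, rcond=None)[0].T  # solves W2 @ W1 = W

    # IdMorph-style split, as I understand it: pad with zeros and identity.
    # Also function-preserving, but most of the new parameters start at zero.
    W1_id = np.vstack([W, np.zeros((hidden - 8, 5))])
    W2_id = np.hstack([np.eye(8), np.zeros((8, hidden - 8))])

    x = rng.standard_normal(5)
    print(np.allclose(W2 @ W1 @ x, W @ x))        # True: child == parent
    print(np.allclose(W2_id @ W1_id @ x, W @ x))  # True, but sparse
    print((W1 == 0).mean(), (W1_id == 0).mean())  # zero fraction: 0.0 vs 0.2

A worked example of this kind, with the actual morphing equation and a small convolutional case, would make the algorithm much easier to follow.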
The experimental sections are quite rushed. It is unclear to me what is happening in the learning curves in Figures 8 and 9. I can only guess that the sharp increase in accuracy at iteration 20,000 is due to a change in learning rate, but this is nowhere mentioned in the text. There is also a significant drop in performance from the stated accuracy of the parent net (e.g. 0.7815 in Figure 8a) that only recovers to slightly ahead of the 'raw' network. The authors do not address these factors. The stated contribution of the paper is to 'morph a well-trained neural network to a new one so that its network function can be completely preserved', yet there is actually a drop in performance that can only be fixed through continued training; this has to be addressed and explained. What are the factors that will hasten or slow the post-morphing performance recovery?

There is an obvious application of NetMorph in the field of continual learning. This could be addressed. One question is how well knowledge is retained if the training data distribution drifts after morphing.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a method for morphing the architecture of a neural network while preserving its input/output mapping, allowing a network architecture to be made bigger during the training process. Compared to the previously proposed Net2Net technique, the proposed method is more general and can be used to insert additional layers, add filters to existing layers, or increase the kernel size of existing layers; in addition, the proposed method can be used with non-idempotent activation functions. Experiments using the proposed method are performed on MNIST, CIFAR10, and ImageNet.

Clarity - Justification:
On the whole the paper is well-written and easy to follow. The term "deconv" should be explained a bit more; I assume that you mean "deconv" in the classical sense as the inverse of convolution, but this should be clarified since many recent papers use the term "deconv" to refer to fractionally strided convolutions.

Significance - Justification:
I am not convinced that network morphing is a significant and important problem. See detailed comments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall I think that the method presented in Section 3 is novel and interesting, and has clear benefits compared to prior work. However, I find the experiments unconvincing, and am not sure about the practical benefits of network morphisms. I also have a pedantic question about terminology.

METHOD

I find the idea of network morphisms intellectually interesting, as they are quite different from both parent-teacher networks and traditional finetuning. The method presented in Section 3 seems like a very sensible way to approach the problem. Compared to the existing Net2Net technique, the proposed method is more general: it can handle more types of network transformations, and it can be applied to networks with non-idempotent activation functions.
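As an aside, to make the non-idempotent case concrete, the following is roughly how I picture the parametric activation idea (my own convex-combination sketch; the paper's exact parameterization may differ): the inserted activation starts as the identity, so the morphed network's outputs are unchanged, and the parameter can then be annealed or learned toward a target nonlinearity such as tanh, which is not idempotent.

    import numpy as np

    def parametric_activation(x, a, phi=np.tanh):
        # Blend between the identity (a = 1) and a target nonlinearity phi
        # (a = 0). At a = 1 the inserted layer is exactly linear, so the
        # child network reproduces the parent's outputs; a can then be
        # decreased, or learned, during continued training.
        return a * x + (1.0 - a) * phi(x)

    x = np.linspace(-2.0, 2.0, 5)
    print(np.allclose(parametric_activation(x, a=1.0), x))           # identity
    print(np.allclose(parametric_activation(x, a=0.0), np.tanh(x)))  # target phi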
EXPERIMENTS

However, I am not convinced of the practical utility of network morphisms. While I appreciate the motivation of training larger models more quickly and of exploring different architectures more efficiently, I don't think that these claims are well supported by the experiments. All experiments show NetMorph outperforming Raw training by only modest amounts, and in practical situations I'm not convinced that these small gains would be worth the drastic increase in system complexity. I believe that additional experiments could help to make the results more convincing.

In Figure 7, how does NetMorph compare with training the MLP architecture from scratch (Raw in Figure 8)?

My understanding is that the architecture and training schedule for the cifar10_quick model used in Section 4.2 were not designed to achieve state-of-the-art accuracy, but rather to quickly achieve a respectable accuracy. Do the results of Figure 8 still apply if the cifar10_full model is used instead?

In Figure 8(a)-(d) the performance gaps between Raw and NetMorph are quite modest; I am not convinced that these gaps are significant, and believe that more carefully chosen learning rate schedules for the Raw model might close them. The Raw curve in Figure 8(c) especially looks as if learning rate decay was applied prematurely and its accuracy would have continued to rise under the original learning rate. In these experiments, were learning rates and learning rate decay schedules cross-validated separately per model, or were the same learning rates and decay schedules used for all models?

In all of these experiments one might argue that Raw has actually converged faster than NetMorph, since NetMorph was initialized from a model that had already been trained for tens of thousands of iterations. An additional experiment that could help alleviate this concern would be to continue training the original unmorphed model; I would expect this to perform better than Raw but worse than NetMorph to support the claims of the paper.

I have similar concerns about the ImageNet experiment. How do you know that VGG-16 (baseline) had converged? VGG-16 (NetMorph) may have better performance simply because it had been trained for longer, not because its capacity had been enhanced by the NetMorph procedure; as a baseline I would like to see the results of continuing to train the VGG-16 (baseline) model for the same number of iterations as the VGG-16 (NetMorph) model.

ALGEBRAIC CONNECTION

As an admittedly pedantic point, I don't fully understand the significance of the connection between algebraic morphisms and the proposed method. In general, a morphism is a map that preserves algebraic structure; for example, in the category of groups a morphism f: G -> H satisfies f(x*y) = f(x)*f(y) for all x, y in G. The precise connection between this notion and the proposed method is not discussed in the paper, but I can see the following: let G be the set of all networks of one architecture and H the set of all networks of another architecture; then a network morphism f: G -> H satisfies f(g) = g for all g in G, where elements of G and H are viewed not as neural networks but as functions from inputs to outputs. This seems trivial; is there a more profound algebraic interpretation that I've overlooked?

=====