We appreciate the recognition and detailed comments from all reviewers. Below are our responses to their concerns.

[#R7] The significance of network morphism. [#R7] "NetMorph outperforms Raw training by only modest amounts."
1) NetMorph is a powerful tool for building better models. It is very difficult to further improve a well-trained DCNN model (also pointed out by [#R8]). E.g., ResNet [He2015a], the winner of all three ImageNet 2015 tasks, achieved only a 1.36% accuracy improvement by extending the model from 50 to 152 layers, with an excellent design and implementation. The VGG team achieved only a marginal improvement, within statistical variance, when extending the network from 16 to 19 layers. Using NetMorph, however, we easily achieve a 1.84% accuracy gain with only three simple layers added (Table 1).
2) NetMorph provides a very efficient way to explore and design better model structures. It usually takes weeks or even months to train an existing model, and the time and resource cost can be tenfold when designing a new, powerful model. NetMorph can significantly reduce this exploration cost. E.g., within two weeks NetMorph has already helped us find new model structures based on ResNet: a 1.7% accuracy improvement over ResNet20 (with 1.1x computation); a morphed ResNet56 that achieves better accuracy than ResNet110 with only half the computation; and a further 0.5% accuracy gain over ResNet110, with training not yet finished.
3) NetMorph significantly reduces training time and GPU memory cost. The proposed 19-layer model is trained 15x faster than training from scratch.
4) NetMorph internally regularizes the network and has the potential to prevent overfitting.
Moreover, as pointed out by [#R6], it also has wide applications, such as incremental learning for lifelong learning systems.

[#R7] "Continue training the original unmorphed model" would easily improve the performance.
This is usually not the case. Experience shows that, for a well-trained model, continued training only yields a similar result within statistical variance, rather than a result "better than Raw but worse than NetMorph". E.g., [He2015a] extended the network from 50 to 152 layers for a 1.36% accuracy improvement, rather than simply continuing to train the 50-layer model.

[#R7] "Do the results of Fig. 8 still apply if cifar10_full is used instead?"
cifar10_full in Caffe achieves only 81.80% accuracy. As shown in Fig. 8, NetMorph improves the performance up to 84%, which is much higher than both cifar10_quick and cifar10_full.

[#R7] "How do you know that VGG-16(baseline) had converged? VGG-16(NetMorph) may have better performance simply because it had been trained for longer and not because its capacity had been enhanced by the NetMorph procedure."
We did everything we could to make it converge. Note the performance ordering in Table 1: VGG-16(NetMorph) > VGG-16(multi-scale) > VGG-16(baseline). VGG-16(multi-scale) is obtained from the Caffe model zoo; we believe its original authors made every effort to improve its performance, so it has certainly converged. In our experience, multi-scale training can contribute up to a 5% improvement over single-scale (baseline) training. Thus it is impossible for VGG-16(baseline) to reach the performance of VGG-16(multi-scale) by "continue training" alone, and the same holds for VGG-16(NetMorph).

[#R7] "More carefully chosen learning rate schedules for the Raw model might close these gaps."
We adopted a uniform learning-rate schedule, without any bias toward a specific model.
We also carefully tuned the learning-rate schedules for both, but this did not reduce the gap.

[#R6] The function preservation property of NetMorph, and the sharp drop/increase in Fig. 8-9.
NetMorph does preserve the function and performance of the parent network; the accuracies at iteration 0 in Fig. 7-9 are the parent networks' performances (a small numerical sketch of this property is given at the end of this response). The sharp drop and increase are caused by changes in the learning rate. Since the parent network was trained with a much finer learning rate (1e-5) at the end of its training, we restart the morphing process at a coarser rate (1e-3), hence the initial sharp drop. At 20K iterations the learning rate is reduced to 1e-4, hence the sharp increase; at 30K iterations it is further reduced to 1e-5, and the increase is less significant. It is worth noting that with a constant learning rate there is no initial drop, as shown in Fig. 7, which was morphed with a constant learning rate of 0.01.

[#R6] "How well is knowledge retained after morphing if the training data distribution drifts after morphing?"
This is an interesting topic for future work. Currently we focus on the same dataset before and after morphing.

[#R8] It is not clear why the random permutation of c_l is needed.
We do not want the zero-initialized entries clustered together; the random permutation of c_l avoids this (see the sketch below).
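To make the function-preservation argument and the role of the random permutation concrete, here is a minimal NumPy sketch. It is purely illustrative: it morphs a fully-connected layer rather than a convolution, the sizes and the identity/zero filling of the child layers are our own illustrative choices rather than the exact decomposition in the paper, and the non-linearity between the morphed layers is omitted. The point is only that the two child layers compute exactly the same function as the parent layer, and that permuting the c_l intermediate channels leaves that function unchanged while scattering the zero-initialized entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a parent fully-connected layer G (d_out x d_in) is
# morphed into two child layers F2 @ F1 with c_l intermediate channels.
d_in, d_out, c_l = 16, 8, 24                  # c_l >= d_in, chosen for illustration
G = rng.standard_normal((d_out, d_in))        # parent layer, assumed well trained

# One simple (illustrative) filling: F1 embeds the input unchanged and pads the
# extra channels with zeros; F2 copies G on the first d_in channels and is zero
# elsewhere, so F2 @ F1 == G exactly.
F1 = np.zeros((c_l, d_in))
F1[:d_in, :] = np.eye(d_in)
F2 = np.zeros((d_out, c_l))
F2[:, :d_in] = G

x = rng.standard_normal(d_in)
assert np.allclose(G @ x, F2 @ (F1 @ x))      # the parent function is preserved

# Random permutation of the c_l intermediate channels: a permutation matrix P
# satisfies P.T @ P = I, so the composed function is unchanged, but the
# zero-initialized channels are no longer clustered at the end.
P = np.eye(c_l)[rng.permutation(c_l)]
F1p, F2p = P @ F1, F2 @ P.T
assert np.allclose(F2p @ F1p, G)              # same function, zeros interleaved
```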
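For completeness, the step learning-rate schedule described above for Fig. 8-9 can be summarized as the sketch below; the values 1e-3/1e-4/1e-5 and the 20K/30K step points come from the text, while the function name and framework-agnostic form are our own.

```python
def morph_lr(iteration):
    """Step schedule used when training the morphed networks in Fig. 8-9.

    The parent network finished training at a fine rate of 1e-5, so restarting
    at the coarser 1e-3 explains the initial sharp drop; the steps at 20K and
    30K iterations explain the subsequent increases in accuracy.
    """
    if iteration < 20_000:
        return 1e-3
    elif iteration < 30_000:
        return 1e-4
    return 1e-5
```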