We thank the reviewers for their constructive comments and for recognizing the significance of our paper. We will revise the paper to reflect all review comments and will release our code.

[Choice of VGGNet] R1: We would like to clarify that we used the 16-layer VGGNet (Model-D in [1]), which is not the weakest but one of the most popular configurations. In Table 1, we report validation errors under two single-scale evaluation schemes: the single-crop (center-crop) scheme (not reported in [1]) and the convolutional scheme (described in Section 3.2 of [1]). We focus on the single-crop results because this simple protocol lets us examine the trade-off between training and validation performance without complicated post-processing. In our experiments, the trained VGGNet-D model (publicly released by [1]) achieved 10.07% error under the single-crop scheme and 8.94% under the convolutional scheme, the latter comparable to the 8.8% in Table 3 of [1]. The best number reported for Model-D in [1] is 8.1%, but that model was trained and tested with a different resizing and cropping method and is therefore not comparable to our results. We will revise the paper to improve clarity and avoid confusion.

[1] Simonyan & Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition" (ICLR'15).

[Effects of decoders] R1: The decoding pathways showed the joint effects of better optimization and regularization. As evidence for better optimization (Table 1 and L795-806), our models reduced the training errors themselves, achieving better local optima on the training set than the baseline (after convergence). Some variants of our models (SWWAE-all) showed stronger regularization effects than others (SWWAE-first), as they achieved relatively higher training errors and lower validation errors. Overall, we suspect that the auxiliary decoding pathway is chiefly beneficial for finding better local optima, with a smaller positive effect as a regularizer. We will provide additional control experiments to clarify this.

[Insights from reconstruction] R1: The surprisingly high-quality reconstructions show that high-level features can preserve almost all of the information in the input image once the location details are supplied by the pooling switches. Taking this as a desirable property, we were motivated to finetune the classification network together with the decoding pathways to learn better representations, which indeed improved classification accuracy. We will move a few images (e.g., two rows each from Figs. 4 and 5) to the supplementary material to leave more space for describing the experiments.

[Feature changes] R1: The element-wise relative change of the filter weights was ~35% on average. A small portion of the filters showed stronger contrast after finetuning. Qualitatively, the finetuned filters retained the pretrained visual shapes.
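To make the reported statistic precise, here is a minimal sketch of how such an element-wise relative change can be computed (Python/NumPy; the function and variable names are illustrative, not from our code):

    import numpy as np

    def relative_change(w_pre, w_ft, eps=1e-8):
        # Mean element-wise relative change |w_ft - w_pre| / |w_pre|
        # between a layer's filter weights before (w_pre) and after
        # (w_ft) finetuning; eps guards against division by zero.
        return np.mean(np.abs(w_ft - w_pre) / (np.abs(w_pre) + eps))

    # Sanity check: a uniform 35% rescaling of the filters yields ~0.35.
    w_pre = np.random.randn(64, 3, 3, 3)  # e.g., one conv layer's filters
    print(relative_change(w_pre, 1.35 * w_pre))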
[Training speed] R1: With batch_size=16, one training iteration takes 2.5 s on a single Titan X. A ~70% relative improvement could be achieved in ~1 day for Step 4 (L550). The numbers in Table 1 are from models trained for 3-5 days in parallel on two Titan X GPUs. We will add a learning curve and more details for these experiments.

[Model selection] R2: We agree that the performance of the different variants is comparable. However, since the computational costs are similar for training and identical for testing, the best available architecture can be chosen depending on the task. For example: 1) When the decoding pathways serve spatially corresponding tasks such as reconstruction (as in our paper) and segmentation, we can use the SWWAE. For more general objectives such as predicting future frames, where pooling switches are not transferable, ordinary SAEs still achieve competitive performance. 2) S(WW)AE-first has fewer hyper-parameters than S(WW)AE-all and can be trained first for a quick hyper-parameter search, then switched to the *-all variant for better performance.

[Hyper-parameters] R2: For each layer, we chose the learning rate that led to the largest decrease in the reconstruction loss over the first 2000 iterations. The layer-1 loss is computed against RGB values normalized to [0,1], which differ in scale from the intermediate features, and the losses are not normalized by feature dimension. The balancing factors for the per-layer losses therefore vary so as to make the terms comparable in magnitude (see the sketch at the end of this response).

[Ladder network] R3: As discussed in the supplementary material, a large-scale ladder network was more difficult to train than our models, potentially because it has more hyper-parameters. We leave this as future work.

[More unlabeled data] R3: Incorporating more unlabeled data is an important topic. Given the encouraging results in our paper, more advanced unsupervised objectives can be explored for the decoding pathways.
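As referenced under [Hyper-parameters], the following is a minimal sketch of the per-layer loss balancing (Python/NumPy; the function names and the particular weight-initialization heuristic are illustrative assumptions, not our released code):

    import numpy as np

    def balanced_recon_loss(targets, recons, weights):
        # Weighted sum of per-layer L2 reconstruction losses. targets[0]
        # is the RGB image normalized to [0,1]; deeper targets are
        # intermediate feature maps. Each loss is summed (not averaged)
        # over its elements, so the balancing weights must absorb the
        # differences in scale and dimensionality between layers.
        return sum(w * np.sum((t - r) ** 2)
                   for w, t, r in zip(weights, targets, recons))

    def initial_balancing(targets, recons, eps=1e-8):
        # One simple heuristic: scale each layer's loss to ~1 at the
        # start of training so all terms are comparable in magnitude.
        return [1.0 / (np.sum((t - r) ** 2) + eps)
                for t, r in zip(targets, recons)]

    # Usage: weights = initial_balancing(targets, recons)
    #        loss = balanced_recon_loss(targets, recons, weights)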