We sincerely thank the reviewers for their comments. All questions and concerns are addressed below.

R4: 1. Yes, the architecture used for the fine model (Table 2), which we describe in detail in Appendix 6.2, is very similar to the model of Goodfellow et al. (2013). The main difference is that we do not use maxout units; however, as reported in Jaderberg et al. (2015), this is not critical, since a similar architecture using only ReLU units achieves the same results on cropped SVHN images. 2. The strong baseline you refer to would presumably be trained with information not accessible to our learner. As such, it represents an "unfair" baseline. Moreover, digit locations are not available in Cluttered MNIST, while for SVHN, where digit locations are available, using them would make the problem easier. Nevertheless, this baseline could serve as an interesting upper bound for gauging the performance of our models. We can also refer to the performance of models trained on cropped images, e.g., Goodfellow et al. (2013), Ba et al. (2014), and Jaderberg et al. (2015).

R5: 1. Your understanding is correct. Computing the saliency map requires an additional forward and backward pass, but only through the top layers. Hence, we need to ensure that the top layers incur far fewer computations than the fine layers (which is typically the case). More precisely, to gain computational savings with the DCN, the total cost of computations in 1) the coarse layers on the full input, 2) the fine layers on the selected patches, and 3) the saliency map must be significantly less than that of the fine layers on the full input. We observe a 2.9x wall-clock speed-up on MNIST (as reported in 4.1.3) and a 5.8x average speed-up on large SVHN images (as reported in 4.2.4). 2. The intuition behind the hints objective is to ensure that the refined representation vectors fed into the top layers are homogeneous.
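For concreteness, one common form of such a hints objective is a squared L2 penalty drawing the coarse and fine representation vectors together at the attended positions. The sketch below is illustrative only; the function name and shapes are our own, not taken from the paper:

```python
import numpy as np

def hints_loss(coarse_feats, fine_feats):
    """Illustrative hints-style objective: mean squared L2 distance
    between coarse and fine representation vectors, encouraging the
    two to be homogeneous. Both inputs have shape
    (num_patches, feature_dim)."""
    return np.mean(np.sum((coarse_feats - fine_feats) ** 2, axis=1))

# Toy usage: identical representations incur zero penalty.
coarse = np.ones((4, 8))
fine = np.ones((4, 8))
print(hints_loss(coarse, fine))  # 0.0
```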
We show that it can indeed draw fine and coarse vectors to have close values (Figure 2), and that it improves the model's generalization, reducing the test error from 1.71% to 1.39% (Table 1).

R6: 1. We believe that our attention mechanism would still work when coarse predictions are confident but wrong. Devoting more capacity through the fine layers allows the DCN to rectify its prediction: it can remove probability mass from the incorrect class and increase its uncertainty, making the model more likely to reach the right decision. 2. We do not observe a clear drawback to using the hints objective. However, we need to make sure it does not override the signal coming from the classification objective, which provides a direct signal for optimizing the coarse layers on the given task and thereby helps in attending to task-relevant input regions. 3. We have so far focused on models that dynamically assign more capacity by replacing initial coarse representations with fine ones. This has the advantage of allowing the model to change the number of replaced features dynamically, either during training or between training and testing. We agree that this is an interesting research direction that can be explored in future work. 4. We have not contacted the other authors about this point, but we hypothesize that they did not fully explore increasing the depth of their baseline models.
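The computational-savings condition from our reply to R5 (point 1) can be sketched as a simple cost comparison. All numbers below are hypothetical placeholders in arbitrary FLOP units, not measurements from the paper:

```python
def dcn_saves_compute(coarse_full, fine_patches, saliency, fine_full):
    """Illustrative check of the condition for computational savings
    in a DCN-style model: the combined cost of the coarse layers on
    the full input, the fine layers on the selected patches, and the
    saliency computation must be less than the cost of running the
    fine layers on the full input."""
    return coarse_full + fine_patches + saliency < fine_full

# Toy numbers (illustrative only): a cheap coarse pass, fine layers
# on a few small patches, and a light saliency pass vs. a full fine pass.
print(dcn_saves_compute(coarse_full=10, fine_patches=30, saliency=5,
                        fine_full=100))  # True
```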