We thank all reviewers for their positive and constructive feedback. Below we comment on the main points raised.

* Concerns regarding novelty (R5, R6)
The idea of distilling knowledge from a network (or an ensemble of networks) is not new, and we made this clear in our paper (Sec. 4.3). Our work differs from the existing literature by targeting the specific problem of distilling an ensemble of networks that has been implicitly trained via dropout regularization. Given the impact that dropout has within the ML community, we think this problem is worth dedicated investigation. Our contributions to advancing the state of the art include a novel algorithm that sidesteps the intractability of the dropout distillation procedure via stochastic gradient descent, and a theoretical result that characterizes the loss functions under which the proposed distillation procedure is legitimate. In addition, we provide focused experiments that demonstrate the practical validity of the proposed method, which in principle allows one to improve the prediction accuracy of any pre-trained network (with dropout regularization) at no additional test-time cost.
R6 pointed out the reference to Bayesian Dark Knowledge (BDK), which tackles the problem of distilling Bayesian posterior predictive distributions (pioneered by Snelson & Ghahramani in 2005) in an online fashion. The algorithm is close in spirit to ours, but the problem addressed is in general different, albeit related. Moreover, our theorem could in principle be used to extend the algorithm in BDK to a wider range of divergence functions, as we did in our work. We will include the suggested reference in our paper and discuss the differences.

* Small improvements, performance loss under compression, missing baseline (R5)
The experiments we conducted do not exhibit large margins of improvement for the dataset/network combinations we analysed, but we do show consistent improvements. The small improvements are due to the relatively small gap between MC dropout and standard dropout (our score is in general bounded by those two values). Nevertheless, there might be application scenarios where larger improvements could be achieved, and it would be interesting to characterize such scenarios in future work.
When it comes to model compression, performance losses are expected, in particular at high compression rates. Nonetheless, we think that the results we obtain by compressing a single layer of the network are actually good (fairly small losses). In contrast, the results we obtain when we compress all layers are worse than in the single-layer case, under the same number of parameters. We believe the reason is two-fold: the topology we get by compressing all layers uniformly is probably suboptimal, and the way we initialize the weights (we start from the "standard dropout" network and drop units from there) might trap the optimization dynamics in a suboptimal local minimum. Note that the initialization from "standard dropout" weights is sensible if the network topology stays unchanged or if layers up to a specific level are preserved (as in the case of single-layer compression).
As for the suggested baseline for model compression, we have added it to our experiments. The results we obtain still underline the validity of the proposed method.

* Make Thm1 more accessible (R6)
We will give more insights in order to make the theorem more accessible.
* Why is using unseen data the way to go for distillation? (R6)
Training the distilled network on the same training data used to train the original network might be less effective, because the original network delivers crisper predictions on the training set it was trained on (even when dropout regularization is used), thus limiting the amount of "dark knowledge" that can be transferred. For this reason, distilling on previously unseen data (test data also falls into this category) is in general the better choice. We will make this clearer in the paper.

* Some details missing (R7)
The disagreement function adopted in our experiments is the KL-divergence (see the sketch at the end of this response). We will make this clear in the experimental section.

* Computational complexity (R7)
We will discuss the complexity of the distillation procedure.
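For concreteness, the distillation procedure with the KL disagreement can be sketched as follows. This is a minimal, hypothetical PyTorch-style illustration, not our exact implementation: the names (distillation_step, teacher, student, x_unseen, num_mc_samples) are placeholders, and the Monte Carlo sample count is arbitrary. The teacher keeps dropout active and its averaged predictive distribution on unseen data is matched by a deterministic student via stochastic gradient descent.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, x_unseen, num_mc_samples=20):
    """One SGD step of dropout distillation with the KL disagreement.

    Hypothetical sketch: names and the number of MC samples are illustrative.
    """
    # Teacher: keep dropout active (train mode) and average the softmax
    # outputs over Monte Carlo samples to approximate the implicit ensemble.
    teacher.train()
    with torch.no_grad():
        mc_probs = torch.stack([F.softmax(teacher(x_unseen), dim=1)
                                for _ in range(num_mc_samples)]).mean(dim=0)

    # Student: deterministic forward pass (dropout disabled), trained to
    # minimize KL(teacher || student) on previously unseen inputs.
    student.eval()  # disables dropout; gradients still flow in eval mode
    log_q = F.log_softmax(student(x_unseen), dim=1)
    loss = F.kl_div(log_q, mc_probs, reduction='batchmean')

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```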