Paper ID: 52
Title: Dropout distillation

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
Though training a neural net using dropout involves using sampled dropout masks, at test time the hidden unit activities are scaled deterministically based on the dropout probabilities. This "standard dropout" approach is used for efficiency reasons, and though it has been shown to be an accurate approximation in some cases, in other cases it can result in a significant loss in accuracy compared to Monte Carlo (MC) dropout, which involves averaging over many sampled masks. The authors aim to achieve the best of both worlds by distilling the implicit ensemble averaged over by MC dropout into a single deterministic model, usually having the same architecture. The experiments show that the resulting model tends to make considerably better predictions than standard dropout, though it rarely matches the performance of MC dropout.

Clarity - Justification:
The paper is well written and is very clear.

Significance - Justification:
This is a solid and interesting work, even if it is not completely original. The proposed approach shares many similarities with the method from Bayesian Dark Knowledge of Korattikara et al., where the predictive distribution computed using Monte Carlo samples is distilled into a neural network. The authors should explain how the two approaches are related and how they differ. Despite this algorithmic similarity, I think the problem addressed in this paper is sufficiently different to justify a separate paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is a bit more mathematical than it needs to be, which might limit its audience in the deep learning community. However, the algorithm itself is sufficiently clearly described that this is not a serious concern. Still, it would be good to explain the essence of Theorem 1 and its proof in a more accessible manner, because the result is interesting and possibly initially counterintuitive.

I found the discussion of how to select the set of inputs to use for distillation in Section 4.2 somewhat weak. Surely no one would seriously consider using random Gaussian noise to generate such a set. It is also unclear to me why the authors think that using previously unseen samples is the way to go. If we simply want to compress/distill the trained model, wouldn't we want to make sure that the resulting model matches its predictions well on the training set (and the test set if available)?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The main contributions are:
- This paper proposes a new way of creating an inference network after dropout training. Instead of just scaling the weights (the standard approach), a separate inference network is trained on the input-output pairs generated by running the dropout network stochastically. This can be done on unlabelled data, so it should not present new overfitting issues.
- This method is shown to work better than scaling the weights by 1-p.
- This can be used to train a more compact inference network, since the inference network need not have the same architecture as the dropout network.
- Experiments show modest but consistent gains in performance on MNIST/CIFAR-10/100.

Clarity - Justification:
The paper proposes a simple technique for learning a better inference network. The paper is well-written with clear explanations.
Significance - Justification:
The proposed technique is an application of transferring knowledge from one network (or an ensemble of them) to another. This form of "distillation" or "model compression" has been previously explored (Hinton '14, Ba and Caruana '13) and is not very novel. However, the application in the context of dropout is novel, which adds some significance. The experimental results show that the improvements obtained by this procedure are very modest. Unless leveraging unlabelled data gives a huge boost, this contribution is likely to be incremental.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The main strength of the paper is that the proposed technique gives consistent improvements. It always reduces the error, bringing it closer to the "true" Monte Carlo error rate. Another strong point is that the paper tries to legitimize the distillation procedure with a theoretical analysis that establishes a relationship between the original loss function and the loss function used to transfer knowledge. It is shown that optimizing the transfer loss would help with the original loss.

The main weakness is that the improvements are quite small (Table 1). For model compression (Figure 3), there seems to be a significant loss in performance when the model is compressed. A baseline for these compressed models would have been a network trained from scratch with that many parameters. Including this baseline in Figure 3 would help clarify whether the compression gives a better model than training a model of that reduced size directly.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
When using dropout, at test time one has to choose between cheap inference by simply halving the weights and the costly but more accurate Monte Carlo approach. This paper proposes to train another network to distill the "dark knowledge" from the Monte Carlo ensemble. Experiments show somewhat marginal improvement over traditional cheap inference without increasing the test-time computational complexity.

Clarity - Justification:
The writing is fluent English and easy to read. Some details are missing, like the disagreement function l used in the experiments.

Significance - Justification:
The paper is an incremental advance which tackles a problem that is an interesting research topic.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The training-time computational complexity is not discussed.

=====
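The procedure summarized in the reviews can be made concrete with a minimal sketch, assuming a PyTorch-style setup: a dropout-trained teacher is kept in training mode so each forward pass samples a new mask, its softmax outputs are averaged to approximate MC dropout, and a deterministic student (possibly smaller, and trainable on unlabelled inputs) is fit to that average. The layer sizes, function names, and the choice of KL divergence as the disagreement term below are illustrative assumptions, not the authors' implementation.

# A minimal sketch of dropout distillation as the reviews describe it:
# an MC-dropout "teacher" provides averaged soft targets, and a
# deterministic "student" is trained to match them. All names, layer
# sizes, and the KL-divergence disagreement term are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mc_dropout_predict(teacher, x, n_samples=20):
    """Average softmax outputs over sampled dropout masks (MC dropout)."""
    teacher.train()  # keep dropout active so each pass samples a new mask
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(teacher(x), dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0)

def distill_step(student, teacher, x, optimizer, n_samples=20):
    """One update that fits the student to the MC-dropout average on x."""
    targets = mc_dropout_predict(teacher, x, n_samples)
    log_q = F.log_softmax(student(x), dim=1)
    loss = F.kl_div(log_q, targets, reduction="batchmean")  # disagreement
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: distill a dropout MLP into a smaller deterministic MLP using
# a batch of (possibly unlabelled) inputs.
teacher = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Dropout(0.5),
                        nn.Linear(1024, 10))
student = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
x = torch.randn(128, 784)
print(distill_step(student, teacher, x, optimizer))

Averaging softmax outputs (rather than logits) targets the Monte Carlo predictive distribution that standard-dropout weight scaling only approximates; in practice the student would be updated over many batches drawn from whatever distillation set the paper's Section 4.2 discusses.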