Paper ID: 1149
Title: Dynamic Capacity Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a novel way to dynamically allocate capacity to different parts of the input data. The idea is to have "low-capacity" and "high-capacity" subnets/feature extractors. The low-capacity nets are applied across the entire input and are supposed to guide which parts of the input the high-capacity nets should be applied to. There is a gradient-based attention mechanism that uses entropy as a saliency measure; the idea is that using entropy will encourage selecting the input regions that could affect the uncertainty of the model the most. There is an interesting twist whereby the "coarse" vectors and the "fine" vectors are required to have the same dimensionality because of the modeling constraints, and there is a cost that pushes them closer to each other (in one direction only). The paper gets very good results on the cluttered MNIST dataset, and the model does sensibly on the SVHN dataset, but in a setting which, sadly, no other paper uses.

Clarity - Justification:
Clear paper, nothing in particular to complain about.

Significance - Justification:
I think the attention model is new and interesting, or at least sufficiently different from previous approaches (esp. in considering contiguous parts of the input image).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It's a bit of a pity that for SVHN (which is a more interesting benchmark than cluttered MNIST) this is the only paper available in the setting that uses the entire image (as opposed to the cropped bounding box). Have the authors tried implementing a few of the available baselines that do well on SVHN (like maxout nets) and training them on the full images?

I think I would have liked a comparison with a strong baseline for the MNIST and SVHN tasks, whereby you train a regressor for the position(s) of the digits/characters and a classifier on the predicted box. Yes, it wouldn't be end-to-end, but it is still something that people would typically try in such a scenario.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a novel deep learning model capable of automatically adjusting its capacity depending on the complexity of the input. The motivation behind such an architecture is to reduce the computation cost without sacrificing performance. The authors apply their model to the problem of digit recognition in images. The idea is to have three models, namely a coarse model, a fine model, and a classifier on top. The coarse model is a smaller-capacity model which is applied to the entire image. The output of the coarse model is also used to obtain an attention map, which identifies the potential local areas in the image where the fine model needs to be applied. The so-called "fine model" is a more powerful deep network with larger capacity. Finally, the top-level classifier takes as input the feature maps generated by the coarse model and the fine model (which is only applied to a subpart of the image) and produces the classification. The authors apply their model to the problem of digit recognition on the cluttered MNIST dataset and the SVHN dataset. They show a modest improvement in performance over the scenario where the fine model is applied to the entire image.

Clarity - Justification:
The paper is very well written and easy to understand.
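To check my understanding of the coarse/attend/fine pipeline summarized above, here is a minimal sketch of how the entropy-gradient hard attention could be wired up. This is my own reconstruction, not the authors' code: the PyTorch framing, the non-overlapping patch grid, the top-k selection, the pooling, and the toy networks below are all placeholder assumptions; the "eq. 2 and eq. 3" saliency is only my reading of what the reviews describe.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    PATCH, K, DIM, NCLASS = 8, 4, 16, 10

    # Toy stand-ins for the three components described in the reviews.
    coarse_net = nn.Conv2d(1, DIM, kernel_size=PATCH, stride=PATCH)   # cheap, whole image
    fine_net = nn.Sequential(nn.Conv2d(1, DIM, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(DIM, DIM, PATCH))              # expensive, per patch
    classifier = nn.Linear(DIM, NCLASS)                               # top-level classifier

    def dcn_forward(x):
        """x: (B, 1, H, W) with H and W divisible by PATCH (grad mode assumed)."""
        B = x.size(0)

        # 1. Cheap coarse features over the whole image: one DIM-dimensional
        #    vector per PATCH x PATCH input region.
        coarse = coarse_net(x)                                 # (B, DIM, h, w)
        h, w = coarse.shape[2:]

        # 2. Entropy of the coarse-only prediction.
        p = F.softmax(classifier(coarse.mean(dim=(2, 3))), dim=1)
        entropy = -(p * (p + 1e-8).log()).sum()

        # 3. Saliency: gradient of the entropy w.r.t. the coarse features,
        #    reduced over channels (my reading of eq. 2 and eq. 3).
        grad, = torch.autograd.grad(entropy, coarse, retain_graph=True)
        saliency = grad.norm(dim=1).flatten(1)                 # (B, h*w)

        # 4. Hard attention: refine only the K most salient patches with the
        #    expensive fine network, overwriting the coarse vectors there
        #    (coarse and fine vectors share the same dimensionality).
        refined = coarse.flatten(2).clone()                    # (B, DIM, h*w)
        topk = saliency.topk(K, dim=1).indices
        for b in range(B):
            for idx in topk[b].tolist():
                i, j = divmod(idx, w)
                patch = x[b:b + 1, :, i * PATCH:(i + 1) * PATCH,
                          j * PATCH:(j + 1) * PATCH]
                refined[b, :, idx] = fine_net(patch).flatten()

        # 5. Classify on the mixed coarse/fine representation.
        return classifier(refined.mean(dim=2))

    logits = dcn_forward(torch.randn(2, 1, 64, 64))            # toy usage

The hard top-k selection itself is non-differentiable, but because the selected fine features feed directly into the classifier, the rest of the pipeline trains with ordinary backpropagation, which is consistent with the point below about avoiding REINFORCE-style estimators.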
Significance - Justification:
I think the ideas proposed in the paper are fairly interesting. The authors provide a novel way to attend to the parts of the image on which the fine model should focus. While a number of papers in the past have tried to do something similar, using the entropy measure to come up with a hard attention mechanism (eq. 2, 3) is something quite new. The benefit is that it enables one to do end-to-end training, as opposed to resorting to techniques like REINFORCE.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I like the paper very much. There are a couple of questions which I think need to be clarified.

1. If I'm understanding correctly, for every test image the model needs to compute the saliency map (eq. 2 and eq. 3). I wonder how expensive this operation is, and whether the time taken to compute it is comparable to applying the fine model to the entire image. The authors should really run a set of experiments in which they also report the wall-clock time taken to recognize digits for their proposed model and the other models.

2. The hints objective (eq. 6) seems to be a bit ad hoc and almost placed as an afterthought. What happens when you don't use such an objective in your model training?

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a model for image classification based on a two-step conditional processing: a first step of coarse computations is used to select "glances" in which a more expensive network refines the previous measurements. The resulting architecture can be trained end-to-end using standard gradient descent and produces state-of-the-art results on cluttered MNIST and SVHN.

Clarity - Justification:
The paper is clearly written, the figures contribute to the good understanding of the model, and the notation is lightweight. The numerical experiments are clearly described and offer enough detail so that they can be reproduced.

Significance - Justification:
The model presented here is an interesting variant of previous attention mechanisms for conditional evaluation. The main advantage of the proposed model is its relative simplicity and scalability, thanks to the fact that it only considers hard attention. The numerical results are very good. My only concern is precisely that in some regimes the attention model might be too brittle. The authors propose the entropy of the conditional probability computed by the coarse model as the basis for attention; see below for extended questions on that point.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Here are some questions/comments:

- The rule for the hard attention is to select the regions whose features produce the largest variations in the entropy of the output probability of the model. This rule is appropriate when the coarse prediction is already pointing towards the correct class, albeit with large uncertainty; indeed, reducing the entropy in that case increases the likelihood that the model will make the right decision. However, what happens when the coarse model makes a confident decision that is wrong? Isn't increasing/decreasing the entropy in that case arbitrary, if the goal is to find the perturbation that will bring the model into the right class?

- The "hint" version seems necessary in light of the previous description.
  Indeed, since the attention is based on a derivative of the uncertainty, this essentially forces the refinement to be a local perturbation of the coarse features. Is there any reason NOT to do the hint version?

- This brings us to the question that perhaps the model is still "wasting" computation, in the sense that the fine scale is by construction going to be extremely correlated with the coarse scale. This is in stark contrast with standard multiscale approaches, in which the fine-scale coefficients live in the residual space. I am wondering whether the fine-scale coefficients could be used in the final model by concatenating them with the coarse-scale ones. In order for the model to exploit the location of these fine-scale features, one could append the spatial index of the patches as extra features (which results in a negligible increase in dimensionality).

- I am a bit surprised to see that a simple feedforward model (the fine model in Table 1) does significantly better than the state of the art on that dataset. How come this baseline was not tested in previous works?

=====
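Regarding the "hints" objective discussed above (the one-directional cost in Review #1, question 2 in Review #2, and the "hint version" bullet in Review #3), here is a minimal sketch of how I read eq. 6, assuming it is simply a squared distance between the coarse and fine feature vectors at the attended patches. Treating the fine features as detached targets is my interpretation of "in one direction only", and the weight `lam` is a placeholder, neither is taken from the paper itself.

    import torch
    import torch.nn.functional as F

    def hint_loss(coarse_vecs, fine_vecs, lam=1.0):
        """coarse_vecs, fine_vecs: (N, D) feature vectors at the N attended patches."""
        # Squared error with the fine features treated as fixed targets
        # (detached), so only the coarse extractor is pulled toward the fine one.
        return lam * F.mse_loss(coarse_vecs, fine_vecs.detach())

    # Illustrative usage: total objective = task loss + hint term.
    c, f = torch.randn(4, 16, requires_grad=True), torch.randn(4, 16)
    loss = hint_loss(c, f)   # gradients flow into the coarse features `c` only

Under this reading, the hint indeed keeps the fine refinement a local perturbation of the coarse features, which is why an ablation without it (as asked in Review #2, question 2) would be informative.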