Paper ID: 720
Title: One-Shot Generalization in Deep Generative Models

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a recurrent generative model that achieves state-of-the-art results on various visual datasets and is able to perform one-shot generalization on a complex dataset of handwritten characters. The architecture is similar to that of DRAW, but uses spatial transformer networks to read from and write to the images.

Clarity - Justification:
The paper is well written, in terms of both the overall motivation and the specifics of the algorithm.

Significance - Justification:
The results here are very impressive. The experiments tackle the one-shot generalization tasks of Lake et al., for which those authors had used a carefully hand-constructed generative model that explicitly modeled the process of writing a character and used timing information. The system here achieves similar generalization when trained on raw pixels. Essentially, it learns from a large labeled dataset the manner in which one should generalize images. The architecture itself is mostly a combination of existing ideas, but getting them all to play nicely with each other must not have been trivial. The experiments are very thorough, covering a variety of tasks and datasets and including ablations to assess the importance of various design choices.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See above for specifics. Overall, I find this paper convincing and don't have any major objections. I expect this will be an influential paper.

It would be good to mention the computation time needed to train the model. Also, how big of an issue was hyperparameter selection?

It's claimed that "looking at the model generations in Figure 11, the model is able to pick up common features and use them in the generations." But to me, the samples look very similar to the unconditional ones in Figure 10. What do they share with the top row?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents several extensions of the DRAW model from "DRAW: A Recurrent Neural Network for Image Generation" by Gregor et al. (ICML 2015). Experiments are presented which compare the merits of these extensions in several settings. One proposed extension modifies the DRAW model for use as a sequential conditional generative model. The resulting model is tested on a "one-shot generalization" task, by training it to generate random instances of a character class conditioned on a random instance from that class.

Clarity - Justification:
The presentation of the material is clear and largely builds on the existing DRAW work by Gregor et al.

Significance - Justification:
The proposed extensions to the DRAW model are relatively straightforward. A version of the DRAW model adapted for sequential conditional generation was previously presented in "Data Generation as Sequential Decision Making" by Bachman et al. (NIPS 2015). More specifically, renaming x^k -> x' and x^u -> x in the LSTM-based imputation model of Bachman et al. produces a model with a form very similar to the one in the current paper. The model proposed in the paper under review differs from this earlier model by adding spatial attention and removing the inference-only hidden state. The use of conditional generation for "one-shot generalization" is clever.
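To make the lineage concrete, the family of models under discussion follows roughly the per-step loop sketched below. This is my own minimal PyTorch-style sketch, not the authors' architecture: plain linear layers stand in for the spatial read/write attention, the canvas update is additive, and all names are invented for illustration.

    # Minimal sketch of one step of a DRAW-style conditional generative loop.
    # Linear layers stand in for the attentive read/write operators discussed above.
    import torch
    import torch.nn as nn

    class ConditionalDrawStep(nn.Module):
        def __init__(self, img_dim=784, h_dim=256, z_dim=20):
            super().__init__()
            self.read = nn.Linear(img_dim + h_dim, h_dim)   # stand-in for attentive read of the context image
            self.prior = nn.Linear(h_dim, 2 * z_dim)        # mean and log-variance of p(z_t | h_{t-1})
            self.rnn = nn.GRUCell(h_dim + z_dim, h_dim)     # decoder recurrence
            self.write = nn.Linear(h_dim, img_dim)          # stand-in for attentive write to the canvas

        def forward(self, x_context, h, canvas):
            r = torch.tanh(self.read(torch.cat([x_context, h], dim=-1)))
            mu, logvar = self.prior(h).chunk(2, dim=-1)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized latent sample
            h = self.rnn(torch.cat([r, z], dim=-1), h)
            canvas = canvas + self.write(h)                 # additive canvas update
            return h, canvas

    # At training time z would instead come from an inference network q(z_t | x_target, x_context, h),
    # with a KL term against the prior above; after T steps the canvas parameterizes p(x_target | x_context).

Seen this way, the contribution over Bachman et al. amounts largely to the choice of read/write operators and the handling of the inference-side recurrent state.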
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The general area of research considered in this paper is very interesting, i.e. sequential conditional generative models, spatial attention mechanisms, and stochastic variational inference. The conceptual novelty of the work presented here is modest.

The experiments on one-shot generalization are interesting, though it's unclear what we can learn from them. Here, the "one-shot generalization" task is equivalent to generative structured prediction with a deep directed generative model trained using stochastic variational inference. Several papers have previously proposed methods for this problem, none of which are compared against in the paper under review.

The spatial-transformer variant of attention is a straightforward swap of the fixed Gaussian grids from the DRAW paper for the deformable Gaussian grids of Jaderberg et al. (NIPS 2015). The novelty of the work by Jaderberg et al. was not so much their use of a deformable grid as their application of the attention mechanism in a novel context (i.e. a feedforward convnet). Here, that novelty is lost by bringing the flexible attention mechanism back to the setting where the initial differentiable, but rigid, attention mechanism was introduced.

The use of a convolutional GRU for the canvas is interesting, but what it contributes beyond the additive canvas is not sufficiently explored in the experiments. The GRU canvas performs better than the additive canvas on the binarized MNIST benchmark, but significantly worse on the 64x64 MNIST digit-pairs test. Why? Is this a problem of overfitting, training difficulties, or random inter-task variability in the performance of each approach? It would also be nice to see some discussion of how this use of a convolutional GRU relates to the way spatial LSTMs have recently been applied to mimic the behaviour of MRFs in computer vision; see, e.g., "Scene Labeling with LSTM Recurrent Neural Networks" by Byeon et al. (CVPR 2015). It's easy to believe that taking some burden off the "painter", by allowing the canvas to control for local consistency, could be beneficial, but there is no strong evidence of this in the experiments.
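For reference, the distinction at issue between the two canvas types is roughly the following. This is my own minimal sketch (the paper may parameterize the gates differently), not the authors' exact formulation.

    # Sketch of the two canvas updates under discussion: a convolutional GRU canvas
    # versus the plain additive canvas used in DRAW.
    import torch
    import torch.nn as nn

    class ConvGRUCanvas(nn.Module):
        def __init__(self, channels=1, k=3):
            super().__init__()
            p = k // 2
            self.gates = nn.Conv2d(2 * channels, 2 * channels, k, padding=p)  # update and reset gates
            self.cand = nn.Conv2d(2 * channels, channels, k, padding=p)       # candidate canvas

        def forward(self, canvas, write):
            zr = torch.sigmoid(self.gates(torch.cat([canvas, write], dim=1)))
            z, r = zr.chunk(2, dim=1)
            c_tilde = torch.tanh(self.cand(torch.cat([r * canvas, write], dim=1)))
            return (1 - z) * canvas + z * c_tilde   # gated, spatially local update

    def additive_canvas(canvas, write):
        return canvas + write                        # the plain additive update

The gated update lets the canvas locally overwrite and smooth its own contents, which is the intuition behind taking burden off the painter; whether that actually helps is exactly what the experiments leave open.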
The conditional version of the proposed model directly extends prior work. The main changes are the addition of a spatial attention mechanism and the removal of the recurrent state used only for inference. The former change is obvious in the context of DRAW, and the importance of the latter change isn't thoroughly explored in the experiments. A closer look at this latter point, i.e. if/when there is any benefit to the inference-only recurrent state, would make the experiments more informative. The value of models with multiple steps of latent variable sampling for structured prediction problems, like generating random variations of an input character, has been explored previously in, e.g., "Learning Structured Output Representation using Deep Conditional Generative Models" by Sohn et al. (NIPS 2015), inter alia. While the "one-shot generalization" experiments are an interesting application of deep generative structured prediction, the underlying model is an incremental step from existing work.

------------ Other Notes: ------------

It's not clear what protocol the authors followed in their binarized MNIST experiments. There are two versions of this dataset for which multiple results have been published in significant papers: one uses a fixed binarization of the digits, and the other binarizes the digits on the fly. Table 1 includes results from both versions of the dataset, and the experiment description in the main text doesn't clearly state which version of the test was performed. The Salakhutdinov citation suggests the dynamic binarization, but the strong dependence on the DRAW paper suggests the fixed binarization.

The stated numbers of steps for the original DRAW results in Table 1 are wrong. The original paper used 16 sampling steps (at least, based on examination of their figures) in the attention-free model and 64 steps (according to their Table 3) in the attention-based model.

The illustration of the conditional generative model in Figure 2 is a bit hard to interpret and should include some mention of what the various symbols represent. A detailed description of this model was not provided in the main text, so this figure should convey more information.

The results in Table 3 represent a rather unfair comparison between permutation-invariant models and a model tailor-made for images. Comparing to, e.g., a convolutional VAE/IWAE would make more sense.

The results in Table 1 for the permutation-invariant sequential model with a convolutional GRU canvas seem unreasonably poor. A basic permutation-invariant DRAW-like model with _no_ canvas easily scores below 88.0 on this benchmark. This suggests that perhaps hyperparameters weren't tuned for each model, which makes it hard to draw any conclusions from these numbers.

---- Rant:
It would be nice not to see the "one-shot ___" terminology so often. Here, it just obscures the relations which the model learns through supervised pairing of inputs. The Omniglot tasks in this paper are structured prediction problems with multi-modal output distributions. The model is trained in the standard way, via supervised learning with (input, output) pairs. In this setting, producing an appropriate output distribution for an input which hasn't been encountered before is generalization in the standard sense. The information the model must learn to perform well at this task, i.e. class-preserving transformations in the context of Omniglot characters, is fairly sophisticated. It's cool that the model can successfully distil this information from the (input, output) pairs provided to it during training, but the task is neither one-shot nor unsupervised.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a new variational autoencoder architecture that is similar to the previously introduced DRAW model, but differs from it in a couple of aspects (most prominently, in the exact attention mechanism and in the use of a canvas in the write operation during generation). The proposed architecture is found to perform very well at image generation on several datasets. A related conditional model is trained to generate novel samples of Omniglot characters given one example. The generated samples are found to be visually compelling.

Clarity - Justification:
The work is presented clearly for the most part. Some things remain unclear:

1. The posterior is said to be able to represent jointly non-Gaussian posterior distributions. However, equations 10 and 11 indicate that the posterior is actually Gaussian: the sample z_t is independent of z_{<t} given x (the implied factorization is written out just after this list).

2. The details of the training procedure for generating a novel sample of a character conditioned on an example of that character are not fully explained. Does the conditional model share any parameters with the unconditional model? Is it trained on all available pairs of character samples, where one character in the pair is used for conditioning and the other as the prediction target? The experimental details are somewhat vague, and hence the experiments are not fully reproducible.
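To spell out the concern in point 1 (in LaTeX, using my own notation for the per-step means and variances): as I read equations 10 and 11, the approximate posterior factorizes as

    q(z_{1:T} \mid x) = \prod_{t=1}^{T} q(z_t \mid x)
                      = \prod_{t=1}^{T} \mathcal{N}\!\big(z_t \mid \mu_t(x), \operatorname{diag}(\sigma_t^2(x))\big),

which is jointly Gaussian (a product of conditionally independent Gaussian factors). Representing a jointly non-Gaussian posterior would require each z_t to depend on the previously sampled z_{<t}, not only on x.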
Significance - Justification:
The task of one-shot generalization is an interesting one, and there has not been much research on this topic. This paper suggests a generic approach to the task, which essentially consists of training a state-of-the-art conditional generative model on previously available examples. The paper shows that this approach is enough to generalize to previously unseen examples. The paper also contributes experimental data on variants of the DRAW model and the attention mechanisms employed within it.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The reviewer recommends conditionally accepting the paper based on its contributions towards the task of generating related examples from a single example. The acceptance should be conditional on the authors clarifying the points raised under Clarity above (the shape of the posterior, and the experimental setup for conditional generation).

=====