All reviewers:
We thank the reviewers for their feedback. We have two overarching motivations in this paper. 1.) Lake et al. see the one-shot generalization properties as a ‘challenge for neural models’, and we aimed to show that this is not the case. Furthermore, one-shot generalization is possible in models that use only limited forms of domain knowledge.
2.) We wish to stress-test our models to better understand their limitations. Work on deep generative models has progressed significantly in recent years, and we show that the established methodology, with some modifications, is mature and has many applications beyond those explored in the initial papers.
Reviewer 1:
The complexity of the model is no different from that of most deep generative models and scales linearly with the number of generation steps. In practice, this means that the models behind the results we show were trained for 4-6 days using 4 GPUs and small minibatches. We did not explore hyperparameters extensively. The most important are the number of computation steps (80-100 worked well for us), the choice of the canvas transition function (additive worked well in all cases) and the type of attention used (spatial transformer being the most generic).
The strong generalisation task is very hard, and you are right that the results are closer to unconditional sampling. This test is highly subjective: common features can sometimes be seen, though not always. This result shows that there is much more that can be done to improve these models.
Reviewer 3:
The latent variables of the model are the entire set of variables z_1, …, z_T. While the conditional distribution at each time step is a diagonal Gaussian, the joint distribution of the collection is not Gaussian, due to the *non-linear* dependency between steps, which induces multimodality. This is an important property for variational inference, since without modifications to the model structure (e.g., auxiliary variables) or the bound (e.g., Monte Carlo objectives) we can compose tractable building blocks into richer distributions that allow for accurate inference.
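To illustrate the point about non-Gaussian joints, here is a minimal two-step sketch, not our model: z_1 is a standard Gaussian and z_2 given z_1 is Gaussian with a mean that depends on z_1. The transition functions, the noise level and the helper names (`sample_chain`, `excess_kurtosis`) are assumptions chosen only for illustration. With a linear dependency the marginal of z_2 stays Gaussian (excess kurtosis near zero); with a non-linear one it becomes bimodal.

```python
import numpy as np


def excess_kurtosis(x):
    """Excess kurtosis of a sample; approximately 0 for a Gaussian."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0)


def sample_chain(transition, n_samples=100_000, noise_std=0.3, seed=0):
    """Sample z1 ~ N(0, 1) and z2 | z1 ~ N(transition(z1), noise_std^2)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n_samples)
    z2 = transition(z1) + noise_std * rng.standard_normal(n_samples)
    return z1, z2


# Linear dependency: the joint (z1, z2) is Gaussian, so z2 is Gaussian.
_, z2_linear = sample_chain(lambda z: 0.8 * z)

# Non-linear dependency (illustrative choice): each conditional is still
# Gaussian, but the marginal of z2 has two modes near -3 and +3.
_, z2_nonlinear = sample_chain(lambda z: 3.0 * np.tanh(5.0 * z))

kurt_linear = excess_kurtosis(z2_linear)
kurt_nonlinear = excess_kurtosis(z2_nonlinear)
print(f"linear: {kurt_linear:+.3f}  non-linear: {kurt_nonlinear:+.3f}")
```

A strongly negative excess kurtosis in the non-linear case reflects the two separated modes, even though every building block is a simple Gaussian conditional.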
The training of the conditional models was not well explained in the version under review, and we have already made improvements to make this clearer and more reproducible. The conditional and unconditional models do not share parameters: the underlying models are the same, but they are trained independently. As you correctly say, the conditional model is trained using pairs of characters chosen at random from the training data set, with one of the characters used for conditioning. For the weak generalisation test, the pairs are 2 exemplars of the same character; for the strong generalisation test, they are 2 exemplars of the same alphabet.
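The pair construction described above can be sketched as follows. This is a hedged illustration rather than our training code: the dataset layout (a dictionary keyed by alphabet and character), the function name `sample_pair` and the toy exemplar labels are all hypothetical, and the sketch assumes every character has at least two exemplars.

```python
import random


def sample_pair(dataset, mode, rng):
    """Return (conditioning, target), each a ((alphabet, character), exemplar) tuple.

    dataset: dict mapping (alphabet, character) -> list of exemplars.
    mode: "weak" pairs exemplars of the same character,
          "strong" pairs exemplars of the same alphabet.
    """
    keys = sorted(dataset)
    if mode == "weak":
        key = rng.choice(keys)
        pool = [(key, exemplar) for exemplar in dataset[key]]
    elif mode == "strong":
        alphabet = rng.choice(sorted({a for a, _ in keys}))
        pool = [(key, exemplar)
                for key in keys if key[0] == alphabet
                for exemplar in dataset[key]]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Two distinct exemplars: one is conditioned on, the other is the target.
    conditioning, target = rng.sample(pool, 2)
    return conditioning, target


# Toy stand-in for the training set (strings instead of images).
toy_dataset = {
    ("latin", "a"): ["a0", "a1", "a2"],
    ("latin", "b"): ["b0", "b1", "b2"],
    ("greek", "alpha"): ["alpha0", "alpha1", "alpha2"],
    ("greek", "beta"): ["beta0", "beta1", "beta2"],
}

rng = random.Random(0)
print(sample_pair(toy_dataset, "weak", rng))
print(sample_pair(toy_dataset, "strong", rng))
```

In the strong setting the two exemplars may or may not come from the same character; the only constraint in this sketch is the shared alphabet.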
Reviewer 4:
Thank you for the rant. We agree with you in many ways, which is why we prefer the more precise description of (conditional) ‘density estimation’ rather than ‘unsupervised’ tasks. We use the term one-shot since this is what is commonly used, and it allows for a contrast with Lake et al. and for incorporating perspectives from cognitive science. Since we don’t learn from the exemplars at test time, we could instead say zero-shot. This begins to expose the differences between one-shot generalization and inference (which we do) and one-shot learning (which we do not do), and is now discussed in our paper. Significant improvements have already been made that we think address the concerns of your review.
We use the fixed, binarized MNIST data set that consists of 50k training, 10k validation and 10k test points, and we try to carefully account for this. This is the same data set used by many papers including Salakhutdinov and Murray, Rezende et al., Gregor et al., etc. All the results in our table use the fixed binarization.
Thank you for correcting the number of steps for the DRAW results (which we will update).
We will also make the diagrams and corresponding explanations clearer and more explicit.
We agree on Table 3. We report the numbers on the 28x28 Omniglot data set to show the progress made in modelling this data set and the effectiveness of our approach, mimicking Table 1, which has similar limitations. We can also add results for a convolutional VAE (though they will not achieve the performance we report here).
The poor test performance of the CGRU for the last entry in Table 1 is due to the fully-connected attention, which increases the difficulty of the learning problem.