Paper ID: 551
Title: Learning to Generate with Memory

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an extension to the variational autoencoder (Kingma & Welling, 2014) that contains a form of memory. The authors show improvements over several other models. The notion of memory here appears to be a gated linear layer; there is no recurrence. For this complexity of model, I would have expected the authors to compare to DRAW (Gregor et al., 2015), which lies in the same model class and performs better on several of the data sets.

Clarity - Justification:
The model is presented piece by piece, which, for a rather large model, takes quite a bit of time and makes it difficult to put together. The experiments are well explained with lots of details.

Significance - Justification:
The proposed model adds a gated linear layer as a form of memory, and introduces some extra constraints to the loss that appear to run counter to the principled variational objective.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is nice, but the improvements over the included results seem slight given the increase in the number of parameters. If results from DRAW are included in Table 2 (DRAW also uses a form of memory for generative models, based upon LSTMs), then the results here are far from state of the art.

I think the paper would be improved by investigating exactly what "M" is. It looks like extra capacity of some sort is beneficial. Why is this particular kind useful? Is it faster to train than more complicated models? If it is memory, then what is it remembering that helps with subsequent tasks?
=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes to augment the deterministic layers in the generative network of a VAE/IWAE with a memory module. It presents experimental results indicating that this architectural change allows one to train generative models with one stochastic layer that achieve better log-likelihoods on the MNIST and OCR-letters datasets than equivalent models with no memory module.

Clarity - Justification:
The paper is clear. The results seem to be reproducible, but no source code is provided.

Significance - Justification:
The results are an incremental improvement over the previous body of work, but they do contribute towards the experimental results on different architectures for generative models. The main contribution of the work is in experimentally verifying that adding a memory module to the generative network of a VAE/IWAE does not lead to overfitting, and also allows one to train models that achieve better log-likelihoods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
One omission of the paper is that the experimental baseline (VAE/IWAE without a memory network) has fewer parameters than the proposed architecture (with the memory network). This could be usefully remedied by adding additional deterministic hidden units to the baseline. The reviewer suspects this wouldn't change the comparison drastically, but it would be great to be sure. [Addressed in rebuttal]
=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a simple and novel idea: introducing an external memory (with a soft attention mechanism) into the generative pathway of a VAE.
The same memory could easily be plugged into a variety of alternative generative models, though the authors only explore its viability in the context of a VAE. The authors suggest that such an external memory can relieve the encoding pathway of retaining detailed information about an input; instead, this type of information can be learned to be stored in the external memories of the generative pathway and retrieved during generation. This idea is similar to the ladder network in the sense that the ladder network introduces lateral connections so that increasingly higher levels of representation need not retain all the detailed information and can instead extract higher-level features.

Clarity - Justification:
The paper is clearly written and easy to follow.

Significance - Justification:
This method is simple, novel, and can be easily plugged into any existing generative modeling framework. Given the recent surge of interest in generative modeling of images, I think the paper is significant to the field.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
My only concern about the paper is the limited set of datasets explored. The authors consider MNIST and Frey faces, two datasets that are quite well modeled by existing approaches. I would like to see the method applied to at least CIFAR10, if not additional datasets. Generative modeling approaches for images are rapidly advancing, and there is a clear trend of moving beyond MNIST. Additionally, the proposed method should see even greater gains on more challenging image datasets.

Figure 6 compares a VAE and the MEM-VAE on the Frey faces dataset. The authors claim that the MEM-VAE generates more realistic samples. To my eye, they look comparable (though the authors had a pool of volunteers assess the quality, and the volunteers favored the MEM-VAE). One thing I do notice is that there appears to be less variability in the MEM-VAE generations when compared to the VAE generations. Could the authors comment on this?

Some minor typos:
- page 1: "But *none* efforts have been ..." -> "But *no* efforts have been ..."
- page 5: "Note that we cannot send the *massage* of a ..." -> "Note that we cannot send the *message* of a ..."
=====
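For concreteness, below is a minimal NumPy sketch of the kind of memory-augmented deterministic layer the reviews describe: a soft-attention read over a trainable memory matrix "M", combined with the layer's hidden activation through a learned element-wise gate. The shapes, parameter names, and exact gating form are assumptions made purely for illustration; they are not the authors' precise architecture.

# Illustrative sketch of a memory-augmented deterministic layer:
# soft attention over memory slots, then a gated mix with the hidden state.
# Shapes and parameter names are assumptions, not the paper's exact model.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def memory_layer(h, M, W_a, W_g, b_g):
    # h:   (batch, d)        hidden activation of a deterministic layer
    # M:   (num_slots, d)    trainable memory matrix ("M" in the reviews)
    # W_a: (d, num_slots)    maps h to attention logits over memory slots
    # W_g, b_g:              parameters of the element-wise gate
    a = softmax(h @ W_a)          # (batch, num_slots) soft attention weights
    r = a @ M                     # (batch, d) memory read: convex mix of slots
    g = sigmoid(h @ W_g + b_g)    # (batch, d) gate between h and the read
    return g * h + (1.0 - g) * r  # gated composition fed to the next layer

# Toy usage: one layer with 4 memory slots on a batch of 2 activations.
rng = np.random.default_rng(0)
d, slots = 8, 4
h = rng.normal(size=(2, d))
out = memory_layer(h, rng.normal(size=(slots, d)),
                   rng.normal(size=(d, slots)),
                   rng.normal(size=(d, d)), np.zeros(d))
print(out.shape)  # (2, 8)

In terms of this sketch, Reviewer #1's question about what "M" is amounts to asking what the rows (slots) of M come to store during training and which inputs attend to which slots.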