We thank all the reviewers for their valuable comments. We'll improve the paper in the final version. Below, we address the comments and clarify the misunderstandings of R3.

To R1: Thanks for the suggestions. We'll release our code after the blind review. We indeed tested a VAE with a 530-530-100 architecture (1559K parameters): the log-likelihoods trained with 1, 5 and 50 importance samples on MNIST are -85.69, -84.43 and -83.58, respectively, while the results of MEM-VAE with a 500-500-100 architecture (1550K parameters) are -84.41, -83.26 and -82.84, respectively. Using our memory therefore leads to much better results than simply increasing the model size. We'll add these results in the final version.

To R3:

Q1: Memory formulation: The comments on the memory are likely misunderstandings. In fact, our memory is parameterized by a pool of slots (i.e., a matrix M) rather than a gated linear layer; the gated linear layer is the function f_c that combines information from the memory and the generating process. The advantages of external memories over the memory units in LSTMs are extensively studied in recent work, e.g., Neural Turing Machines (NTM) (Graves et al., 2014) and Memory Networks (Weston et al., 2015), which are discussed in Sec-2. Moreover, these two types of memory can complement each other, e.g., both are used in NTM.

Q2: Comparison with DRAW: Our extensions of MEM-VAE over VAE are orthogonal to DRAW (see L165-170 and L178-185). Basically, DRAW uses visual attention and LSTMs to "draw" the track of objects (suitable for digits), while we employ external memories to encode the local variants of data (suitable for faces). As discussed in the response to Q1, different types of memory can be complementary, and we believe DRAW can be improved by employing external memories, which is future work. This paper focuses on the most direct competitors without visual attention (VAE & IWAE).

Q3: Extra constraints on the loss: We assume the constraints refer to the local reconstruction error terms. These terms are optional in our model (see L466). We also tested MEM-VAE without them and obtained a log-likelihood of -84.44 on MNIST (see L646-655), almost the same as the -84.41 reported in Table 1.

Q4: Model capacity: As stated in L117-132 and agreed by R4, our memory can capture the local details that are often lost in the encoding (recognition) pathway, and these details can be retrieved in the decoding (generation) pathway. Thus, the encoding pathway can focus on abstraction, while the decoding pathway can reconstruct inputs better with the extra information retrieved from the memory. Our results convincingly support this argument (see L637-645 and our response to R1 for the comparison to a larger baseline).

Q5: Time complexity: As described in Sec-4, we train MEM-VAE end-to-end efficiently. VAE and MEM-VAE take almost the same number of epochs to converge, and one epoch of MEM-VAE takes at most twice the time of VAE. Moreover, training can be sped up significantly by dropping the optional local reconstruction error terms (see the response to Q3).

Q6: What does the memory learn: MEM-VAE contains many nonlinearities, so it is hard to visualize the memory directly. The results in Fig. 3 provide some insight into the learned patterns. We further analyzed a simpler model, where f_c is element-wise addition and f_a has a softmax nonlinearity. This model achieves a log-likelihood of -84.68 on MNIST, slightly worse than the -84.41 of MEM-VAE but still better than the -85.67 of VAE.
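For concreteness, the following is a minimal PyTorch sketch of this simplified readout (the slot count, dimensions, and the way attention logits are computed from h_g are our own illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMemory(nn.Module):
    """Hypothetical sketch of the simplified memory readout discussed in Q6:
    the memory is a pool of slots (a matrix M), f_a is a softmax attention over
    slots computed from the generative information h_g, and f_c is element-wise
    addition of the retrieved content and h_g. All sizes are illustrative."""

    def __init__(self, n_slots=100, dim=500):
        super().__init__()
        self.M = nn.Parameter(0.01 * torch.randn(n_slots, dim))  # memory slots
        self.key = nn.Linear(dim, n_slots)                       # attention logits from h_g (assumed form)

    def forward(self, h_g):
        a = F.softmax(self.key(h_g), dim=-1)  # f_a: softmax attention over slots
        read = a @ self.M                     # retrieved memory content
        return h_g + read                     # f_c: element-wise addition
```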
We map each memory slot in layer one down to an image and compute its average activation over classes (see the sketch at the end of this response). We found that most slots respond to one class or to similar classes, and the corresponding image contains a blurry sketch of the digit with different local styles. This indicates that the external memories can encode local variants of objects and can be retrieved based on the generative information h_g. We'll add these results.

To R4:

Q1: More datasets: Thanks for the suggestion. We compared to the most direct competitors (VAE & IWAE) on the MNIST and Frey Faces datasets for a fair comparison to their published results. To compare with a broader family of models, we also evaluate MEM-VAE on the OCR-letters dataset. Based on our experience, we agree that memory could be more helpful on more challenging datasets. We'll investigate our memory in more models, such as DCGAN (Radford et al., 2015) and LAPGAN (see Sec-6).

Q2: Evaluation: In fact, MEM-VAE consistently outperforms VAE under various criteria. For visualization, we prefer samples that capture the local details of the main objects well, and we tried our best to compare all models fairly by majority voting among several volunteers. The better results on log-density estimation (Table 1) and missing value imputation (Table 3) reflect the ability of MEM-VAE to capture the "variety" of samples, because these results are averaged over all the test data, which have diverse styles.

Q3: Typos: Thanks. We'll correct them.
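For reference, here is a minimal sketch of the slot-visualization procedure mentioned at the start of this response (the decoder and attention interfaces are assumptions for illustration; the actual mapping used in the paper may differ):

```python
import numpy as np

def visualize_slots(M, decode_fn, attention_fn, data, labels, n_classes=10):
    """Hypothetical sketch: map each layer-one memory slot to image space and
    average its attention activation per class.

    M            : (n_slots, slot_dim) memory matrix
    decode_fn    : maps a slot vector to an image (assumed available)
    attention_fn : maps an input's generative info h_g to (n_slots,) softmax weights
    """
    slot_images = np.stack([decode_fn(slot) for slot in M])  # (n_slots, H, W)

    # Average attention weight of each slot, accumulated per class.
    acts = np.zeros((n_classes, M.shape[0]))
    counts = np.zeros(n_classes)
    for x, y in zip(data, labels):
        acts[y] += attention_fn(x)
        counts[y] += 1
    class_activation = acts / counts[:, None]  # (n_classes, n_slots)
    return slot_images, class_activation
```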