We thank all the reviewers for their time and helpful comments.

#Reviewer_1
(1) "key differences between the two frameworks" Yes, it is an alternative framework. The model of [Kingma, Rezende, Mohamed, Welling, 2014] still follows the VAE (variational auto-encoding) framework, and its generative process is h -> (x, y). Although the class label y is involved in the variational distribution q(h|x,y), the prior over the hidden variable p(h) remains the standard Gaussian N(h|0,I). In contrast, our framework extends VAE so that the generative distribution p(h|x) can be conditioned on the observation x, and the generative process is x -> h -> y. Hence, p(h|x) is no longer a standard Gaussian prior but a parameterised diagonal Gaussian N(h|µ(x),σ^2(x)), which is jointly learned with the variational distribution q(h|x,y) during training. Thanks for pointing this out; we've updated this part in the latest version.

#Reviewer_3
(1) Section 3 & 4
"LSTMs for NVDM, bag-of-words for QA" For document modelling we use a bag-of-words model because we want to keep the conditional independence assumption (LSTMs would break it), so that NVDM is directly comparable to the other topic models. For the QA task, vector representations of single words are not sufficient, so the distributed representations of sentences modelled by LSTMs are required to provide deeper language understanding.
(2) Section 5.1 & 5.2
"document classification" We're very willing to explore this point. In addition, in Table 1(a), fDARN [Mnih & Gregor, 2014] is the model that 'replaces h with a discrete latent variable' mentioned by the reviewer.
"experimental procedure" The experimental procedure follows [Mnih & Gregor, 2014]. We build a validation set by removing a random subset of 100 observations from the training set for early stopping.
We also apply random restarts, but the Adam optimizer always gives us almost the same best result for each trial, so initialisation is less important in this case. We've updated this section in the latest version.
(3) Section 5.3 & 5.4
"varying the size of training data" This is a very helpful suggestion, and we plan to update this section. Here we attach the MAP scores of our preliminary experiments varying the size of the WikiQA training data. The last column gives the results achieved by removing the 40% dropout.

DataSize    5000    10000   15000   full    full (no dropout)
LSTM+Att    0.6079  0.6549  0.6808  0.6855  0.6760
NASM        0.6119  0.6609  0.6842  0.6886  0.6806

According to this table, we have not observed more pronounced benefits on small datasets, but NASM still keeps a lead of around 0.4% in each setting. Certainly, it is very difficult to push the performance further on this QA task, but the most significant point is that the introduction of stochastic units makes a difference even compared to the state-of-the-art LSTM+Att. Another reason is that our LSTM+Att baseline has been carefully fine-tuned. The improvement of NASM on the validation set is very prominent, but it turns out to be less significant on the test set. Given that the training set is very small, we consider the improvement non-trivial. In addition to dropout, we also explored L2 regularisation for the MLPs in NASM, but it brought no improvement.
"Hinton diagram" We have a table of MAP scores stratified by question type, where the number in brackets is the count of questions of that type in the test set:

how:   0.5240 (1314)
when:  0.5650 (457)
what:  0.7144 (3168)
where: 0.7334 (471)
who:   0.7739 (755)

Hence, empirically, 'how' questions are harder to 'understand and answer'. We've further clarified these points in the latest version.

#Reviewer_4
(1) "generating a document from a latent continuous vector" Yes.
LDA can also be thought of as generating a document from a latent continuous vector; the key difference lies in the generative distribution of the words. We've been informed that another research group is working on applying neural variational inference to LDA and its variations, and it will be interesting to see the comparisons.
(2) "document lengths N" Yes, we were somewhat lax in leaving out the document-length component of the generative model, as it is always observed in the training and testing scenarios we explored. We are happy to include this for completeness in the final version.
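To make the contrast with a standard Gaussian prior (raised by Reviewer 1) concrete, here is a minimal sketch of sampling from the conditioned prior p(h|x) = N(h | µ(x), σ^2(x)) with the reparameterisation trick. The linear maps and dimensions are placeholders for illustration only; in the actual model µ(x) and σ^2(x) are networks learned jointly with q(h|x,y).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
X_DIM, H_DIM = 8, 4

# Toy linear stand-ins for the learned parameterisation of
# p(h|x) = N(h | mu(x), diag(sigma^2(x))).
W_mu = rng.normal(scale=0.1, size=(H_DIM, X_DIM))
W_logvar = rng.normal(scale=0.1, size=(H_DIM, X_DIM))

def conditional_prior(x):
    """Return the mean and log-variance of p(h|x)."""
    return W_mu @ x, W_logvar @ x

def sample_h(x):
    """Reparameterised sample: h = mu(x) + sigma(x) * eps, eps ~ N(0, I)."""
    mu, logvar = conditional_prior(x)
    eps = rng.standard_normal(H_DIM)
    return mu + np.exp(0.5 * logvar) * eps

# Generative process x -> h (-> y): the latent code depends on the observation.
x = rng.standard_normal(X_DIM)
h = sample_h(x)
```

Because the noise eps is separated from the parameters, gradients flow through µ(x) and σ(x), which is what allows the conditioned prior to be trained jointly with the variational distribution.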
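Since several of the numbers quoted above are MAP scores, the following standalone sketch shows how MAP is computed for answer-sentence selection; stratifying by question type simply means grouping the questions by their wh-word before averaging. The scores and labels below are toy values, not taken from our experiments.

```python
def average_precision(scores, labels):
    """AP for one question: labels are 1 for correct answer sentences."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_question):
    """MAP over (scores, labels) pairs, one pair per question."""
    aps = [average_precision(s, l) for s, l in per_question]
    return sum(aps) / len(aps)

# Toy example: two questions with hypothetical model scores.
data = [
    ([0.9, 0.2, 0.4], [1, 0, 0]),  # correct answer ranked first  -> AP = 1.0
    ([0.1, 0.8, 0.3], [0, 0, 1]),  # correct answer ranked second -> AP = 0.5
]
print(mean_average_precision(data))  # -> 0.75
```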