Paper ID: 804
Title: Neural Variational Inference for Text Processing

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper describes two applications of latent variable models to text modeling problems, one generative (document modeling) and one conditional (predicting whether an answer is correct given a question-answer pair). A variational autoencoder approximation procedure is used together with a Gaussian variational posterior with diagonal covariance, whose parameters are estimated using deep neural networks. The paper is valuable because it demonstrates, more convincingly than the original variational autoencoder paper, some of the practical applications of generative models, and it shows that high-dimensional models with large output spaces can be modeled effectively. It also shows that the approximation is useful not only in joint or unconditional models, but in conditional models as well.

Clarity - Justification:

The paper is generally quite easy to read. However, it is repetitive in places (while returning to the same common ideas is definitely good, some of this tends in the direction of copy-paste language, which is a bit distracting). Overall, the paper is admirably clear, and it would be relatively easy to reimplement the ideas based on its contents.

Significance - Justification:

Generative neural models of text are important, and showing that the variational autoencoder idea can be used for inference in them is a very useful contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

- “the large and deep models which are now fashionable”: some intuition for why, despite such deep and fashionable models, we still need generative models with “upstream” latent variables would be a welcome discussion in a paper like this. For example, it is perfectly easy to define a pretty good document model by generating a sequence of words with an RNN.
- “to approximate the intractable distributions over the latent variables”: give some characterization of what this approximation looks like. Is there a particular parametric form of the density being assumed here? (This would be helpful to readers so they don’t have to wait until late in the paper to get to the good stuff.)
- Some brief discussion in the introduction about the limitations imposed by the reparameterization trick would be welcome. For example, multivariate Gaussians are easy, but the trick won’t work for discrete latent variables. This does come back around lines 237-244, but that passage is a bit speculative; perhaps it would be more appropriate as a footnote where the reparameterization trick is first mentioned. (A sketch of the distinction is given at the end of these comments.)
- “A primary feature of NVDM is that each word is generated directly from a dense continuous document representation”: I’ll grant that in neural document models discrete binary representations have been quite common, but LDA can also be thought of as generating a document from a latent continuous vector! Including the actual model form (e.g., p(x|h)p(h)) and distinguishing it from the “inference form” might also help with clarity.
- “which generates all the words in a document independently”: maybe more clearly “conditionally independently given h”?
- Section 3 (NVDM): maybe make it explicit that the document lengths are given. Poisson draws would also be a way of dealing with length (see, e.g., Guimaraes 2003).
- Lines 771-772: “for every different conditions” doesn’t make sense.
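To illustrate the reparameterization point concretely, here is a minimal sketch of the standard trick (illustrative code, not taken from the paper; the variable names and dimensions are stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for the outputs of the inference network: the mean and
    # log-standard-deviation of the diagonal-Gaussian variational posterior.
    mu = np.array([0.5, -1.0])
    log_sigma = np.array([-0.3, 0.1])

    # Reparameterized sample: h is a deterministic, differentiable function
    # of (mu, log_sigma) once the noise eps is drawn, so gradients of any
    # downstream loss can flow back into the inference network.
    eps = rng.standard_normal(mu.shape)
    h = mu + np.exp(log_sigma) * eps

    # Contrast with a discrete latent variable: the categorical draw below
    # is not a differentiable function of the logits, which is why the
    # trick does not carry over directly to discrete h.
    logits = np.array([1.0, 0.2, -0.5])
    probs = np.exp(logits) / np.exp(logits).sum()
    k = rng.choice(len(probs), p=probs)  # no gradient path back to logits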
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper presents two neural variational models based on the framework of [Kingma, Rezende, Mohamed, Welling 2014]. The first is a bag-of-words document model and the second is a question answering model. The general idea is to employ a neural network to output the parameters that control a generative model of the data. Alternatively, the models can be interpreted as auto-encoders whose hidden layer consists of stochastic Gaussian units. Experimental results indicate that both variational models are superior to their 'deterministic' (non-stochastic) counterparts, though the experiments do not address why they are better.

Clarity - Justification:

The paper is clearly written, with one exception: it is unclear whether the purpose of the paper is to propose an alternative framework to that of Kingma, Rezende, Mohamed, and Welling, to demonstrate that the existing framework is suitable for NLP, or both. Either way, the paper can make a valid contribution, but if part of the goal is indeed to propose a new framework, it would be helpful to describe the key differences, where appropriate, in Section 2 and/or the related work.

Significance - Justification:

The paper demonstrates that the relatively new neural variational framework is suitable for NLP tasks and indeed works better than more conventional deep learning approaches.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The neural variational framework affects both the training and the inference of a traditional neural network; it would be interesting to see whether training or inference (or both) contributes to the improved performance. You speculate in the discussion that it is due to a regularization effect. Have you done experiments to confirm this? Specifically, if it is a regularization effect from the Gaussian prior, as you suggest, then directly applying a regularizer to the hidden weights of the deterministic network during training should result in similar improvements. If not, then perhaps the performance increase is due to the inference procedure itself being more Bayesian, in the sense that multiple samples are taken in order to make a prediction at test time (this would be interesting as well).
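To make the regularization reading concrete, recall the standard decomposition of the variational objective (standard notation, nothing specific to this paper):

    \mathcal{L}(x) \;=\; \mathbb{E}_{q_\phi(h|x)}\big[\log p_\theta(x|h)\big] \;-\; \mathrm{KL}\big(q_\phi(h|x)\,\|\,p(h)\big) \;\le\; \log p_\theta(x).

The KL term penalizes approximate posteriors that drift from the N(0, I) prior, which is the precise sense in which the stochastic units could act as a regularizer; an L2 or weight-decay penalty on the deterministic baseline would be the natural control for isolating that effect.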
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper proposes two models: (1) NVDM, an unsupervised bag-of-words model for documents in which words are generated from a latent continuous vector, and (2) NASM, a supervised classification model for question answering which uses a latent probabilistic attention mechanism. The motivation behind the proposed models is increased robustness to overfitting compared to non-probabilistic approaches. Neural variational algorithms are used for inference (Rezende et al., 2014; Kingma & Welling, 2014). The proposed models and associated inference methods are empirically validated and found to outperform several strong baselines.

Clarity - Justification:

Overall, I found the technical presentation fairly easy to follow (with helpful figures and clear descriptions). My main gripe with the clarity of the paper is that some of the claims are oversold and some statements are quite vague or unsubstantiated (see detailed comments). For instance, it's not clear to me that the specific variational inference algorithms proposed constitute a framework for text modeling, let alone support the stronger claim that this extends to NLP in general (i.e., structured prediction).

Significance - Justification:

For both settings (unsupervised and supervised), one of the main claims seems to be that the proposed models, which use a variational inference network, should do better because they are more resilient to overfitting yet more expressive than fully collapsed models like LDA. I find this good motivation and quite plausible. However, the experiments don't really explore this question. For one, the supervised QA task shows marginal improvements over a baseline non-probabilistic model (LSTM); my guess is that the improvements aren't statistically significant (it may be good to report this given the small size of the datasets). To explore the overfitting question, it would be interesting to see performance as the amount of data is varied for both tasks. It would also be interesting to explore different ways of regularizing the non-probabilistic baselines (e.g., LSTMs). It is stated that "40% regularization" is used for the LSTM, but I think this is a key parameter to explore given the claims of the paper. Other types of regularization, such as L2, would also be interesting to explore.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Section 2
- “We define a generative model with a latent variable h, which can b considered as the stochastic units in deep neural networks.” I found this sentence quite confusing; it would be useful either to make it more precise or to drop the analogy.
- “can be any type of deep neural network”: do the networks really need to be “deep”?

Section 3 (NVDM)
- The distribution over document length is left unspecified, so you couldn’t sample documents from the proposed model without fixing N in advance. This is a technical point, but it probably warrants some discussion, such as saying that N is always observed, so its distribution doesn’t matter for the purposes of the experiments in this paper.

Section 4 (NASM)
- The jump to using LSTMs to model text, compared to bag-of-words in the previous model, is abrupt and probably warrants a sentence or two of explanation. In particular, why not use LSTMs for NVDM (it has been done before for document modeling)?
- Would a bag-of-words model as in NVDM work for the QA task? Why are LSTMs necessary? It would be nice to experimentally validate this modeling choice, since it isn’t really justified in the main text.

Section 5.1 & 5.2
- Table 1(a): LDA “dimensions” (presumably topics; this should be clarified) aren’t equivalent to the dimensions in the other models. Perhaps a more useful comparison here is the number of parameters in each of the models.
- Table 1(a): What does bold mean? Given the size of the datasets, it would be good to know how significant the differences are (i.e., statistical significance tests should be reported).
- It would be interesting to see whether the proposed method is useful for document classification, which 20newsgroups is often used for. If not more results, then I’d like to see some discussion of adapting the generative story for this case (another level on top of h? or just replace h with a discrete latent variable?).
- Table 2 incorrectly describes LDA as a mixture model (it is a mixed-membership model).
- “we use the variational lower bound […] to compute perplexity”: this should read “to approximate perplexity”; in fact, the lower bound on the log-likelihood yields an upper bound on the true perplexity (spelled out at the end of this review).
- The experimental procedure isn’t fully documented. Was the model trained multiple times with random restarts and the best run on training data taken, according to perplexity? How many restarts were necessary? Was a separate development set used as a stopping criterion for SGD?
- Why not compare to a regular language model, e.g., an n-gram model or an LSTM?

Section 5.3 & 5.4
- One of the main claims seems to be that the proposed models outperform non-probabilistic baselines such as LSTMs because they are resilient to overfitting. To empirically evaluate this claim, it would be interesting to see the relative performance of LSTM vs. NASM as the amount of training data is varied. Basically, the performance difference between the LSTM and NASM seems very small (and probably not statistically significant?). Would one hour of annotator time be enough to push the LSTM performance up to or beyond that of the proposed method? Vice versa, are the benefits of the proposed method more pronounced with less training data? I think investigating this would strengthen the paper.
- The Hinton diagram in Figure 5 and the associated discussion in the text are quite confusing. Several points there aren’t justified or are unclear, e.g., (1) the claim that “how” questions are harder to “understand and answer,” and (2) why “knowledge of semantic ambiguities” would be useful for the QA task unless one is trying to capture the full posterior with confidence estimates in the answers (which isn’t evaluated, only the ranking). Unless I’m missing something, it might be worth clarifying this point.

Section 6
- “sum over all the possibilities in terms of semantics”: this is so vague that it would probably be better to leave it out.
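To spell out the perplexity point from Section 5.1 & 5.2 above (a standard derivation, not specific to this paper): writing L_d <= log p(x_d) for the ELBO of document d and N_d for its length, the reported quantity under the per-document averaging convention is

    \exp\Big(-\frac{1}{D}\sum_{d=1}^{D}\frac{\mathcal{L}_d}{N_d}\Big) \;\ge\; \exp\Big(-\frac{1}{D}\sum_{d=1}^{D}\frac{\log p(\mathbf{x}_d)}{N_d}\Big) \;=\; \mathrm{PPL},

so substituting the lower bound on the log-likelihood produces an upper bound on the true perplexity; the direction of the inequality is the same under corpus-level normalization.

=====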