Paper ID: 665
Title: Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Claims that "most, if not all, tasks can be cast as a question answering problem." Describes a system that makes use of recurrent networks, attention, and an episodic memory module. Describes experiments on sentiment analysis, POS tagging, and a simple question-answering domain. The main novelty appears to be the episodic memory component.

Clarity - Justification:
See below.

Significance - Justification:
See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The novel contribution of this paper appears to be the episodic memory component. However, it is very poorly explained -- roughly half a column of text. The other components seem to be fairly standard in the literature. The authors need to do a much better job of describing what is novel, and of giving a thorough and precise explanation of the novel aspects of the work.

The results on sentiment analysis and POS tagging are nothing too surprising. The QA results look good, but it's a very toy domain.

I think the first paragraph of the paper is somewhat bizarre. To first say that most tasks in NLP can be cast as QA problems, and then to say that translation is a QA problem where the question is "What is the translation into French?", does not seem useful. By the same token, any problem in machine learning is a QA problem ("What is the label of this image?" or "What is the 2D map given this set of sensor measurements?"). Or, to push this further, any problem in mathematics ("Is P equal to NP?") or the sciences. The implication that the system the authors describe can solve any QA problem, and therefore any problem in NLP, is again not useful. If there is some concrete level of abstraction the authors are intending here (for example, learning to map pairs of strings to some other string), then they should be explicit about this.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a memory-based neural network, where multiple passes over the long-term memory are performed with attention. Each pass gets fed to a temporary ("episodic") memory which decides where to look next. (A sketch of this loop appears after this review.)

Clarity - Justification:
The paper is relatively clear, except that I got stuck on Section 2.3. I was trying to understand the functional form of the episodic memory module from the words at the beginning of that section, but the information wasn't there (it was described later on). Some sort of signaling would be helpful. Also, there is duplicate material at lines 330-338.

Significance - Justification:
There has been a lot of recent work on memory networks, and the proposed model in the paper looks very good. I didn't give this "Excellent" significance because the empirical results, while good, are not spectacular.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, a very nice paper. This seems like the right way to build a memory for NLP.

What would make this a spectacular paper:
1. How does this scale when the input memory gets very large? It's unclear whether this can scale to answer questions on Wikipedia articles (or, even more ambitiously, over all of Wikipedia).
2. bAbI is a good first step, but it's still pretty limited. Incremental improvements on sentiment are also nice, but not spectacular.

What would make this amazing is to show that this sort of memory network can solve a problem that was previously thought too difficult for NNs (and solve it better than other techniques). Entailment? QA?
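To make the mechanism summarized in Review #2 more concrete -- multiple attention-weighted passes over the encoded input, each producing an episode that updates a memory vector -- here is a minimal NumPy sketch. The attention features, the shared GRU weights, the number of passes, and all dimensions are illustrative stand-ins, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(x, h, W, U, b):
    """One standard GRU update; a simplified stand-in for the paper's gated units."""
    z = 1.0 / (1.0 + np.exp(-(W[0] @ x + U[0] @ h + b[0])))   # update gate
    r = 1.0 / (1.0 + np.exp(-(W[1] @ x + U[1] @ h + b[1])))   # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])       # candidate state
    return z * h + (1.0 - z) * h_tilde

d = 8                                  # hidden size (illustrative)
facts = rng.normal(size=(5, d))        # encoded input "facts" c_1 .. c_T
question = rng.normal(size=d)          # encoded question vector q

W = rng.normal(scale=0.1, size=(3, d, d))
U = rng.normal(scale=0.1, size=(3, d, d))
b = np.zeros((3, d))
w_att = rng.normal(scale=0.1, size=3 * d)   # toy linear attention scorer

memory = question.copy()               # the memory starts from the question
num_passes = 3                         # number of episodic passes (hyperparameter)

for _ in range(num_passes):
    # Attention: score each fact against the question and the current memory.
    scores = np.array([
        w_att @ np.concatenate([c * question, c * memory, np.abs(c - memory)])
        for c in facts
    ])
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()               # softmax over facts

    # Episode: run a gate-weighted recurrence over the facts.
    episode = np.zeros(d)
    for g, c in zip(gates, facts):
        episode = g * gru_step(c, episode, W, U, b) + (1.0 - g) * episode

    # Memory update: fold the new episode into the memory before the next pass.
    memory = gru_step(episode, memory, W, U, b)

print("final memory vector:", np.round(memory, 3))
```

The key design choice the reviews discuss is visible here: the attention gates are recomputed on every pass because they depend on the current memory, so later passes can attend to facts that only become relevant once earlier evidence has been absorbed.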
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a memory architecture, DMN, for question answering, consisting of: 1) an input encoder, 2) a question encoder, 3) an episodic memory module, and 4) an output decoder. The main contribution of the paper is in 3), as 1), 2), and 4) are not new. The argument behind the episodic memory module is that the useful parts of the input that need to be attended to may not be clear in a single pass (read: a forward LSTM/GRU), and hence the "memory" needs to be revised by making multiple passes over the data. This is a reasonable argument, but it is not clear to what extent -- more on that later. The episodic memory module basically computes a memory vector by going over the input multiple times -- each iteration consists of one recurrent net over episodic vectors (with attention over the encoder LSTM), and the memory vectors are also modeled with RNNs which are stacked layerwise (and not temporally). The authors present experiments showing that DMNs are competitive with MemNNs (Weston et al.) on an artificial QA task, and obtain state-of-the-art results on sentiment detection. The authors also present some analysis to argue that the episodic memory component is indeed crucial to their model.

Clarity - Justification:
Summary below.

Significance - Justification:
Below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Pros:
1) The paper presents an interesting model for memory architectures which should be showcased.
2) The empirical analysis of episodic memory sharpening the attention focus is nice.

Cons:
1) Clarity is a big issue with this paper. The first paragraph of the introduction (any task can be modeled as QA) is quite misleading. How does it bear on anything in the paper except for the task of QA itself? Any problem in the world can be posed as a natural language question. I also find the description of the episodic memory module quite confusing and hard to follow. What is the motivation for modeling the memory module the way the authors did? Why an RNN over episodes and then a "stacked" RNN over memory? As a contrast, the closest work to this paper -- MemNNs -- is far simpler and more easily understood. The authors should ideally cut down space on some other parts and spend more time detailing the memory module.
2) Comparison with MemNNs -- I understand that MemNNs treat sentences independently, unlike DMNs. It would be nice to have some examples where this gives an advantage to DMNs (also because DMNs seem to be doing worse on QA with "three supporting facts", where I'd expect episodic memory to help).
3) I am not sure I quite see how the memory aspect is crucial for POS tagging.
4) Minor: No comparison with the approach of Hermann et al. -- why not just compare with a simple model that first encodes the question, and then uses that to analyze the input? I am willing to believe that DMNs are more expressive than this, but a comparison would be nice.
5) Minor: No discussion of time complexity. How fast is it to run DMNs?

Suggestions:
1) I am not convinced as to why we need more than two passes over the input -- one to see the entire input, and then one to update the attention, as later parts of the input may change the weights of earlier parts.
Some analysis of this would be helpful.
2) Lines 334-338 are repeated and need to be omitted.

Overall, I think this is a good paper, but it could benefit greatly from better writing. I'm on the fence on this one, but I'd much rather have it resubmitted and have more impact than accepted in its current form.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new architecture called Dynamic Memory Networks (DMN) for NLP tasks, e.g., question answering. The paper discusses the novelty of the architecture, and outlines experiments performed on different NLP tasks to show the efficacy of the approach.

Clarity - Justification:
The problem is clearly formulated, and the model description and experimental analysis are clearly written. Overall, the paper is a very good read.

Significance - Justification:
The DMN model proposed here is quite novel -- the paper carefully motivates the different components of the architecture, and shows how each component is necessary to solve the underlying NLP task.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Here are the main strengths of the paper:
1) The DMN architecture is a novel architecture for question answering and other NLP tasks.
2) The experiments show the effectiveness of this method compared to recent models, e.g., MemNNs.
3) The qualitative analysis of the attention component of the model is very insightful.

One suggestion for improvement -- the authors talk about DMN "significantly" outperforming MemNNs ... it was not clear how significance was measured, e.g., by a t-test over repeated trials? (A sketch of such a test follows the reviews.)

Overall, DMN seems to be a promising model, and the architecture appears to be appropriate for other NLP tasks.

=====
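On Review #4's question about how significance was measured: one common form of the check the reviewer suggests is a paired t-test over accuracies from repeated runs of the two models on the same splits. The sketch below only illustrates that procedure; the accuracy numbers are hypothetical placeholders, not results reported in the paper.

```python
# Illustrative paired t-test over repeated trials, as suggested by Review #4.
# The accuracies below are hypothetical placeholders, not reported results.
from scipy import stats

dmn_acc   = [93.1, 92.8, 93.4, 93.0, 92.9]   # e.g., 5 runs of model A
memnn_acc = [92.2, 92.5, 92.1, 92.6, 92.3]   # matched runs of model B

t_stat, p_value = stats.ttest_rel(dmn_acc, memnn_acc)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```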