Paper ID: 1067
Title: Dynamic Memory Networks for Visual and Textual Question Answering

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an evolution of memory networks and dynamic memory networks. The main contribution is a new parameterization of the memory module, which is shown to lead to very good results on VQA (and on the bAbI tasks, though those are less meaningful).

Clarity - Justification:
The paper is clear and easy to follow. Minor comments:
* bAbI-10k is introduced in line 117 without any citation (the citation comes much later in the paper). It would be better to have the citation there.
* There is a contradiction: line 215 says that the sentences are encoded by a GRU, whereas line 254 states that a positional encoding scheme was finally chosen. This should be fixed.
* Line 633 refers to the "Xavier initialization". What is this?

Significance - Justification:
The search for the best-performing architecture for memory-augmented neural networks is an important one these days, and this paper continues that exploration. Its main strength is the excellent results on VQA; without such results on a real-data task, the significance would be much lower.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is rather clear and the architecture is well explained. Are there any plans to release the code?

The experimental part could be improved by providing more insight into the model's performance:
* The bAbI tasks were designed to allow an easier, more detailed interpretation of results, but no such analysis appears in the paper. The performance of DMN2, DMN3, and DMN4 on the bAbI tasks is nearly identical if one looks only at the mean error, yet conclusions could be drawn by studying the individual tasks in more detail. For instance:
  - DMN4 is much better than DMN3 on QA3 (3 supporting facts): 1.1 vs 9.2. Why is that? Does it mean that a GRU memory update cannot deal with more than 2 supporting facts (DMN2 is even worse there)?
  - DMN4 seems especially better than E2E on tasks 17 and 18. What causes this?
* It would be nice to explain the major factors behind the good performance on VQA. Which choices end up being the most important? Does the choice of the snake-like memory ordering impact the performance?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors describe an improved Dynamic Memory Network (DMN) architecture, designed to process a 1D sequence of facts and a query, and to produce an answer. The model consists of one neural network with 4 modules (Input, Question, Memory, Answer). Experimental evaluation is performed on the bAbI-10k and Visual Q&A (VQA) datasets, and in each case the authors show performance competitive with or slightly above the state of the art. Qualitative visualizations of attention over the image in the VQA task look sensible.

Clarity - Justification:
Well-written paper, relatively easy to follow, well structured.

Significance - Justification:
I would categorize this as an incremental advance, given prior DMN/E2E work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a solid and easy-to-read paper. A few components of the DMN are slightly enhanced, and the model is also evaluated on VQA using a straightforward and somewhat inelegant extension (see more details below). However, the new architecture shows improvements over previous versions of the model and over the state of the art. A few more comments:
- The End-To-End Memory Networks model seems highly related and is cited often throughout this paper. It might be worth making the differences as clear as possible. For example, in the Related Work section this work is mentioned only briefly, in a single sentence in passing. Similarly, sentences such as “However, unlike the DMN their input module computes sentence representations independently and hence cannot easily be used for other tasks such as sequence labeling” are not easy to understand right away, and should be made more explicit and possibly expanded on.
- I find the snake-like encoding of 2-dimensional objects (images) proposed in this paper highly inelegant. A lot of spatial information is lost vertically, though I see that the authors acknowledge this. This is problematic because the encoding is cited as one of the contributions of the paper, and it is also not evaluated against simpler baselines, such as one that merges all 196 512-D vectors into a single 512-D vector using an average (see the sketch after this list).
- In general, I wish there were more to take away from this paper beyond an end-to-end model that gets good numbers in the end. The authors make some attempt to debug their model in Table 1, and more analysis like this would make the paper stronger.
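To make that missing baseline concrete, here is a minimal NumPy sketch of both the snake-like ordering as I understand it and the averaging baseline; the shapes follow the 512x14x14 features described in the paper, but all variable names are my own:

    import numpy as np

    feats = np.random.randn(512, 14, 14)   # stand-in for the C x H x W VGG feature map
    grid = feats.reshape(512, -1).T        # 196 local vectors, each 512-D

    # Snake-like ordering as I understand it: traverse the 14x14 grid row by
    # row, reversing every other row so that consecutive elements of the
    # resulting sequence remain horizontally adjacent in the image.
    rows = feats.transpose(1, 2, 0)        # H x W x C
    snake = np.concatenate([row if i % 2 == 0 else row[::-1]
                            for i, row in enumerate(rows)])   # (196, 512) sequence

    # The trivial baseline: average the 196 vectors into one 512-D vector.
    avg = grid.mean(axis=0)                # (512,)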
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose to use the DMN to address question answering, including textual QA and visual QA. They introduce a new input module for the DMN that encodes the inputs with a bidirectional GRU/LSTM instead of encoding each input sentence separately; the bidirectional encoding helps obtain better representations. Experiments on both NLP QA and VQA tasks show some promising results.
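For concreteness, a minimal PyTorch sketch of what I take this bidirectional input fusion to be; the sizes and names are mine, not the authors' code:

    import torch
    import torch.nn as nn

    dim = 80                                  # illustrative hidden size, not from the paper
    fusion = nn.GRU(input_size=dim, hidden_size=dim, bidirectional=True)

    facts = torch.randn(20, 1, dim)           # (num_facts, batch, dim): per-sentence encodings
    out, _ = fusion(facts)                    # (20, 1, 2*dim): forward and backward states
    fwd, bwd = out[..., :dim], out[..., dim:]
    fused = fwd + bwd                         # each fact now carries context from both
                                              # earlier and later facts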
Clarity - Justification:
The presentation of the paper is clear and easy to follow, and the authors clearly introduce their proposed modifications to the previous DMN for both NLP and vision inputs. Extensive experiments are conducted and show some promising results. However, some of the experiment design and comparisons are not very clearly explained; see the detailed comments below.

Significance - Justification:
Clear as the paper is, its technical contribution is very limited. Although the paper reports some promising experimental results, its main contribution is to apply an existing model/method to some problems with minor modifications to the input module. That is not enough technical contribution for a conference that emphasizes methodology.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. The bidirectional encoding of text is well justified, while the encoding of image features is not. The input image feature is 512x14x14, where the 14x14 grid carries the positional information of the image. What is the meaning of using a one-dimensional GRU to encode the image features? In which order would you sort them: row by row or column by column? A two-dimensional GRU would be a better fit.
2. The experiment design on bAbI is questionable. To show the advantage of the fusion module, why don't the authors run the bAbI tasks with supporting-fact labels and show improvement over the results in the previous DMN paper? Instead, they experiment without attention-signal supervision, which is questionable. The ODMN result on bAbI is highly doubtful: its error of 11.8 is significantly higher than the E2E error rate of 4.2. Does this mean that the DMN without the fusion module is much worse than E2E? If so, why does the original DMN paper perform better than E2E?
3. The DAQUAR-ALL result should be compared with previous results.
4. Why is the update gate related to the attention probability? (If I understand the design correctly, the attention gate g_i replaces the GRU update gate, giving h_i = g_i ∘ h̃_i + (1 − g_i) ∘ h_{i−1}, where h̃_i is the candidate state.) The authors should provide more reasons for this design.

Overall, I think there is not enough technical contribution in the paper, and there are some doubts about the design and experimental results, so I would rate it as weak reject.
=====