We really appreciate the insightful and helpful comments from all reviewers. Below are our answers to the key questions and comments.

To Reviewer_1:

[Q1]: The contribution is limited, with only minor modifications to the input module.
[A1]: In this paper we propose novel modules for two crucial components of the DMN framework: input representation and episodic memory. These modules achieve significant improvements and state-of-the-art results on both textual and visual question answering tasks. Our work also differs in that the original DMN framework required supervision of supporting facts during training, which is too restrictive to generalize to other modalities. Our framework does not need this supervision, making it broadly applicable to tasks such as visual question answering (VQA).

[Q2]: What is the meaning of using a single-dimensional GRU to encode image features, in which order, and wouldn't a 2D GRU be a better fit?
[A2]: In Lines 411-413 and Figure 3 we describe the snake-like traversal of the image, which defines the order in which we arrange the image features; the goal is to propagate information between neighboring patches. We will clarify this further. We are also interested in exploring a 2D GRU, which would better propagate information between patches, in an extended version of the paper.

[Q3]: The experimental design for bAbI is questionable.
[Q3_a]: Why not show improvements using input fusion with supporting fact labels?
[A3_a]: A key goal of this paper was extending the DMN so that it does not require supporting fact labels. Supporting fact labels are too restrictive, a shortcoming also noted in the end-to-end (E2E) memory network paper.
[Q3_b]: The ODMN result on bAbI is questionable: the DMN without the fusion module is much worse than E2E, yet the original DMN paper reports better performance than E2E. Why?
[A3_b]: The original DMN paper compared against Memory Networks on bAbI-1k with supporting fact labels, not on bAbI-10k without them, and it did not compare against the E2E model. Additionally, since the ODMN is a baseline in our paper and, like all models in our experiments, does not use supporting fact labels, we would expect it to perform worse than in the original paper.

[Q4]: The DAQUAR-ALL result should be compared with previous results.
[A4]: Agreed; we will add that comparison in the camera-ready version.

[Q5]: Why is the update gate related to the attention probability?
[A5]: The update gate in the GRU does not have access to the question or the episode memory. By replacing the update gate with the attention gate, the GRU can benefit from that added context. We will improve the explanation in that section and give clearer reasons for the design decision; a brief sketch of the modification is appended at the end of this response.

To Reviewer_3:

[Q1]: The E2E memory network model seems highly related to this paper, and it is worth making the differences as clear as possible.
[A1]: Extending the comparison between the models is indeed a good idea, and we will expand it in the revision.

[Q2]: The snake-like image encoding is inelegant, loses vertical spatial information, and is not evaluated against simpler baselines, such as averaging all 196 512-D vectors.
[A2]: We will add the baseline you mention in the revision. Note that our visualizations show that a surprisingly large amount of spatial information is still captured in this setup; the sketch below illustrates the ordering.
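To make the snake-like ordering concrete, here is a minimal illustrative sketch (function and variable names are for illustration only; in the model the patch features are additionally projected before being fed to the input fusion GRU): every other row of the patch grid is reversed so that consecutive elements of the resulting sequence are neighboring image patches.

```python
import numpy as np

def snake_order(grid):
    """Arrange a (rows, cols, dim) grid of patch features into a
    (rows * cols, dim) sequence, reversing every other row so that
    adjacent sequence elements correspond to neighboring patches."""
    rows = [row if r % 2 == 0 else row[::-1] for r, row in enumerate(grid)]
    return np.concatenate(rows, axis=0)

patches = np.random.randn(14, 14, 512)   # e.g. 196 local CNN feature vectors
sequence = snake_order(patches)          # shape (196, 512), snake-like order
```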
[Q3]: More analysis, such as the attempt to debug the model in Table 1, could make the paper stronger.
[A3]: We will try to add more analysis along the lines of Table 1, space permitting.

To Reviewer_4:

[Q1]: Study the individual tasks in more detail, e.g., why DMN4 is much better than DMN3 on QA3, and why DMN4 is better than E2E on tasks 17 & 18.
[A1]: We will add additional intuition to the paper, space permitting.

[Q2]: More explanation of the major factors behind the good performance on VQA; does the choice of the snake-like memory ordering impact performance?
[A2]: We will add more discussion of the reasons for the good model performance on VQA in our revision. We are also planning to add more baselines to better show the improvement offered by the snake-like image traversal.

[Q3]: Clarity improvements, such as the citations for bAbI-10k and Xavier initialization.
[A3]: We will update and improve the clarity of the paper in the revision.
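Appended note for Reviewer_1's Q5 (a schematic sketch using standard GRU notation; the exact equations will appear in the revised section). The final state update of a standard GRU is

    h_i = u_i \circ \tilde{h}_i + (1 - u_i) \circ h_{i-1},

where the update gate u_i is computed only from the current input and the previous hidden state. In our attention-based GRU we replace u_i with the attention gate g_i, which is computed from the fact, the question, and the previous episode memory:

    h_i = g_i \circ \tilde{h}_i + (1 - g_i) \circ h_{i-1}.

This lets the question and the episode memory directly control how much of each fact is incorporated into the episode, which is the added context referred to in [A5].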