We thank all reviewers for their constructive reviews.

[R6] Thank you for pointing out other related papers. We will take a closer look at them and discuss them in the related work section. Some brief initial comments: The 3D Maze task in Asynchronous DQN (Mnih et al.) is similar to the Single Goal task in our paper. However, unlike their work, we focused on designing tasks that allow us to systematically study the interaction among partial observability, active perception, and external memory in different neural network architectures, as well as generalization across map size and topology. Zhang et al. will be included in the "model-free RL for POMDPs" paragraph, and Levine et al. will be included as one of the first successful applications of deep RL to robotics. We will also add Watter et al., Lange et al., and related work that introduces deep neural networks for model learning and feature learning in RL.

[R7] We will provide additional motivation for our architectures in the revision. One motivation is that the importance of remembering a past event depends on the current context. For example, in our I-Maze task, the color of the indicator matters only when the agent has to decide which way to turn at the end of the corridor. An ideal architecture should therefore be able to "conditionally" retrieve an important past event based on the current context, which we call context-dependent memory retrieval. Existing deep RL architectures (DQN, DRQN) lack this ability. DQN uses a fixed number of past observations, which makes it hard to generalize to large environments with deep partial observability. Although DRQN can take a variable number of observations into account, it has to retain all potentially important information in its LSTM memory cells and use all of that memory to make a decision, so memory retrieval is not conditioned on the current context. MQN (our architecture) implements context-dependent memory retrieval by storing recent observations in an external memory and retrieving only some of them based on a context constructed from the current observation. RMQN retrieves from memory using a context that its recurrent controller builds from past observations as well as the current one. Finally, FRMQN additionally takes the previously retrieved memory into account, so the temporal context used for retrieval reflects both the observation history and what has already been retrieved. Thus, our three architectures, MQN, RMQN, and FRMQN, use increasingly richer information from past observations, through recurrent connections, to construct the temporal context used for memory retrieval (a small illustrative sketch of this retrieval step is given at the end of this response). We will revise the architecture section so that this motivation is clearly described.

[R8] We consider both the construction of RL tasks in the Minecraft domain and the new deep RL architectures to be our main contributions. We chose Minecraft because it is a much more flexible domain than Atari and allowed us to design tasks that systematically explore the interaction of partial observability, active perception, and external memory in neural network architectures, as well as extrapolation and interpolation across map sizes and topologies. Atari games do not offer this kind of parametric flexibility. Nevertheless, we will explore the performance of our architectures in other domains, including Atari games, in future work. We observed that the performance of our architectures (MQN, RMQN, FRMQN) improves as the external memory size is increased during evaluation on random mazes.
However, there was no further improvement beyond a memory size of 30. We will further investigate the influence of memory size and include this discussion in the supplementary material.
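
To make the context-dependent retrieval step described in the response to R7 concrete, below is a minimal NumPy sketch of a soft attention read over an external memory of encoded past observations. The variable names, dimensions, and the use of NumPy are illustrative assumptions rather than the exact implementation in the paper; in MQN the context vector comes from the current observation, in RMQN from a recurrent summary of the observation history, and in FRMQN the previously retrieved memory is also fed back into that recurrent controller.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def retrieve(memory_keys, memory_values, context):
        # memory_keys:   (M, d) one key per stored (encoded) past observation
        # memory_values: (M, d) one value per stored past observation
        # context:       (d,)   context vector built from the current observation
        #                       (MQN) or from a recurrent summary (RMQN/FRMQN)
        attention = softmax(memory_keys @ context)   # (M,) weights over memory slots
        retrieved = attention @ memory_values        # (d,) weighted read-out
        return retrieved, attention

    # Toy usage with illustrative sizes: 30 memory slots, 16-dimensional embeddings.
    rng = np.random.default_rng(0)
    keys, values = rng.normal(size=(30, 16)), rng.normal(size=(30, 16))
    context = rng.normal(size=16)
    read, weights = retrieve(keys, values, context)

Because the attention weights depend on the context vector, the same memory contents can be read out differently in different situations, which is what we mean by retrieval being conditioned on the current context.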