We thank the reviewers for their feedback. The reviewers raised some doubts regarding the significance and novelty of our work; let us address these issues. The superior performance of deep learning algorithms has been observed in many supervised, unsupervised, and, more recently, Reinforcement Learning (RL) benchmarks. However, when a deep network does not perform well, it is hard to understand the cause and even harder to find ways to improve it. In Deep RL in particular, we lack the tools needed to analyze what an agent has learned and are therefore left with black-box testing. The goal of this paper was to gray the black box: to better understand why DQNs work well in practice and to suggest a methodology for interpreting the learned policies. Let us provide more details.

Understanding. One of the main reasons to use deep networks in RL problems with high-dimensional inputs is their potential to overcome the curse of dimensionality. While the ability of deep networks to automatically learn features for a given problem has been observed in supervised and unsupervised learning settings, we are unaware of works that explain the temporal structure of the data learned by DRL agents. In this work we showed that DQNs learn temporal abstractions of the state space, such as hierarchical state aggregation and options. Temporal abstractions were previously known to the RL community mostly as manual tools for tackling the curse of dimensionality; here, we observed that a DQN finds such abstractions automatically. We therefore believe that our analysis explains the success of DRL from a reinforcement learning research perspective.

Interpretability. We believe that interpretation of policies learned by DRL agents is of particular importance in its own right. First, it can help in the debugging process by giving the designer a qualitative understanding of the learned policies. Second, there is growing interest in applying DRL solutions to real-world problems such as autonomous driving and medicine; we believe that before we can reach that goal we will have to gain greater confidence in what the agent is learning. Lastly, understanding what is learned by DRL agents can help designers develop better algorithms by suggesting solutions that address policy shortcomings; however, this is beyond the scope of this work (Assigned Reviewer 1).

Novelty. We started our analysis by visualizing the state space using t-SNE maps similar to those produced by Mnih et al. (2015). We then suggested a methodology for analyzing the representation learned by the agents using this visualization, by manually clustering the states and filtering the map with different measures (e.g., Figure 3 in the supplementary material). While we did not introduce a new data-visualization technique, we did identify hierarchical representations and showed the ability to interpret learned policies (Assigned Reviewer 1). We believe that both are novel contributions that will help other researchers gain a better understanding of DQNs and how to train them. Our analysis is indeed manual and problem-specific. We believe that designing universal algorithms is of great importance; however, when an algorithm is not performing as expected (e.g., Seaquest), it is important to understand the reasons (Assigned Reviewers 1&4).

Response to individual reviewers:
Reward clipping and target network (Assigned Reviewer 1) – We accept the comments regarding the reward clipping and setting the target to zero when the next state is terminal. We will correct the text in the paper.
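For concreteness, the corrected text will describe the standard one-step DQN target, with rewards clipped to [-1, 1] and the bootstrap term dropped when the next state is terminal. The NumPy sketch below is illustrative only (it is not our training code, and the function and variable names are placeholders):

```python
import numpy as np

def dqn_targets(rewards, next_q_values, terminals, gamma=0.99):
    """One-step DQN targets with reward clipping and terminal masking (illustrative sketch).

    rewards:       (batch,) raw environment rewards
    next_q_values: (batch, n_actions) target-network Q-values for the next states
    terminals:     (batch,) 1.0 if the next state is terminal, else 0.0
    """
    clipped_r = np.clip(rewards, -1.0, 1.0)        # reward clipping, as in Mnih et al. (2015)
    bootstrap = next_q_values.max(axis=1)          # max_a' Q_target(s', a')
    # When the next state is terminal, the bootstrap term is dropped,
    # so the target reduces to the clipped reward alone.
    return clipped_r + gamma * (1.0 - terminals) * bootstrap
```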
t-SNE colors (Assigned Reviewers 4&5) – We agree that a color bar indicating the high and low values would better explain the t-SNE visualization. In all of our experiments, red indicates high values while blue indicates low ones. We will correct the figures in the paper.

t-SNE (Assigned Reviewer 5) – We thank the reviewer for the suggestions to improve the clarity of the t-SNE visualization method. Using PCA as a preprocessing stage before applying t-SNE was suggested by the authors of the t-SNE paper as a way to reduce t-SNE run time, and it was used in their original experiments; please refer to van der Maaten & Hinton (2008) for more details.
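To illustrate the kind of pipeline we refer to (PCA preprocessing, t-SNE, and a measure-colored map with an explicit color bar), a minimal scikit-learn sketch is given below. This is an assumption-laden illustration rather than our exact implementation: the activation matrix, the per-state measure, and the choices of 50 PCA components and perplexity 30 are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder data: in our setting these would be last-hidden-layer DQN activations
# and a per-state measure such as max_a Q(s, a).
rng = np.random.default_rng(0)
activations = rng.standard_normal((5000, 512))
values = rng.standard_normal(5000)

# PCA first, to cut t-SNE run time (as suggested by van der Maaten & Hinton, 2008).
reduced = PCA(n_components=50).fit_transform(activations)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(reduced)

# Color the map by the chosen measure; red = high, blue = low, with an explicit color bar.
plt.scatter(embedding[:, 0], embedding[:, 1], c=values, cmap="jet", s=2)
plt.colorbar(label="state value")
plt.show()
```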