Paper ID: 927
Title: Dueling Network Architectures for Deep Reinforcement Learning

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

In this paper, the authors propose an alternative architecture and associated learning scheme for Deep Q-Networks (DQNs), based on the idea of separately modeling the state value and the actions' advantages. When applied to the Arcade Learning Environment (Atari) benchmark, this technique significantly advances the state of the art.

Clarity - Justification:

The paper is very well written and easy to understand. The appendix shows the algorithm and detailed experimental results. I only have the following minor remarks:

- The introduction could briefly introduce the concept of action advantage, for those not familiar with it, instead of having to wait until eq. 3 (see the note at the end of this review).
- l.223: "The agent seeks maximize": missing "to".
- l.584: "consists 64 3x3 filters": missing "of".
- In the experiments the text says "up to 30 no-op actions", but Table 1 just says "30 no-ops", and so does Table 2 in the appendix. Please keep "up to" everywhere if that is the correct version.
- Table 1 is referenced in l.656 right after the definition of the improvement over the baseline. This is confusing because, as far as I can tell, Table 1 only compares to human performance and thus uses a different formula. Please make sure it is clear which formula is used in Table 1. Also, l.656 should say to look only at the left column, since the right column is explained later in the text.

Significance - Justification:

This paper advances a hot area of research in a meaningful way. Although it is "just" a slight modification to the architecture and training of DQNs, the significant improvements it brings must definitely be shared with the community. I expect it will motivate further research into the optimal representation of Q-value approximators.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Other miscellaneous comments and questions:

- Is epsilon (l.514) really set to such a small value (0.001)? Since I guess the optimal policy is to go to the top-right corner as fast as possible, and there are only ~70 steps to get there (roughly 70 x 0.001 = 0.07 expected random actions per episode), isn't the behavior policy going to introduce only 0-1 random actions in most runs?
- l.521, just to be sure, this is a uniform sum? (Not weighted by state/action probabilities from the behavior policy, for instance.)
- Figures 4-5 show it does not always improve over the baseline. I realize this is expected given the large variety among games, but it would have been interesting to investigate whether games where performance decreases share some common patterns, to better understand potential flaws in the proposed method. Do you have any interesting insight or discussion to add on that topic?
- l.845, "the values for all other actions remain untouched": why would that be? A single update will change the internal state representation and thus modify predictions for all actions (in general).
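A note on terminology, related to the advantage comment above: in the standard formulation (which I believe is the one the paper uses), the advantage measures how much better an action is than acting according to the policy on average in that state,

    $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$, where $V^{\pi}(s) = E_{a \sim \pi(\cdot|s)}[Q^{\pi}(s,a)]$,

so the advantages are zero-mean under the policy's own action distribution.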
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

Deep Q-learning is a promising, emerging topic in machine learning. Using deep learning techniques to approximate the Q function, these methods have caught wide attention for beating humans at Atari games and, more recently (in combination with Monte Carlo tree search), beating a world champion human Go player.

This paper points out that until now the neural networks used to perform this Q function approximation have been standard neural networks (CNNs, MLPs, and LSTMs). The authors suggest that the task of Q function approximation can benefit from an architecture specially designed for reinforcement learning. Specifically, they propose an architecture whereby the model decomposes the Q function into a value function V(s) and advantage functions A(s,a) for each action in that state. They demonstrate the utility of this approach on the Atari game tasks.

For comparison to existing methods, they start with a convnet otherwise identical to that previously used for Atari game play. However, following the convolutional layers, they fork the network into two streams of fully connected feed-forward computation. One results in a scalar that represents V(s), and the other outputs A(s,a). These are combined to give the final output Q(s,a) = V(s) + A(s,a) - avg_a'(A(s,a')). This prevents the net from arbitrarily shifting value between V(s) and A(s,a).

Experimental results are shown both on a toy task called corridor and on the full set of Atari games. Additionally, the authors present an impressive visualization of the model's capabilities with a video, taking pains to post the video anonymously so as not to break anonymity. The novel approach outperforms the baseline on the toy task and on the majority of Atari games. The contribution is simple but effective, and the results are impressive.
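To make the aggregation step above concrete, a minimal sketch of how such a two-stream head could look (PyTorch-style; the layer sizes and module names here are illustrative assumptions, not taken from the paper):

    # Sketch of a dueling head; sizes are illustrative, not the paper's exact layers.
    import torch.nn as nn

    class DuelingHead(nn.Module):
        def __init__(self, in_features, num_actions):
            super().__init__()
            # Two streams built on top of the shared convolutional features.
            self.value = nn.Sequential(
                nn.Linear(in_features, 512), nn.ReLU(), nn.Linear(512, 1))            # V(s)
            self.advantage = nn.Sequential(
                nn.Linear(in_features, 512), nn.ReLU(), nn.Linear(512, num_actions))  # A(s, a)

        def forward(self, features):
            v = self.value(features)      # shape (batch, 1)
            a = self.advantage(features)  # shape (batch, num_actions)
            # Subtracting the mean advantage keeps V and A identifiable: adding a
            # constant to A and subtracting it from V would otherwise leave Q unchanged.
            return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)

Such a head would presumably replace the baseline network's single fully connected output stream, consuming the same flattened convolutional features.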
Clarity - Justification:

Overall, the paper is very well written. As someone knowledgeable but not expert on the latest developments in deep reinforcement learning, I found the presentation of prior work clear and enjoyable to read. The presentation of the results was clear, and the video demonstration was interesting (although I would like to know what action is being evaluated for saliency on the right half of the screen).

The one significant (but trivial to fix) complaint I have about the presentation is the title. What precisely is meant by "Dueling network architecture"? Specifically, the word "duel" refers to combat between two individuals. The title of this paper appeared to suggest that it was about reinforcement learners pitted against each other (as they are in Go). But it appears from my reading of this paper that the networks are playing single-player games (pitted against the game, but not against another net). Thus there is no dueling between reinforcement learners, is there? If I'm not mistaken, this title might be misguided. When the architecture is presented, it is composed of two streams, in contrast to the single-stream network. Is it that you mean DUAL-stream network architectures? If so, then "dueling" is an incorrect and misleading name for this contribution. You would then mean "dual" (as opposed to "single").

Significance - Justification:

To my knowledge, this is a novel contribution. There is nothing particularly complicated about the machinery, but I view this as a virtue rather than a vice. It has been demonstrated many times that even slightly better approximations to the Q function yield significant improvements in reinforcement learners, so reinforcement-learning-specific architectures to improve the Q function approximation seem a natural approach to take. The improved results in comparison to prior work are significant, achieving state-of-the-art performance on many benchmarks in a task that (while not intrinsically important or impactful) is currently the most studied test-bed in this burgeoning topic.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Overall, this is a nice, solid contribution. The writing is clear and unpretentious, the prior work is informative, and the results are impressive. I think it is a sufficiently novel and effective contribution to warrant inclusion in this year's ICML proceedings. While I posted the bulk of my specific feedback above, I have a few additional questions/comments:

1) Can you elaborate on the role of the discount factor? In a video game setting, does one really have a tradeoff of current vs. future points? If not, does that mean that you always treat the discount rate as 1.0?

2) Why is performance worse on some games? Is there anything that these games have in common? Can you provide some insight?

3) This model appears to have a little bit of extra capacity compared to the single-stream model. How much of a benefit could you get just by increasing the capacity of the single-stream model with an extra hidden layer? Can you separate these sources of improvement?

Besides the serious question I have about the choice of title, I came across several typos in the course of reading:

- 256: "the the" - duplicate.
- 354: "collision is eminent" - you mean "imminent", unless this is a particularly famous or important collision.
- 649: "For example, an agent that achieves 2% human performance should not be interpreted as two times better when the baseline agent achieves 1% performance" - I think you mean "better than", not "better".

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):

Inspired by work on reinforcement learning algorithms using advantage functions, the paper proposes a neural network that de-couples the state value function from the state-action value function while sharing features between the two functions. The paper uses the new architecture to demonstrate new state-of-the-art results on playing Atari games, along with a thorough comparison to previous models.

Clarity - Justification:

The paper is well written and very easy to follow. There are almost no typos or mistakes.

Significance - Justification:

Although, as the paper points out, the general idea of learning an advantage function and de-coupling the learning problem into learning two function approximators has been investigated previously in the literature, the paper still provides a substantial contribution to the field. The main contribution of the paper is to demonstrate that the idea can successfully be combined with neural networks and that it carries over to the Atari games domain, while providing technical details of the implementation, which should make it easier to transfer the idea to other domains.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Can you discuss why the new model does poorly on the bottom set of games in Figure 4 and Figure 5? Games like "Kangaroo", "Freeway" and "Seaquest" all require long-term planning at times. Is this related to the poor performance of the new model? Could it be that during training the model converges quickly to a policy with short-term strategies, and afterwards has difficulty exploring longer-term strategies?
In this task, the actions are discrete. Have you thought about how to modify this network to accommodate continuous actions?

Other comments:

- Line 223: "seeks" -> "seeks to"
- Line 255: "measures the how good" -> "measures how good"
- Line 256: "the the" -> "the"
- Line 268: Rephrase. Perhaps remove "at iteration $i$" and add after the equation: "where $i$ is the iteration and"
- Line 286: Equal sign should be a proportional sign.
- Line 299: "instead" -> "instead of"
- Lines 307-312: Doesn't experience replay also reduce bias, since the samples are closer to being "independent"? Isn't this more important than the reduction in variance?
- Line 332: "in" -> "is"
- Lines 385-389: This is redundant. Consider removing these lines.
- Line 441: Remove "by a constant".
- Lines 637-638: Explain a bit more what "30 no-op" means.
- Line 862: "difference in scales can lead to small" -> "difference in scales and small"

=====