Paper ID: 885
Title: Asynchronous Methods for Deep Reinforcement Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper explores the use of parallelization and asynchronous updates to accelerate the convergence of deep Q-learning on the well-known Atari games. It shows that asynchronous versions of standard algorithms (Q-learning, SARSA, etc.) seem to improve performance in some games (but not all), and an asynchronous version of advantage learning seems to provide the best overall improvement. No theoretical analysis is provided, and the paper is largely an ad-hoc experimental study. Ultimately, the work is somewhat heuristic, resting on assumptions such as that "multiple actor-learners are likely to be exploring different parts of the environment".

Clarity - Justification: The paper is written well, on the whole, although there is an excessive use of small graphs with colored curves, which would make this paper impossible to read in black-and-white print. The algorithms are described clearly, and the ideas are fairly simple. There is, unfortunately, a complete lack of theoretical analysis, which separates this work from the many studies of asynchronous updates both in the MDP dynamic programming/RL literature (Bertsekas, Tsitsiklis et al.) and in parallel gradient methods (Hogwild!).

Significance - Justification: Ever since the original DQN paper was published, which combined the simplest possible Q-learning RL algorithm with a very simple two-layer convolutional neural net trained by standard backpropagation to learn Atari games, it was obvious that many possible improvements could be made to this approach. Over the past year or so, not surprisingly, there have been dozens of such submissions, some interesting and others not. This paper explores the use of parallel asynchronous updates to improve the convergence of DQN. It has long been known that dynamic programming can be accelerated by asynchronous methods. A 2010 paper by Bertsekas and Yu (see http://www.mit.edu/~dimitrib/allerton_api_final.pdf) is a good example of everything this paper is not: it has a detailed convergence analysis showing the improvement of asynchronous updates over synchronous updates, not on some chosen set of example MDPs, but on arbitrary problems. This is a long line of research, dating back to Bertsekas (1982), cited in this paper. Other important work by Tsitsiklis was on asynchronous variants of Q-learning. Even in the case of parallel stochastic gradient updates, work like Hogwild!, cited here, is based not just on experiments but on a clear and solid theoretical analysis. The lack of any such analysis makes this paper's significance weak.

The significance of this paper is therefore largely the empirical study of whether this well-known idea can be implemented for deep RL and how much improvement it provides. Not surprisingly, the results are somewhat mixed. In Breakout, there is a clear improvement from all asynchronous methods; in Pong, there is not much improvement. Among the various methods studied, the advantage method appears to be the best, but it seems that two factors are being conflated here and, sadly, not teased apart. There are strong reasons to believe that advantage representations would work well in Atari whether or not the updates are done in parallel through asynchronous means. This paper does not separate out these factors, making it hard to know how much improvement would occur using synchronous advantage updates.
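For concreteness, the two ingredients that appear to be conflated are, roughly (my paraphrase of the paper's formulation, with $\theta^-$ denoting target-network parameters and $\theta_v$ the value-function parameters):

    one-step Q-learning target:     $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$

    n-step advantage actor-critic:  $\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$, where $\hat{A}_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$

Nothing in the second form requires asynchrony, which is why the two factors would need to be separated empirically.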
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): See above.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The authors present a new approach to deep RL: instead of decorrelating temporal data via an experience database (the traditional approach), the authors decorrelate it by running multiple independent agents in parallel and then looking at the data distribution across all agents at the same timestep. The authors turn this key idea into asynchronous, parallel versions of four RL algorithms, and compare against the state-of-the-art approaches. Experimental results suggest the approach is stable, robust, simple to implement, and computationally efficient, and that it performs better than other methods.

Clarity - Justification: This paper was well written. It was easy to follow and did a good job of motivating the key problem, detailing the proposed solution, and presenting an extensive empirical evaluation with informative, tidy figures.

Significance - Justification: This paper sits somewhere between "Above average" and "Excellent." I hesitate to call this a "major breakthrough", primarily because it is mostly a natural combination of existing ideas (albeit an effective one!). On the other hand, I suspect that this will become the new state-of-the-art approach to deep RL. The advantages of the proposed approach cannot be overstated: the method opens up the entire on-policy toolkit to the world of deep RL; it is more computationally efficient than other methods; and the ability to incorporate mixed exploration strategies in a single policy rollout seems like a powerful weapon.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): This is an outstanding paper. It presents a clear, effective idea and tests it thoroughly. The authors go out of their way to compare to other state-of-the-art methods, and they highlight many interesting directions for future work. I don't really have any criticisms; I can hardly ask for more than what this paper has given. I actually admire the authors for showing self-restraint: it would have been all too easy to do a cursory comparison to other methods and then move on to an investigation of new ideas, such as different network topologies, complex exploration policies, etc. Instead, this paper has laid convincing groundwork from which all of those explorations will flow. I consider that good science. A strong accept.

Minor comments: I think it would be helpful to standardize the scaling of the y-axis in each column of Figures 3 and 4 (i.e., the same reward scale for each game); that would make it easier to compare the different algorithms visually. Small typo in Section 5.1.4: "Picking up the agent led to" -> "Picking up the apple leads to".
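To make the decorrelation idea summarized above concrete, here is a minimal sketch (not the authors' code: the environment, the linear value function, and all constants are toy placeholders, and Python threads are used only to illustrate the structure, not to claim real parallel speedups):

    import threading
    import numpy as np

    N_THREADS = 4        # number of parallel actor-learners (illustrative)
    N_FEATURES = 8       # size of a toy linear value function
    N_STEPS = 1000       # environment steps per actor

    shared_theta = np.zeros(N_FEATURES)   # parameters shared by all threads, updated lock-free

    def toy_env_step(state, rng):
        # Placeholder environment: random next-state features and a sparse reward.
        next_state = rng.standard_normal(N_FEATURES)
        reward = float(rng.random() < 0.1)
        return next_state, reward

    def actor_learner(seed):
        # Each thread runs its own environment instance with its own random seed,
        # so at any wall-clock moment the threads sit in different parts of the
        # state space; their interleaved writes to shared_theta are what replace
        # sampling from a replay memory.
        global shared_theta
        rng = np.random.default_rng(seed)
        state = rng.standard_normal(N_FEATURES)
        for _ in range(N_STEPS):
            next_state, reward = toy_env_step(state, rng)
            # One-step TD(0) update on a linear value function (a stand-in for
            # the deep Q-learning / actor-critic updates used in the paper).
            td_error = reward + 0.99 * shared_theta @ next_state - shared_theta @ state
            shared_theta += 1e-3 * td_error * state   # asynchronous write, no lock
            state = next_state

    threads = [threading.Thread(target=actor_learner, args=(seed,)) for seed in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final parameters:", shared_theta)

The point of the sketch is only the structure: many independent rollouts feeding one shared parameter vector, rather than one rollout feeding a replay buffer.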
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes an asynchronous reinforcement learning method in which multiple actors (associated with different threads on a single machine) explore independent environment instances simultaneously and update the model asynchronously. The method achieves better results in less time on a variety of tasks.

Clarity - Justification: The paper is well written.

Significance - Justification: See the comments below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): It seems that the proposed method in this paper is similar to Gorila, except that it is a single-machine version and does not have experience replay memories for the different threads. The method gets experience data from several asynchronous actors, which explore simultaneously. Since the model is updated asynchronously by multiple actors exploring different instances, it is effectively updated from multiple uncorrelated samples. Nevertheless, subsequent samples from the same thread still tend to be correlated with that thread's earlier samples; in this regard, the gradients received on the model-parameter side are still periodically correlated, and the experience replay memory is what is normally used to further remove this correlation. If Gorila were implemented on a single machine and the experience replay memories were removed, it would be very similar to the method proposed in this paper. Furthermore, the authors state that putting multiple learners on a single machine removes the communication costs, which seems to be claimed as an advantage over the Gorila framework. However, this criticism is not fair, because multiple machines are indeed required for very large-scale models and problems. It is stated in the paper that gradients are accumulated over multiple steps before they are applied, partly to avoid overwriting other threads' updates. Does this mean that the authors did not use atomic operations when implementing the Hogwild! scheme? (From the original Hogwild! paper, it seems that atomic operations are used.)
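For what it is worth, here is a minimal single-threaded sketch of the two update schemes this question is about (not the authors' implementation; toy_gradient, T_MAX, and the learning rate are illustrative placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    N_FEATURES = 8
    LR = 1e-3
    T_MAX = 5                              # steps over which gradients are accumulated
    shared_theta = np.zeros(N_FEATURES)    # parameters shared across threads

    def toy_gradient(theta):
        # Placeholder for the per-step gradient an actor-learner would compute.
        return rng.standard_normal(theta.shape)

    # Scheme 1: Hogwild!-style, every step writes directly to the shared parameters.
    for _ in range(T_MAX):
        shared_theta -= LR * toy_gradient(shared_theta)

    # Scheme 2: accumulate gradients in thread-local storage for T_MAX steps,
    # then apply them to the shared parameters in a single write. This reduces
    # how often one thread's writes interleave with another's; it does not by
    # itself make the final write atomic.
    accumulated = np.zeros(N_FEATURES)
    for _ in range(T_MAX):
        accumulated += toy_gradient(shared_theta)
    shared_theta -= LR * accumulated

If the paper indeed relies on Scheme 2 without atomic operations, it would help to state this design choice explicitly and to discuss how it relates to the original Hogwild! analysis.

=====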