Paper ID: 1274
Title: Continuous Deep Q-Learning with Model-based Acceleration

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper introduces a method for applying Q-learning to problems with continuous states and actions while using neural networks for function approximation. The main tricks that make this possible are choosing to represent the Q-function in terms of a value function and an advantage function, and parametrizing the advantage function so that it's quadratic w.r.t. the action. Directly estimating the mean and covariance of the quadratic advantage function using a deep network makes it possible to evaluate the max of the Q-function with a simple forward pass through the value and advantage function networks. This makes Q-learning practically feasible. The authors also describe an approach to generating synthetic experience using policy rollouts based on a learned dynamics model (a la Dyna-Q). They show that this can accelerate the early, bumbling stages of training so long as the learned dynamics are sufficiently accurate.

Clarity - Justification:

The paper was easy to read, while providing sufficient technical detail (i.e., equations) to make the mechanisms behind the proposed method clear. I'm happy that the authors did not fill columns with linear algebra when mentioning iLQG or when describing their chosen approach to fitting adaptive local dynamics models.

Significance - Justification:

This paper is largely built around four ideas: using a value/advantage representation of Q(s,a), restricting A(s,a) to be quadratic w.r.t. 'a' and parametrized by a neural net, synthesizing experience using policy rollouts based on a learned dynamics model, and using a locally-adapted linear-Gaussian dynamics model. Of these, the least novel are the value/advantage decomposition of Q(s,a) and the use of locally-adapted linear-Gaussian dynamics. Training on synthetic experience generated by policy rollouts using learned dynamics has been suggested often, most notably with Dyna-Q and the overall Dyna architecture, but this paper seems to be the first time it has been shown effective in a (relatively) realistic setting with continuous states/actions and non-linear function approximation. The experiments probing the importance of the dynamics model's fidelity are also interesting. The most novel part of the paper is the representation of the advantage function using a quadratic whose mean and covariance are the outputs of a neural network. This makes practical deep Q-learning possible in the continuous state/action setting. It's not a panacea, as the quadratic assumption is restrictive and has unwanted side-effects, but it's a strong contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I liked this paper. It provides an interesting alternative to the actor-critic-type methods that might otherwise be used for the problems it considers. The proposed approach outperforms existing alternatives (though there aren't many -- yet) on most of the tested problems, sometimes significantly. The method doesn't look too difficult to implement.

I have a minor quibble with the authors' assertion(s) that this approach is significantly simpler than comparable actor-critic methods. Namely, the proposed approach requires estimators for V(s), mu(s), and P(s). Actor-critic methods require estimators for Q(s,a) and pi(s). Three is more than two (IMHO).
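To make that counting concrete, here is how I read the NAF parametrization: a shared torso maps the state to features, and three heads output V(s), mu(s), and a lower-triangular factor L(s) with P(s) = L(s) L(s)^T, so that A(s,a) = -1/2 (a - mu(s))^T P(s) (a - mu(s)) and the greedy action is just mu(s). A rough NumPy sketch; the shapes, names, and the exponentiated diagonal are my guesses, not the authors' exact construction:

    import numpy as np

    def naf_heads(features, W_V, W_mu, W_L, action_dim):
        """Toy NAF-style output heads on top of shared state features.

        features: (d,) feature vector from a shared torso network.
        W_V: (d,) value-head weights; W_mu: (action_dim, d);
        W_L: (action_dim*(action_dim+1)//2, d). Biases omitted for brevity.
        Returns V(s), mu(s), and P(s) = L L^T with L lower-triangular and a
        positive diagonal, so P(s) is positive semi-definite.
        """
        V = float(W_V @ features)                 # scalar state value
        mu = W_mu @ features                      # (action_dim,) greedy action
        l_entries = W_L @ features                # entries of the triangular factor
        L = np.zeros((action_dim, action_dim))
        L[np.tril_indices(action_dim)] = l_entries
        diag = np.diag_indices(action_dim)
        L[diag] = np.exp(L[diag])                 # keep the diagonal positive
        P = L @ L.T
        return V, mu, P

    def q_value(V, mu, P, a):
        """Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s))."""
        d = a - mu
        return V - 0.5 * d @ P @ d

    # The greedy action argmax_a Q(s, a) is simply mu(s), and max_a Q(s, a) = V(s),
    # so the Q-learning target needs only a forward pass, no inner optimization.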
While there is apparently some parameter sharing going on in the authors' models, it would be feasible to share many of the "state representation" parameters between Q(s,a) and pi(s) in an actor-critic method too. The main advantages of the proposed method over comparable actor-critic methods seem to be that the quadratic assumption acts as a regularizer and that the learning signal for all of the estimators comes directly from the environment. In comparable actor-critic methods the learning signal for the policy comes indirectly through Q(s,a), which is unlikely to be a good approximation early in training and may have wacky gradients if it's a big deep network (see, e.g., papers on adversarial examples for convnets).

I may have missed it, but where does the reward signal come from during the imagination rollouts?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper presents two contributions to the field. First, it presents a new parametrization of the Q value function, known as the Normalized Advantage Function (NAF), which enables efficient training using the Q-learning procedure with experience replay. Second, it presents a new type of model-guided exploration, known as Imagination Rollouts, where a dynamics model is used to generate both negative and positive trajectories, which are added to the experience replay buffer. The authors demonstrate excellent performance compared to Lillicrap et al. (2016) on a set of simulated robotic tasks, and investigate in detail how NAF and Imagination Rollouts compare to existing approaches.

Clarity - Justification:

The paper is clear and well written.

Significance - Justification:

The two contributions made in the paper are novel, mathematically sound, and yield substantial improvements on the investigated tasks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Defining the advantage function as quadratic assumes that each state has only one "optimal" action (or that all good actions are centred around the same mode), but for many tasks several different actions may be "optimal". One simple extension would be to define the advantage function as the maximum of k quadratic functions, so that the Q-function can have several modes; finding the greedy action for Q-learning would then require only k evaluations. Have you experimented with this?

Lillicrap et al. (2016) also managed to train effective policies when the observation space is given as pixels alone (instead of joints and positions). Is it possible to train NAF on pixels alone? How does it compare to DDPG in this case?

The main benefit of the proposed Imagination Rollouts procedure is its ability to use dynamics models efficiently. It would have been nice to see results on other, more complex tasks where such models are available.

Other comments:
- All figures: what do the semi-transparent lines and colors represent? Standard deviations? Confidence intervals?
- Figure 4 and Figure 5: change "iLQG-x" -> "ilqg-x".
- Define "MPC" in the main text.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The goal is to bring the generality of model-free deep reinforcement learning into real-world domains by reducing sample complexity. A normalized advantage function representation allows for applying Q-learning with experience replay to continuous tasks. Learned representations are shared between the Q function and the policy. Iteratively refitted local linear models demonstrate faster learning in certain domains.
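My reading of the acceleration scheme, as a rough Dyna-style sketch; the model fitting, noise, and buffer interface below are my own simplifications for illustration, not the authors' exact procedure:

    import numpy as np

    def imagination_rollouts(replay_buffer, fit_local_model, policy, reward_fn,
                             n_synthetic=10, horizon=5, noise_scale=0.1):
        """Augment real experience with short rollouts under a locally
        fitted linear-Gaussian dynamics model (Dyna-style).

        All arguments are placeholders I introduce here: fit_local_model(buffer)
        returns (A, B, c) with s' ~ A s + B a + c fitted to the latest batch of
        rollouts; reward_fn supplies rewards for imagined transitions;
        replay_buffer supports .sample(n) and .add(transition).
        """
        A, B, c = fit_local_model(replay_buffer)      # iteratively refit local model
        for _ in range(n_synthetic):
            s, _, _, _ = replay_buffer.sample(1)[0]   # branch off a real state
            for _ in range(horizon):                  # keep rollouts short: the model
                a = np.asarray(policy(s))             # is only locally accurate
                a = a + noise_scale * np.random.randn(len(a))  # exploration noise
                s_next = A @ s + B @ a + c            # imagined next state
                r = reward_fn(s, a)                   # reward for the imagined step
                replay_buffer.add((s, a, r, s_next))  # mixed into replay with real data
                s = s_next
        # Q-learning (e.g. with the NAF parametrization) then trains on the
        # combined real + imagined contents of the replay buffer as usual.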
Clarity - Justification:

Well written, apart from the quibbles mentioned in Section 5.

Significance - Justification:

The basic approach and the experiments are interesting, but relations to previous work are missing, and it's not clear that the new system will work better than some of the systems described in this previous work, so I am hesitant to label this as "above average".

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I have quite a few comments on missing related work.

The authors refer to Dyna-Q (Sutton, 1990) for discrete domains, and introduce a system for continuous domains. However, back in 1990 there already was another model-learning RL system for continuous domains, even with recurrent networks for partially observable worlds. What is the relation to this work?

J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 2, pages 253-258, 1990.

J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, NIPS'3, pages 500-506. San Mateo, CA: Morgan Kaufmann, 1991.

This extends earlier work on RL with feedforward NN models, e.g.:

Werbos, P. J. (1989a). Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS IJCNN, Washington, D.C., volume 1, pages 209-216.

Nguyen, N. and Widrow, B. (1989). The truck backer-upper: An example of self-learning in neural networks. IJCNN, pages 357-363.

One should mention this line of research and point out what's different here.

l 066: "This makes it possible to train policies for complex tasks with minimal feature and policy engineering, using the raw state representation directly as input to the neural network." The first paper on this (on RL with high-dimensional raw vision) used a type of RL that is neither discussed in the related work section nor cited by the authors, namely compressed network search. What is the relation to this?

J. Koutnik, G. Cuccu, J. Schmidhuber, F. Gomez. Evolving Large-Scale Neural Networks for Vision-Based Reinforcement Learning. GECCO, Amsterdam, July 2013.

l 145: "However, we demonstrate that iteratively fitting local linear models to the latest batch of on-policy or off-policy rollouts provides sufficient local accuracy to achieve substantial improvement using short imagination rollouts in the vicinity of the real-world samples." How is this related to the RoboCup-winning RL system of 2004, which also used short imagination rollouts, e.g.:

Gloye, A., Wiesel, F., Tenchio, O., Simon, M. Reinforcing the Driving Quality of Soccer Playing Robots by Anticipation. IT - Information Technology, vol. 47, nr. 5, Oldenbourg, 2005.

l 483: "However, it should be noted that imagination rollouts can suffer from severe bias when the learned model is inaccurate. For example, we found it very difficult to train nonlinear neural network models for the dynamics that would actually improve the efficiency of Q-learning when used for imagination rollouts." I think this was already noted in the early 1990s; I'm not even sure who was first to note it.

The experiments are interesting, but it's not clear that the new system will work better than some of the systems described in the missing previous work, so I am left with an inconclusive feeling here.

=====