We would like to thank the reviewers for their constructive and helpful feedback.

R5: We thank R5 for pointing us to exciting prior work. We will gladly discuss these references in the final draft. Nguyen et al. (1989), Schmidhuber (1990, 1991), and Golye et al. (2004) are insightful pioneering works on model-based RL with neural networks. They provide valuable insights on sequential vs. parallel training, a variant of model-predictive control (as in Golye et al. (2004), which resembles our experiment using iLQG in MPC mode with fitted dynamics), online recurrent model and policy learning, and more. Schmidhuber (1991) also discusses the connection between the model network (plus controller network) and the adaptive heuristic critic (Barto et al., 1983) [1]. However, these works focus primarily on model-based RL algorithms. We address cases where the model is approximate and imperfect, and therefore inadequate by itself for learning a good policy. Our paper focuses on combining model-free RL with model-based RL to reduce sample complexity, which we believe is an aim orthogonal and complementary to these prior methods. Koutnik et al. (2013) propose an appealing gradient-free method for learning high-dimensional policies with vision, but do not focus explicitly on reducing its sample complexity. The primary focus of our work is reducing the sample complexity of model-free methods through model-based acceleration, which in principle could also apply to gradient-free methods such as that of Koutnik et al., although we focus on gradient-based Q-learning methods in this work. We will report sample complexity in terms of episodes in the final draft (the reviewers can get a sense from the submitted draft by noting that each episode corresponds to 150-300 steps). Given the surveyed prior work and the observed sample complexities, we believe our approach is among the strongest in terms of sample complexity and generality for learning high-dimensional policies.

R3: We refer to NAF as simpler than actor-critic primarily because NAF optimizes a single objective, while actor-critic requires additional machinery, such as separately tuning learning rates for the actor and the critic. However, we agree that this claim requires more nuance, and we will revise it as suggested by R3.

R4: R4 raised two interesting directions to explore: multi-modal NAF and NAF from perception. On multi-modal NAF, a sum of quadratics is still quadratic (and hence unimodal), so we have been experimenting with a max over quadratic terms, which retains analytic computation of argmax_u while adding multi-modality, unlike parametrizing the log of a mixture of Gaussians (see the sketch after the references below). On perception, NAF can be trained from images in the same way as DDPG. While we will attempt to investigate these extensions in the final draft, a complete and thorough investigation is likely outside the scope of the present paper and would be deferred to future work.

Minor clarifications: (1) rewards in imagination rollouts come from evaluating the reward function on the imagined states (a short sketch is also included below), (2) shaded regions in the figures indicate one standard deviation.

[1] Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834-846.
[2] Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (pp. 1889-1897).
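
To make the R3 and R4 responses above more concrete, here is a minimal sketch (in our own notation, not verbatim from the submission) of the NAF decomposition that yields the single training objective, and of the max-of-quadratics variant we are experimenting with; the per-mode offsets b_i(x) below are an illustrative assumption.

Standard NAF uses a single negative-definite quadratic advantage, so the greedy action is analytic and the whole network is trained with one Bellman (TD) objective:

    Q(x, u) = V(x) + A(x, u),
    A(x, u) = -\frac{1}{2} (u - \mu(x))^\top P(x) (u - \mu(x)),  with  P(x) \succ 0,
    \arg\max_u Q(x, u) = \mu(x).

A multi-modal variant replaces the single quadratic with a max over K quadratic modes:

    A(x, u) = \max_{i = 1, \dots, K} \Big[ b_i(x) - \frac{1}{2} (u - \mu_i(x))^\top P_i(x) (u - \mu_i(x)) \Big],

which is still maximized analytically: each mode attains its maximum b_i(x) at u = \mu_i(x), so

    \arg\max_u Q(x, u) = \mu_{i^*}(x),  where  i^* = \arg\max_i b_i(x).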
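
For clarification (1), a minimal illustrative sketch of an imagination rollout, assuming access to a fitted dynamics model and a known reward function (the names f_model, reward_fn, policy, and the noise scheme are hypothetical and not our exact implementation):

    import numpy as np

    def imagination_rollout(s0, policy, f_model, reward_fn, horizon=10, noise_std=0.1):
        # Roll out the fitted dynamics model f_model from a real starting state s0.
        # Rewards come from evaluating reward_fn on the imagined states/actions,
        # never from the real environment.
        transitions = []
        s = np.asarray(s0, dtype=np.float64)
        for _ in range(horizon):
            a = policy(s)
            a = a + noise_std * np.random.randn(*np.shape(a))  # exploration noise on the synthetic action
            s_next = f_model(s, a)   # imagined next state from the fitted model
            r = reward_fn(s, a)      # reward evaluated on the imagined (s, a)
            transitions.append((s, a, r, s_next))
            s = s_next
        return transitions           # synthetic transitions to append to the replay buffer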