We thank all reviewers for their feedback and suggestions, and we will incorporate them to improve the paper.

Assigned reviewer 4: The reviewer seems to focus on one of the four algorithms presented in the paper (one-step Q-learning), the others being multi-step Q-learning, SARSA, and actor-critic. In that respect, we respectfully disagree that this work is a simple extension of DQN. We believe this work presents a very accessible framework for applying RL to challenging problems at scale without significant resource requirements. Recently, massive progress in RL has been made using richer 'deep' function approximation schemes, for which convergence guarantees are typically not available. Excellent theoretical work exists for, e.g., asynchronous dynamic programming, but it provides no convergence guarantees when nonlinear function approximators are used, so empirical studies are the best current tool for understanding stability. Training neural network controllers with reinforcement learning has long been considered prone to instability. We presented a simple framework in which variants of four RL algorithms successfully trained neural networks without the use of experience replay (which DQN used to stabilize Q-learning with neural networks). We believe that showing how a wide variety of methods, including Q-learning, SARSA, and actor-critic, can be used to train neural networks is a significant contribution that will open many possibilities for future research. Our results are far from mixed: all four methods scaled roughly linearly or superlinearly with the number of parallel actor-learners, and all four methods were stable. The best of the methods achieved state-of-the-art results on the Atari domain using a single machine without a GPU, in half the training time of the previous state of the art.

Assigned reviewer 5: It is certainly true that some problems require models that cannot fit on a single machine, and we believe that many of our findings will transfer directly to the massively distributed setting used by Gorila. Nevertheless, the single-machine setting used in this paper can train relatively large models, since it easily scaled up to challenging problems such as 3D maze navigation with convolutional LSTM networks. Moreover, we believe it is very important to compare different algorithms, as we did in this work (Q-learning, SARSA, A3C), before committing to a large-scale implementation.
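For concreteness, the sketch below illustrates the kind of asynchronous one-step Q-learning actor-learner loop we refer to: each thread interacts with its own copy of the environment, forms one-step targets from periodically synchronized target parameters, accumulates updates over a few steps, and applies them to the shared parameters. The toy chain environment, the tabular parameterization standing in for a neural network, the use of a lock, and all hyperparameter values are illustrative assumptions, not the configuration used in the paper.

```python
import threading
import numpy as np

# Hypothetical toy chain environment (illustrative assumption, not from the paper):
# states 0..N_STATES-1, actions {0: left, 1: right}, reward 1 for reaching the last state.
N_STATES, N_ACTIONS = 8, 2

def env_step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

# Shared parameters and target parameters (a table stands in for the neural network).
theta = np.zeros((N_STATES, N_ACTIONS))
theta_target = theta.copy()
lock = threading.Lock()  # updates in the paper are lock-free; a lock keeps the sketch simple
GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1
ACCUM_STEPS, TARGET_SYNC, TOTAL_STEPS, N_THREADS = 5, 100, 5000, 4

def actor_learner(seed):
    rng = np.random.default_rng(seed)
    s, t = 0, 0
    grad = np.zeros_like(theta)
    for _ in range(TOTAL_STEPS):
        # epsilon-greedy action from the shared parameters
        a = rng.integers(N_ACTIONS) if rng.random() < EPSILON else int(np.argmax(theta[s]))
        s2, r, done = env_step(s, a)
        # one-step Q-learning target computed from the target parameters
        y = r if done else r + GAMMA * np.max(theta_target[s2])
        grad[s, a] += y - theta[s, a]  # accumulate the TD error (the tabular "gradient")
        t += 1
        if t % ACCUM_STEPS == 0 or done:
            with lock:
                theta[:] += ALPHA * grad  # asynchronous update of the shared parameters
            grad[:] = 0.0
        if t % TARGET_SYNC == 0:
            theta_target[:] = theta  # periodic target-parameter synchronization
        s = 0 if done else s2

threads = [threading.Thread(target=actor_learner, args=(i,)) for i in range(N_THREADS)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print("Greedy policy per state:", np.argmax(theta, axis=1))
```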