Dear reviewers,

Thanks for all your feedback and suggestions. We will carefully incorporate them into our paper. As noted by the reviewers, two major contributions of this work are the re-implementation and extensive comparison of a wide range of algorithms, as well as a set of challenging control tasks that we have made publicly available. Beyond the paper, we are continuing to implement and evaluate new algorithms as they are proposed, and to add new tasks to the benchmark suite.

Assigned_Reviewer_4

Re: Compatibility
We are working on an HTTP interface for communicating with the simulation from other languages. The benchmark will be released under a BSD license.

Re: More tasks
We plan to add more tasks, including manipulation tasks, to the benchmark. We will host the code on GitHub and will help facilitate contributions from the community.

Re: Are all tasks solvable?
While developing the hierarchical tasks, we also implemented a point robot controllable with the arrow keys and verified by manual control that the tasks are solvable.

Re: Baseline
We use a hand-tuned feature representation, consisting of polynomial terms of the observations plus an entry indicating the time step (a sketch appears at the end of this response). We found empirically that this outperforms the neural network baselines, and hence we used it for REPS as well. We will explicitly describe the feature vector in the updated paper.

Re: Are tasks comparable in scaling?
We normalize the empirical returns in each iteration to have mean 0 and variance 1, which makes the algorithms relatively robust to reward scaling (this step is included in the sketch at the end of this response).

Re: Additional performance metrics
We agree that more metrics would make the evaluation more informative. We are also working on a public scoreboard that makes the raw learning-progress data available.

Re: Parameter settings
These are included in the supplementary material.

Re: RNN results
We fixed the numerical issues for RNNs after the submission. We only used features of the current observation to fit the baseline; incorporating past observations may boost performance further.

Re: Fully vs. partially observed (PO) results
The PO tasks were run with a shorter horizon (100 vs. 500 steps) than the fully observed tasks in order to speed up training. This roughly means that the scores for PO tasks are 1/5 of those for the fully observed ones.

Re: REPS stationarity
The REPS problem formulation specifically assumes the existence of a stationary state distribution; see Eq. 5-8 of [Peters-10AAAI] (restated at the end of this response).

Assigned_Reviewer_5

Since submission we have added DDPG [1]. We are continuing to add more algorithms, including Gprop [2], GPS [3], SVG [Heess et al], and A3C [Mnih et al]. Although only one of the methods in the original submission was proposed after the rise of deep learning, there is a long history of RL with neural networks (e.g., Sutton99NIPS, Ng2000ICML). Algorithms designed with deep neural networks in mind continue to be proposed, as do novel tasks that reflect the challenges of deep RL for continuous control. We are committed to incorporating them into the benchmark, ourselves and/or in collaboration with the authors of new approaches, which we hope will help clarify the relative merits of new methods.

Assigned_Reviewer_6

Re: Preliminaries: finite horizon
We agree that the tasks should be independent of these considerations. Here, the horizon and discount factor are technical assumptions made by the algorithms rather than by the tasks. We illustrate the distinction below.
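To make this concrete (the notation below is ours and is illustrative): an algorithm may internally optimize a discounted objective, while we evaluate the undiscounted return over the task horizon T:

    % Algorithm-side objective: the discount \gamma is an algorithmic choice,
    % typically introduced for variance reduction.
    \eta_\gamma(\pi) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{T-1} \gamma^t\, r(s_t, a_t)\Big]

    % Task-side evaluation metric: the undiscounted return, which permits
    % comparison across algorithms that use different values of \gamma.
    \eta(\pi) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{T-1} r(s_t, a_t)\Big]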
Re: Hierarchical tasks
We appreciate the provided references and will cite them in the next version. We will also expand the explanation that motivates these tasks, namely that decisions must be made at multiple temporal scales.

Re: REINFORCE / VPG
We will refer to it as REINFORCE in the next version. For the baseline, we use a linear function of hand-tuned basis features, fit to Monte Carlo estimates of the value function (see the sketch at the end of this response). We plan to include actor-critic methods in the future.

Re: Performance metric: (un)discounted?
As noted above, the discount here is specific to the algorithm rather than the task, and is typically used for variance reduction. Using the undiscounted return allows us to compare results across algorithms that may use different discounts.

Re: Deep policies
The network architecture and the log-std parameterization follow [Schulman16ICLR] (a sketch appears at the end of this response). Comparing deeper architectures would be an interesting future direction.

Re: Code suggestions
We will avoid unnecessary components in the code. We also plan to separate the RL and simulation code into two modular components.
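For concreteness, below is a minimal sketch of the kind of hand-tuned linear baseline described in our responses above; the polynomial degree, observation clipping, and least-squares fit here are illustrative assumptions rather than the exact choices in the paper. The per-iteration return normalization mentioned to Assigned_Reviewer_4 is included as well.

    import numpy as np

    def baseline_features(obs, t, horizon):
        # Polynomial terms of the observation plus time-step entries.
        # Degree-2 terms and the clipping range are assumptions for illustration.
        o = np.clip(obs, -10.0, 10.0)
        ts = float(t) / horizon
        return np.concatenate([o, o ** 2, [ts, ts ** 2, ts ** 3, 1.0]])

    def fit_linear_baseline(paths, horizon):
        # Least-squares fit of a linear baseline to Monte Carlo return estimates.
        X = np.vstack([baseline_features(o, t, horizon)
                       for path in paths
                       for t, o in enumerate(path["observations"])])
        y = np.concatenate([path["returns"] for path in paths])
        coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        return coeffs

    def normalize_returns(returns):
        # Normalize empirical returns in each iteration to mean 0 / variance 1,
        # which makes the algorithms relatively robust to reward scaling.
        returns = np.asarray(returns, dtype=np.float64)
        return (returns - returns.mean()) / (returns.std() + 1e-8)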
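Restating the REPS stationarity assumption mentioned to Assigned_Reviewer_4 (our transcription of the constraint in Eq. 5-8 of [Peters-10AAAI]; please consult the original for the exact statement):

    % Stationarity of the state-action distribution \mu(s, a):
    \forall s': \quad \sum_{a'} \mu(s', a') \;=\; \sum_{s, a} \mu(s, a)\, p(s' \mid s, a)

    % Relaxed feature-expectation form for large or continuous state spaces:
    \sum_{s', a'} \mu(s', a')\, \phi(s') \;=\; \sum_{s, a, s'} \mu(s, a)\, p(s' \mid s, a)\, \phi(s')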
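Finally, a minimal sketch of a Gaussian MLP policy with a state-independent learned log standard deviation, in the spirit of [Schulman16ICLR]; the layer size, initialization, and nonlinearity here are illustrative, not the exact architecture we used.

    import numpy as np

    class GaussianMLPPolicy:
        # The mean is produced by a small MLP; the log std is a free parameter
        # vector shared across states and learned jointly with the weights.
        def __init__(self, obs_dim, action_dim, hidden=32, seed=0):
            rng = np.random.RandomState(seed)
            self.W1 = rng.randn(obs_dim, hidden) * 0.1
            self.b1 = np.zeros(hidden)
            self.W2 = rng.randn(hidden, action_dim) * 0.1
            self.b2 = np.zeros(action_dim)
            self.log_std = np.zeros(action_dim)
            self._rng = rng

        def act(self, obs):
            # Sample an action from N(mean(obs), diag(exp(log_std)^2)).
            h = np.tanh(obs @ self.W1 + self.b1)
            mean = h @ self.W2 + self.b2
            return mean + np.exp(self.log_std) * self._rng.randn(*mean.shape)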