Paper ID: 629
Title: Benchmarking Deep Reinforcement Learning for Continuous Control

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a testbed for assessing the quality of Deep Reinforcement Learning algorithms on continuous state and action spaces. The testbed includes a set of simple swing-up/balancing tasks, a set of locomotion tasks with increasingly complex robots, a set of partially observable tasks obtained by transforming the locomotion tasks (e.g. adding noise to the sensors or taking away the velocity information), and a set of hierarchical tasks obtained by adding extra objectives to the locomotion tasks (gathering food or reaching a target area). Within this testbed, a single Deep RL algorithm is compared to other miscellaneous approaches.

Clarity - Justification:
The paper was a pleasant read.

Significance - Justification:
No new algorithm is introduced, and the only Deep RL algorithm considered in the paper was already evaluated on similar locomotion tasks. Probably the most novel part is the hierarchical tasks, but very little is done with these experiments. As a result, no significant insight is gained upon reading the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The idea of a testbed for systematic comparison of Deep RL algorithms is sound and should be encouraged. However, out of all the methods cited in the introduction, only a single RL algorithm that was developed with deep networks in mind is evaluated on the testbed. With such an omission it becomes hard to assess whether the selected tasks are relevant and are able to highlight the trade-offs that may occur with Deep RL algorithms. Recently, several more deep RL algorithms for control have been proposed; for example, see [1,2,3]. I think a comparison to these papers would make the paper much stronger.

[1] Lillicrap et al., Continuous control with deep reinforcement learning
[2] Balduzzi et al., Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies
[3] S. Levine et al., several guided policy search papers...

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an integrated collection of continuous state and action environments for benchmarking reinforcement learning algorithms. While other toolkits have already been proposed (PyBrain, RLPy, RL-Glue), this one aims to be the "continuous" equivalent of the Arcade Learning Environment (ALE). Therefore, it focuses only on providing fast implementations of the classical control problems (such as cart-pole and mountain car) as well as more recent ones (3D environments). Another contribution of this paper is to provide an extensive comparison of policy search methods on these environments.

Clarity - Justification:
The description of the algorithms is concise but remains accurate. However, I have a few objections (explained below) to the presentation of REINFORCE and policy gradient methods. The definition of the baseline function is also unclear to me.

Significance - Justification:
Despite not proposing any new theory, this work could still have a great impact in our community. In supervised learning, "standard datasets" can easily be downloaded and readily applied to newly developed algorithms. However, RL experiments tend to be difficult to produce since most of the development time has to be dedicated to physics simulation.
This often contrasts with the simplicity of RL algorithms, which can easily be implemented in a few lines of code. A comparison of policy search methods was much needed in the field, and it already provides some new research directions to improve existing work. For example, I found it surprising to read that the simple REINFORCE method compares favorably to more complicated methods. It is also the first time that I have seen an evaluation of REPS against other methods. The authors also found that it tends to converge prematurely. This provides some useful research pointers.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
First paragraph: superfluous new line.

In the preliminaries: I don't see why you need to assume the finite horizon case. I understand that much of the current robotics research assumes that setting, but I think that the proposed set of environments is independent of these considerations. The presence of the discount factor in this definition also already provides an implicit "finite horizon".

Hierarchical tasks: MAXQ is important, but I would also cite Options (Sutton, Precup, Singh) and Hierarchical Abstract Machines (HAMs) by Ron Parr and Stuart Russell. It would be useful if you could explain better why these tasks exhibit a hierarchical structure. Locomotion + food collection naturally has two concurrent goals, but why are these hierarchical? What could be another example?

Vanilla versus natural: I think that it would be preferable to simply call this method "REINFORCE" or the likelihood ratio method. In my understanding, the qualifier "vanilla" originates from Jan Peters' work. I find that it carries a negative connotation, and it doesn't help in describing the nature of the algorithm. You also write policy gradient / ("slash") REINFORCE. These two results are related but certainly not equivalent. In fact, when I read "VPG" I expect to see a description of how a "critic" is learned (say, by using TD). Since the variant that you use is based on the Monte Carlo return, the method really should be called REINFORCE. If you have time, I would greatly appreciate it if you could also include an additional column in your table where you compare PG methods with a learned critic.

TRPO: when explaining the definition of $A_\theta(s, a)$, you could also mention that it is the "advantage function".

Performance metric: since you assumed a discounted setting in the preliminaries, I'm wondering if using the undiscounted return is the appropriate metric.

Deep policies: I understand that for the sake of comparing algorithms, 3-layer neural networks might be sufficient. However, I don't think that this qualifies as "deep" nowadays. If you had space, it would be useful if you could write the policy representation explicitly. Also, why the log standard deviation?

Baseline: In PG methods, it is common to use the value function as a baseline. However, since you don't learn a critic, I'm not sure what baseline you are using. Are you learning this baseline with the empirical variance as an objective?

# Code
It looks well written. However, in a final release it might be preferable if you separate the simulation code from the RL algorithms. As an RL researcher, the only thing I care about is easily importing a set of environments. I have tried to run examples/trpo_cartpole.py. However, since the code is written in Python 2, I tried to convert it with "2to3" but didn't succeed because of the use of "ord()" in rllab/misc/overrides.py.
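For what it's worth, something along the following lines would sidestep the Python 2/3 issue. This is only a rough sketch of the kind of simpler decorator I have in mind (the Environment/CartpoleEnv classes are made up purely for illustration), not a claim about how rllab implements it:

    def overrides(base_class):
        """Minimal 'overrides' check: the base class is passed explicitly and
        the decorated method name is verified to exist on it. Works unchanged
        on Python 2 and 3, with no bytecode or ord() tricks."""
        def decorator(method):
            assert hasattr(base_class, method.__name__), (
                "%s does not override anything in %s"
                % (method.__name__, base_class.__name__))
            return method
        return decorator

    class Environment(object):  # hypothetical base class, for illustration only
        def step(self, action):
            raise NotImplementedError

    class CartpoleEnv(Environment):  # hypothetical subclass
        @overrides(Environment)
        def step(self, action):
            return action  # dummy body, just to keep the sketch runnable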
The "overrides" pattern is nice of a language like Java but I don't think that it's a very Pythonic thing to do. overrides.py seem to contain many tricks and I think that it would be preferable not to implement such a pattern since it adds unnecessary complications. ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The paper proposes a suite of continuous control benchmark problems, including low-dimensional classical ones (such as cart-pole balancing), locomotion problems of varying dimensionality and difficulty (including high-dimensional humanoids), partially observed tasks of different variety, and two sets of “hierarchical” tasks where locomoting creatures need to gather food or navigate a maze. Several published RL algorithms are evaluated on the proposed tasks. The paper is accompanied by a reference implementation of the benchmark tasks (mostly reliant on the physics engine Mujoco) and of the tested algorithms. Clarity - Justification: The paper is very clearly written. Some of the details regarding the tasks and regarding the implementations of the algorithms need, however, be inferred from the source code. Significance - Justification: A good suite of diverse continuous control benchmark tasks and a reference implementation of several state of the art and of some older algorithms will be very valuable for the community. It will lower the barrier for researchers to enter this field and it will facilitate the comparison of new algorithms. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): If the code that is provided as supplemental material lives up to what the paper promises (I haven’t checked it in great detail, only briefly browsed the website -- which looks good) then I think the proposed benchmark suite and the provided reference implementation of algorithms will be highly valuable for the community. Developing challenging but solvable continuous control problems is a very time consuming and difficult task, and so is the re-implementation of published algorithms. To my knowledge there is currently no widely accepted benchmark suite of similar diversity available. In my eyes, the main value of the paper derives from the reference implementation. It’s practical value will greatly facilitate scientific progress so I’m very much in favor of publicizing it. At the same time, the paper does not propose any new algorithms and except for the hierarchical ones most of the tasks are not novel but taken from published papers. Also, relative exhaustive comparisons of several algorithms on subsets of the tasks proposed here (or very similar ones) have previously been published as part of papers developing new algorithms (eg. Levine & Abeel, 2014: Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics; Schulman et al., 2015: Trust region policy optimization).) There is some novel insight to be gained regarding the behavior of the different algorithms from the experimental evaluation in the paper but I think this is somewhat limited. To me this raises a little bit the question whether an ICML paper is the appropriate format for publication, considering also that a conference paper review is an inadequate way to test the practical usability of the benchmark suite. But I leave this for the PC to decide; I am in favor of it. 
Further detailed comments:

**Compatibility: From my brief browse of the accompanying website, my impression is that the benchmark suite is tightly tied to Python. How easy would it be to run algorithms written in other languages against this suite? Especially if the tasks are effectively specified in the form of Python code rather than in some declarative form such as an XML file, then I think it is important to provide easy-to-use interfaces for other languages. But I think it would be good to provide a non-code-based full specification of all tasks (including initial conditions, episode lengths, etc.) as well. Under what license will the benchmark suite be available?

**Tasks: Although I like the diversity of tasks included in the suite, it seems that one important family of tasks is missing: manipulation (reaching, grasping, …). Tasks of this kind would be good to add. Do you have plans in this respect? More generally, what are your plans for including new tasks developed by the community in the benchmark suite? For those tasks where none of the algorithms achieves satisfactory performance: have you ensured that they are actually solvable / sensible?

**Algorithms: Re REPS: What are you using as features phi? It seems that this algorithm is the only one that uses a value function effectively as a critic rather than as a baseline; hence this choice of features should be much more important than the choice of the baseline is for the other algorithms. Or am I missing something?
Baseline: What specifically did you use as feature vector for the baseline? Simply a vector encoding the time step? Or some nonlinear features of the state plus the time step? Too simple a baseline could unduly affect the performance of some of the algorithms more than others, I suppose? (A rough sketch of the kind of baseline parameterization I have in mind is included after these comments.)

**Experimental Setup: You propose to optimize the hyper-parameters for each algorithm on a subset of tasks in each family and to then apply the algorithm with the thus-found hyper-parameters to the remaining tasks in the family. I agree that it is important to test algorithms for their robustness to hyper-parameter choices. But for the above procedure to be reasonable, I suppose one wants to ensure that the different tasks are roughly comparable, e.g. in terms of scaling of rewards, # of timesteps per episode, etc. Has any effort been made in this respect?
The average return over all training episodes is used as the main performance measure. While this rewards both fast learning and high final performance, it also makes, e.g., a quickly achieved mediocre performance indistinguishable from a more slowly achieved high final performance. Maybe it would be good to add a second score that takes this into account? Similarly, it may be useful to report the average performance as well as the performance of the best run. I couldn't find information about the number of rollouts per iteration (N) and the length of the episodes anywhere - what were these? (Apologies if I have missed something.)

**Results: I am a little bit surprised by the poor performance on the PO tasks. Have you tried simple tricks commonly applied with RNNs, such as gradient clipping? Also, what was your baseline for these tasks: a recurrent network, or again just a feature vector of the current observation? Why does REPS suffer more from the assumption of stationarity than the other algorithms? Why does there seem to be a difference between the performance of the random policy for fully observed vs. partially observed versions of the basic tasks (e.g. for mountain car, acrobot)?
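To make the baseline question above concrete, here is a rough sketch of the kind of parameterization I have in mind: a linear function, refit by least squares after each batch, on simple features of the observation and the normalized time step. The names and feature choices are mine and purely illustrative; I am not claiming this is what the paper does.

    import numpy as np

    def baseline_features(obs, t, horizon):
        # Hand-coded features: clipped observation, its elementwise square,
        # and a polynomial in the normalized time step (illustrative only).
        o = np.clip(obs, -10.0, 10.0)
        ts = float(t) / horizon
        return np.concatenate([o, o ** 2, [ts, ts ** 2, ts ** 3, 1.0]])

    def fit_linear_baseline(paths, horizon):
        # Least-squares fit of a linear value baseline on the features above.
        # `paths` is a list of dicts with "observations" (T x d array) and
        # "returns" (length-T array), as produced by a batch of rollouts.
        feats = np.concatenate([
            np.array([baseline_features(o, t, horizon)
                      for t, o in enumerate(path["observations"])])
            for path in paths])
        rets = np.concatenate([path["returns"] for path in paths])
        coeffs, _, _, _ = np.linalg.lstsq(feats, rets, rcond=None)
        return coeffs  # baseline prediction is then feats.dot(coeffs)

Whether the reported results use something like this, only the time step, or a learned critic could noticeably affect the comparison, so it would be good to state it explicitly in the paper.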
**Other comments:
- line 437: superscript M?
- Fig. 4c: the caption says it shows NPG and TRPO as a function of the mean KL-divergence. Isn't the y-axis here the return?
- the formatting of the references in the text is odd (first & last author)
=====