Paper ID: 1322
Title: Model-Free Trajectory Optimization for Reinforcement Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose to use a policy iteration algorithm to optimize time-varying linear-Gaussian policies. The algorithm alternates between fitting time-varying local quadratic Q-functions using dynamic programming and solving for the corresponding optimal time-varying linear-Gaussian policy under constraints on entropy reduction and KL-divergence against the prior policy. The method is evaluated on two benchmark tasks and one simulated robot task, and compared to a few prior methods.

Clarity - Justification:
The paper is well written and easy to follow. There are a couple of typos and some awkward phrasing, but they didn't substantively impede my ability to understand the paper. Given the complexity of the material, that's quite good.

Significance - Justification:
The proposed algorithm has some very promising and nice ideas, but is specific to time-varying linear policies. That is a pretty restrictive policy class. It's not clear how broadly applicable the approach would be, but perhaps extensions to more complex policies are possible in future work? Unfortunately, the authors don't really discuss this.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I found the paper relatively easy to follow, and many of the design decisions were sound. The results reflect a good effort to compare to prior work, though as with many RL papers, the comparison is not on a standardized task (as far as I could tell). It's hard to count that as a minus, though, since it also applies to most other papers on this topic. I have two main concerns about the work:

1. The most technically interesting part for me was the optimization of the policy under constraints. However, this optimization basically follows "Model-based relative entropy stochastic search" and extends it to the time-varying case. So the novelty is perhaps a bit limited, though the idea is interesting enough that I think it would be above the bar.

2. Perhaps more importantly, I'm deeply skeptical of the approach due to the large number of high-dimensional regression problems that need to be solved, combined with potentially ill-conditioned importance sampling. Dynamic programming on fitted Q-functions for high-dimensional tasks is liable to suffer from pretty extreme bias and drift, and it seems like incorporating importance weights would exacerbate these issues. The authors seem to acknowledge that the bias exists, but say little about how they deal with it. It also seems that for importance weighting to be effective, the steps have to be quite small. This is perhaps reflected by the relatively large number of iterations used in the experiments (when accounting for the fact that the method is optimizing time-varying linear policies), suggesting the use of a relatively small step. I think that a sample complexity analysis of the method would point to some serious problems. The prior work ("Model-based relative entropy stochastic search") used dimensionality reduction to deal with this, but that is not used here. So I'm not sure how well this would scale or how vulnerable it would be to importance-weight degeneracy, and the choice of step size in that case appears really important. Some discussion of these points in the paper would be beneficial, particularly if there are best practices that could ameliorate these issues; see the sketch below for one diagnostic the authors could report.
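To make the importance-weighting concern concrete: one standard degeneracy diagnostic the authors could report per iteration is the (Kish) effective sample size of the importance weights used in each time step's regression. Below is a minimal sketch in Python; the function names, the quadratic features, and the importance-weighted ridge fit are illustrative stand-ins of my own, not the paper's actual estimator.

    import numpy as np

    def effective_sample_size(weights):
        """Kish effective sample size of importance weights.
        Values far below len(weights) signal weight degeneracy."""
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        return 1.0 / np.sum(w ** 2)

    def fit_weighted_quadratic_q(phi, q_targets, weights, reg=1e-6):
        """Importance-weighted ridge regression of Q-value targets on
        quadratic features phi(s, a). A generic stand-in for whatever
        per-time-step estimator the paper actually uses."""
        phi = np.asarray(phi)            # (N, d) feature matrix
        y = np.asarray(q_targets)        # (N,) regression targets
        w = np.asarray(weights)          # (N,) importance weights
        A = phi.T @ (w[:, None] * phi) + reg * np.eye(phi.shape[1])
        b = phi.T @ (w * y)
        return np.linalg.solve(A, b)     # coefficients of the quadratic model

If the effective sample size collapses to a small fraction of N between policy updates, each fitted Q-model is effectively built from a handful of trajectories, which is exactly the bias-and-drift scenario I describe above. Reporting this number would also make the role of the KL step size much easier to judge.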
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a policy improvement scheme for a trajectory-bound linear feedback policy using dynamic programming and transition data alone. Like other trajectory-optimization algorithms, the work relies on a time-dependent quadratic approximation of the Q-function, but here the quadratic approximation around the nominal trajectory is obtained without explicitly representing a linear approximation of the transition model.

Clarity - Justification:
Overall the paper is well written. I would have liked to see a more detailed derivation of eq. (4); see my note at the end of this review for the generic form I assume it takes.

Significance - Justification:
Trajectory optimization is our best tool against the curse of dimensionality, which plagues the advance of motor learning. This paper offers another such method.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall this is a good paper, and I found myself genuinely engaged with the ideas presented here. For example, I was happy to see the weakness of dynamic motion primitives exposed (fig. 3b). One reason for concern is the fact that the initial policy was seeded by learning from demonstration, while other state-of-the-art work such as GPS does not require that; this could be due to the problematic domain of table tennis, or it may represent a more fundamental flaw in the exploration afforded by the algorithm, but this open issue can be explored in future work.
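A note on eq. (4), mostly for other readers: the generic form I assume such a time-varying quadratic Q-model takes (this is my own reconstruction, not necessarily the paper's exact parameterization) is

    \[
    Q_t(s, a) \;\approx\; \frac{1}{2}
    \begin{bmatrix} s \\ a \end{bmatrix}^{\top}
    \begin{bmatrix} Q_{ss,t} & Q_{sa,t} \\ Q_{as,t} & Q_{aa,t} \end{bmatrix}
    \begin{bmatrix} s \\ a \end{bmatrix}
    +
    \begin{bmatrix} q_{s,t} \\ q_{a,t} \end{bmatrix}^{\top}
    \begin{bmatrix} s \\ a \end{bmatrix}
    + q_{0,t},
    \qquad
    \arg\max_{a} Q_t(s, a) = -Q_{aa,t}^{-1}\,(Q_{as,t}\, s + q_{a,t}),
    \]

assuming Q_{aa,t} is negative definite. If this reading is right, it also explains why the linear feedback gains of the Gaussian policy fall out of the fitted Q-model directly, without ever representing a linearized transition model; spelling this out would help readers coming from model-based trajectory optimization.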
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a method for data-based trajectory optimization which uses both a KL constraint (to the previous policy) and an entropy constraint to control the policy updates. Local time-dependent models of the Q-function are learned instead of local models of the dynamics as in GPS. The method is applied to a 4-bar swing-up problem and a simulated table-tennis task.

Clarity - Justification:
Usually poorly written papers are substantively lacking. This seems to be a rare case of good research, poorly described. The most egregious deficiency is the absence of a short algorithm-style summary of the proposed method. Instead, the method is spread throughout the entire paper, which makes it very difficult to follow. For example, even though Section 4 is supposedly about sample efficiency, Section 4.1 describes the core setup of the supervised learning problem. The rest of Section 4 is a grab-bag of ideas (all of them quite interesting) which are never summarized, and only some of which are used in the results section. In the results section, the benchmark problems and general setup are not well described. What are the mechanical details of the 4-bar swing-up problem? How was GPS implemented? Was Levine's publicly available codebase used, or was it re-written? Related work is described both in the intro and in (the strangely located) Section 5 in very odd terms. The only citation from before the last 10 years is Bertsekas' textbook. Even the first sentence of the abstract, "Trajectory optimization generally operates by iterative local optimization of the policy, alternating local approximation of the dynamics model and conservative policy update to ensure the stability of the update," is a reasonable description of Levine's GPS but NOT of trajectory optimization in general, which has absolutely nothing to do with learning or with policies and is a basic technique in optimal control, used in the 1960s to land on the moon.
The idea of using value functions alone (without dynamics models) for trajectory optimization was already proposed in [Todorov 2009, iterative local dynamic programming], and perhaps earlier; I didn't check the citations therein.

Significance - Justification:
This method shows a lot of promise but is very difficult to evaluate based on the results section. Some of this is not the authors' fault but has to do with a lack of standardised benchmark problems. MOTO seems to do better than GPS on the swing-up task, but not significantly so. It would be nice if a problem were described in which the proposed algorithm succeeded but GPS failed. It's very nice that MOTO can dispense with the DMPs for the table-tennis task. Two other issues impacting significance are scalability concerns and the inherent restriction to episodic tasks. It would be nice if these issues were discussed.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I believe that this paper can be improved significantly with a major re-write, especially of Section 4. See above for details.

=====