Paper ID: 1228
Title: Model-Free Imitation Learning with Policy Optimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a new model-free method for imitation learning that directly learns a policy from expert trajectories, rather than inferring the reward/cost function used by the expert and then running RL to find a policy. Avoiding solving the RL problem in the inner loop has the potential to dramatically reduce computation and improve scalability over existing methods that use an inverse RL approach. The paper contributes a new policy search method that extends an existing policy search method, TRPO; a nice review of several existing approaches demonstrating how they fit into the paper's theoretical framework; and an empirical demonstration of the potential scalability of the proposed approach.

Clarity - Justification:
The clarity of the text is quite good; the paper is well written and very polished. The technical content is mostly accessible to an RL researcher, and more so to one who works within this specific sub-field. I found myself often referring to Kakade and Langford (2002), and I suggest revisiting that work to improve the clarity of your presentation.

The following is a list of minor typos or improvements:
- In my view the notation and descriptions could be improved. The paper clearly builds on the notation of Schulman et al., who made use of several results from Kakade and Langford. I think the Kakade and Langford notation, though subtly different, is much easier for an RL researcher, and they specifically mention when things are defined for notational convenience (e.g., the \gamma-discounted future state distribution). You have space in the text, and I think it would improve things greatly.
- Line 127 should be rephrased for clarity.
- Is Syed et al.'s approach restricted to tabular domains?
- Line 431: "high-quality steps"? Do you mean updates, backups?
- Perhaps add a footnote reminding the reader what "majorizing" means.
- Line 513: the \dot{c} subscripts are redundant; they are just c subscripts.
- I don't think it is accurate to call optimal planning "plain reinforcement learning".
- Lines 063-065: vague; consider rewording. "Generally" in what sense?

Significance - Justification:
I think this paper makes a solid contribution and should be accepted. The authors introduce a new algorithm that appears to remove one of the main burdens of current inverse RL algorithms. The experiments show promise, but I have several questions for the authors that will help refine the final score on the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The main contributions of the paper are straightforward, so I will focus on the open questions and potential missing pieces, both to invite the authors to clarify and to help improve the paper. My main concerns are the lack of explicit theoretical justification and some issues with the empirical results.

The paper as it stands is mostly algorithm, with no theoretical justification or analysis of the proposed method. In the related work section (end of page 2) the paper criticizes prior work for not having local optimality guarantees. Perhaps you can clarify the guarantees of your method and formalize them. For example, is your objective convex?

The experiments raise some questions. I agree that the "Waterworld" domain is potentially interesting, but using non-standard domains is a bit risky without extensive empirical investigation.
I could not find other publications using this domain; perhaps it is nearly trivial. For example, a popular implementation of the octopus domain has roughly 80 dimensions, but agents that ignore all but a few dimensions can achieve very good performance. This is especially a concern here, where the only results presented are for two different versions of the proposed algorithm; the reader has no context with which to judge the result. There are two ways to fix this: (1) include small benchmark domains where competing approaches can be tested, and then finish with a large demonstration on the Waterworld domain; or (2) provide more evidence of the difficulty of the Waterworld domain, perhaps by trying harder to make baseline comparisons run on it, for example by starting with smaller instances of Waterworld and showing how competing approaches scale in computation and in reward loss as the dimension increases. The current result is basically a proof of concept with no baseline. I would also invite the authors to provide a better description of the experimental setting (parameters) to improve reproducibility.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper suggests tackling imitation learning by direct policy optimization. The key suggestion is to avoid explicitly representing an inferred cost function, and instead to improve the learned policy with respect to a worst-case cost function evaluated directly from the data.

Clarity - Justification:
If you have four levels of subscripts upon subscripts, you are doing something wrong. I have not internalized the method enough to propose alternative symbols, but the readability of many equations is very poor. However, the overall structure of the narrative arc is clear and consistent, and the results section is fairly well written.

Significance - Justification:
I find the main message of the paper appealing: one can achieve near-expert performance using TRPO and data alone. However, the only comparison is with a REINFORCE variant of the same main idea (model-free policy search), which is known to be inefficient. A comparison with other approaches to imitation learning would have been a lot more illuminating.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The method presented in this paper is scalable and hence shows promise of being relevant to real-world applications. The main weakness of the paper in its current form is the limited comparison with other state-of-the-art imitation learning algorithms.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes two policy-gradient-based approaches for apprenticeship learning, a REINFORCE-like and a TRPO (trust region policy optimization)-like approach, through the development of a unified view of apprenticeship learning. The cost function is assumed to be a linear model, which leads to a simple estimation of the gradient using samples collected by the current and expert policies. The experimental results show that the TRPO-like approach outperforms the REINFORCE-like approach.

Clarity - Justification:
Although the paper reviews related existing apprenticeship learning methods well, the organization of the paper does not help me easily understand the entire proposed method and its complexity. That is, the explanation of the proposed methods is scattered across three and a half pages (Section 4) with many derivations of equations. It is not so clear what the final resulting algorithm looks like.
I would like to ask the authors to make the main point of the proposed method clear with a shorter explanation, and to add pseudocode and the complexity of the algorithm for a more concrete understanding.

Significance - Justification:
There is no comparison with existing apprenticeship learning methods in the experimental demonstration. In particular, the authors insist in the abstract that existing approaches cannot be applied directly to large state spaces and high-dimensional continuous environments. I understand this observation comes from modeling the policy with a parameter \theta. However, the proposed methods may also need a large number of basis functions to approximate both the policy and cost functions well in large state spaces and high-dimensional continuous action spaces. I would like to ask the authors to add an experimental comparison with appropriate existing apprenticeship learning methods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1) Vectors are not expressed differently from scalars in this paper, which may lead readers to misunderstandings. Please express vectors in bold font.
2) If I were the author of this paper, I would focus on either the REINFORCE-like or the TRPO-like apprenticeship learning method and use more space for qualitative and quantitative comparison with existing methods to make the motivation clear.

=====