Paper ID: 30
Title: Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present an algorithm that incorporates cost learning into the inner (policy learning) loop of IOC and learns a non-linear continuous cost function, represented as a neural network with two novel regularization methods, under unknown dynamics.

Clarity - Justification:
The paper is mostly well written and motivated. However, a few areas are unclear, as outlined below. In addition, the use of "deep networks" in the title is disappointing when only a shallow network is used in the applications.

Significance - Justification:
The paper presents an interesting solution to the problem of learning complex non-linear cost functions in high-dimensional spaces with unknown dynamics. Of particular interest are the regularization methods, which do not appear to be specific to the proposed method and would be useful to other IOC algorithms.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In the 'Guided Cost Learning' section: A better description of the difference between the proposed method and the method presented in (Levine and Abbeel, 2014) would be helpful. What new problem is being addressed that prevents the previous method from being used?

In 'Cost optimization and importance weights': This section was hard to follow, specifically the first paragraph, which has a typo on line 444. It is unclear why the demonstration batch needs to be added to the background batch.
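For concreteness, the estimator I have in mind, written in my own notation (which may not match the paper's), is

\mathcal{L}_{\mathrm{IOC}}(\theta) = \frac{1}{N} \sum_{\tau_i \in \mathcal{D}_{\mathrm{demo}}} c_\theta(\tau_i) + \log \frac{1}{M} \sum_{\tau_j \in \mathcal{D}_{\mathrm{samp}}} \frac{\exp(-c_\theta(\tau_j))}{q(\tau_j)},

where the second term is an importance-sampled estimate of log Z and q is the density of the background sampling distribution. My guess is that the demonstrations are folded into D_samp so that the sample set has support wherever exp(-c_theta) is large, which keeps the importance weights bounded when q covers the demonstrations poorly. If that is the intent, stating it explicitly (and defining the density used to weight the demonstrations) would make this section much easier to follow.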
In Algorithm 2: Should 'sample batch' be changed to 'background batch' in line 4?

Some citations that could be added: There should be a reference to deep Q-learning ('Human-level control through deep reinforcement learning', Mnih et al.), since it is a method that uses neural networks to learn non-linear functions for solving sequential decision tasks in high-dimensional feature spaces with unknown dynamics.

In 'Preliminaries': There is recent work on using search-based methods to form efficient bounded approximations of Z in MaxEnt IOC for large domains ('Softstar: Heuristic-Guided Probabilistic Inference', Monfort et al.).

In 'Representation and regularization': Line 550 could use a citation pointing to which prior methods are being referenced.

An additional issue with the paper is the claim of learning a deep network for the cost function. While the algorithm does allow a deep neural network to be used, the presented applications use a shallow network of only 2 hidden layers with a small number of parameters (compared to standard deep networks). This leaves open the question of whether the gradient presented in Section 4.1 would be effective in more complex architectures. The use of the term 'deep networks' should be supported with evidence that the method works with a deep network.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper extends sample-based maximum entropy IRL to non-linear cost functions parameterized as neural networks. In addition, it guides the sample generation for IRL by optimizing a policy with an LQR controller. In this way it simultaneously learns the cost function and a policy. The proposed method is demonstrated on simulated tasks and on real-world robotic tasks on a PR2 robot.

Clarity - Justification:
Overall the paper is easy to follow. The notation Z in line #349 is overloaded. The use of \vspace in the caption of Figure 2 should be minimized. Figure 1 is never referenced in the text.

Significance - Justification:
IRL is of interest to the robotics community, and the paper proposes a deep learning IRL framework for the case where the dynamics are unknown. For many real-world robotics problems the dynamics are not known in advance, and hence this work could be useful. However, I think ICML is not the best fit for this work. A better venue for this paper could be R:SS.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Three important components of this work are sample-based MaxEnt IRL, the parameterization of the cost function, and policy optimization under unknown dynamics. Each component has been studied before individually; this paper combines them into a single system and demonstrates it on several robotics tasks.

The paper lacks scientific novelty, and it is not clear which parts of the system are necessary. For example, previous works have studied IRL under unknown dynamics (line 226). It is not clear whether it is necessary to include policy learning, or whether one could simply replace the linear cost of previous works with a non-linear cost. How well would such an approach work?

Furthermore, it is also not clear what the benefit is of learning both the cost and the policy. The authors say that sometimes, when the cost fails, the policy can still be used, and they give some intuition behind this. However, what is the scientific justification? Can similar results be reproduced on other problems? This is important because policy learning is probably the only thing that distinguishes the proposed algorithm from previous work on sample-based IRL (in addition to the use of a non-linear cost). Hence the authors should properly justify the importance of learning a policy. In its present form, policy learning appears to be forced into the formulation.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper contributes a sample-based MaxEnt IRL algorithm for robot manipulation that (a) applies neural networks for feature representation, and (b) solves policy learning under unknown dynamics (jointly with cost learning).

Clarity - Justification:
This paper is very dense and many details are omitted. In particular, since the policy learning part depends largely on (Levine and Abbeel, 2014), more explanation of that algorithm would be helpful. There is also little description of the baselines in the experiment section: what are you comparing against, and why were these prior works chosen?

Significance - Justification:
I'm not an expert in this field, but it seems that similar ideas, such as a neural network cost representation (Wulfmeier et al., 2015), sampling from the learned policy (Kalakrishnan et al., 2013), and learning with unknown dynamics (Levine & Abbeel, 2014), have been tried before. However, this paper is the first to unify these approaches, and the empirical evaluation is convincing. Therefore I think it makes a contribution to the field.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think this paper provides a nice framework that combines multiple works addressing different challenges in MaxEnt IRL. That said, I have to say that I don't have the knowledge to situate this work in the existing literature.

One advantage of the proposed algorithm is that it does not have a separate policy learning pass; instead, the cost function and the policy are updated alternately inside the same iteration loop. I'm curious, though, whether interleaving the two optimizations will lead to local optima problems.
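To check my own reading of this interleaved structure, the following is a minimal 1-D toy version of what I believe the alternating updates do. This is my own construction (all names and the setup are made up, not the authors' code): a "trajectory" is a single scalar, the cost is quadratic in one parameter, and the sampling policy is a fixed-variance Gaussian.

# A minimal 1-D toy of the interleaved cost/policy updates as I read Algorithm 1.
# My own construction, not the authors' code: c_theta(x) = (x - theta)^2 and the
# sampling policy q is a Gaussian with fixed variance.
import numpy as np

rng = np.random.default_rng(0)

def q_pdf(x, mu, sigma):
    """Density of the Gaussian sampling policy q = N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

demos = rng.normal(2.0, np.sqrt(0.5), size=50)  # demos from p*(x) proportional to exp(-(x - 2)^2)
theta = 0.0                                     # cost parameter to learn
mu, sigma = -1.0, 1.0                           # sampling policy, initialized far from the demos
bg_x, bg_q = [], []                             # background samples and their densities at sampling time

for _ in range(200):
    # 1. Generate samples from the current policy and add them to the background set.
    xs = rng.normal(mu, sigma, size=20)
    bg_x.extend(xs)
    bg_q.extend(q_pdf(xs, mu, sigma))

    # 2. Cost update: gradient step on the importance-sampled MaxEnt IOC objective.
    #    The demos are folded into the sample set (weighted here by the current q,
    #    a simplification) so the weights stay usable even when q barely covers them.
    samp = np.concatenate([np.array(bg_x), demos])
    dens = np.concatenate([np.array(bg_q), q_pdf(demos, mu, sigma)])
    w = np.exp(-(samp - theta) ** 2) / dens
    w /= w.sum()
    grad = 2.0 * (np.sum(w * samp) - demos.mean())  # d L_IOC / d theta for this toy cost
    theta -= 0.1 * grad

    # 3. Policy update: one gradient step on the expected cost under q
    #    (with fixed sigma the entropy term is constant, so it is omitted).
    mu -= 0.1 * 2.0 * (mu - theta)

print(f"cost minimum ~ {theta:.2f}, policy mean ~ {mu:.2f} (demos are centered at 2.0)")

Even in this toy, the interleaving question is visible: the cost update only sees samples from wherever the current policy happens to be, so some discussion of sensitivity to the initial policy would strengthen the paper.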
The baselines need more explanation. Also, why are the results of RelEnt IRL listed as n/a in Table 1? It also seems that demo init has less variance but converges to a higher cost than rand init; is there any explanation for this?

Minor: What is LQR? Typo: "we that our objective ..."
=====