We thank the reviewers for their thoughtful and constructive comments. Their main concerns are: (A) clarity of presentation, (B) the difficulty of function approximation for policies and costs in large environments, (C) convergence of our algorithm, and (D) experimental comparisons against prior work.

A: All reviewers expressed concerns about the readability of our paper. We will consolidate our algorithm into pseudocode, simplify our notation to eliminate excess subscripts, and add explanations of the motivation behind our notation.

B: Reviewer 1 noted that while our method can scale to large environments because we fit parameterized policies, we did not address the difficulty of function approximation for policies and costs in such environments, and that a large number of basis functions might be needed for success. We would like to clarify that our reported experiments demonstrate successful use of large neural network policies with tens of thousands of parameters, matching the function-approximation capabilities of state-of-the-art deep RL algorithms such as Deep Q-learning (Mnih et al., 2013, 2015) and TRPO (Schulman et al., 2015). As for cost functions, we restrict ourselves to linearly parameterized costs, as mentioned at the end of Section 3.1.2; we leave more expressive cost classes (such as neural network costs) to future work.

C: Reviewer 3 raised a concern about the lack of an explicit theoretical justification of convergence for our algorithms. We do obtain guaranteed convergence to local minima, because our methods are local optimization algorithms on a fixed objective (Equation 6). Convergence of our REINFORCE variant (Section 4.1) follows from the well-known convergence of stochastic gradient descent, and convergence of our TRPO variant (Section 4.2) follows because it is a majorization-minimization algorithm, as we prove in Equations 23-29 of Section 4.2.2. We will make these guarantees explicit in the text; an illustrative sketch of the stochastic gradient argument appears at the end of this response.

D: All reviewers expressed concerns about the experiments, so we performed additional experiments that we will include in the final version. To demonstrate the quality of the locally optimal policies returned by our algorithm (addressing concern C), we compared our REINFORCE variant, using Gibbs tabular policies, to LPAL (Syed, Bowling, and Schapire, 2008), an LP-based algorithm for tabular settings that is guaranteed to return globally optimal apprentice policies. We ran both algorithms on the gridworld environment of Abbeel and Ng (2004) with the same experimental methodology, and we found that despite our local optimality guarantee, we consistently learned policies achieving at least 98% of the performance of the policies learned by LPAL, with similar sample complexity. Our algorithm’s training time also scales favorably compared to LPAL: on a large gridworld with 65536 states, LPAL (using the Gurobi LP solver) took on average 10 minutes to train, with large variance across instantiations of the expert to imitate, whereas our algorithm consistently took around 4 minutes. To demonstrate our algorithm’s performance in a continuous environment, we ran an experiment in the planar navigation environment of Levine and Koltun (2012), in which the agent moves in a plane to seek out Gaussian-shaped costs, and the algorithm must imitate expert behavior without knowledge of the cost coefficients.
We compared the trajectories produced by our TRPO variant’s learned policy to those produced by trajectory optimization on a cost learned by Levine and Koltun’s CIOC algorithm, a model-based IRL method designed for continuous settings with full knowledge of dynamics derivatives. Even though our method is model-free and does not use dynamics derivatives, it consistently learned policies achieving zero reward loss (that is, perfect imitation) when given at least 16 demonstration trajectories, matching CIOC’s performance. With fewer demonstrations, our algorithm did not match CIOC’s performance; we suspect this is because knowledge of dynamics derivatives gives CIOC significant prior knowledge. Finally, we ran our TRPO variant on Levine and Koltun’s highway driving task, in which the learner must imitate driving behaviors (aggressive, tailgating, evasive) in a continuous driving simulation. Levine and Koltun computed their driving behaviors with a trajectory optimization procedure allowed to observe the locations of all cars on the road during planning. We made the task harder and more realistic by learning a neural network policy that sees the state only through a 610-dimensional partial observation: a depth image of nearby cars and lane markings within its field of view. Despite this increased difficulty, our algorithm successfully learned driving policies for all behaviors in the task’s dataset, matching the behaviors learned by CIOC according to the behavior statistics measured by Levine and Koltun.
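Sketch for point C: the following is a minimal, illustrative summary of the stochastic gradient argument, not the exact estimator of Section 4.1; the trajectory cost c(tau), sample count N, and step sizes alpha_k are generic placeholders, and the precise objective is Equation 6 of the paper.

\[
\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ c(\tau) \right]
= \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[ c(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big],
\qquad
\hat{g}_k = \frac{1}{N} \sum_{i=1}^{N} c(\tau_i) \sum_{t} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i),
\qquad
\theta_{k+1} = \theta_k - \alpha_k \hat{g}_k .
\]

Because \(\hat{g}_k\) is an unbiased estimate of the gradient, standard stochastic-approximation results (for example, with step sizes satisfying \(\sum_k \alpha_k = \infty\) and \(\sum_k \alpha_k^2 < \infty\)) give convergence to a stationary point of the objective, which is the sense in which we claim convergence to local minima.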