We thank the reviewers for their constructive comments. We would like to clarify that our experiments demonstrate that our method can solve a range of challenging, real-world inverse optimal control (IOC) problems that cannot be solved by prior methods. These results, quantified in Figure 2 and Table 1, demonstrate a substantial improvement over previous work. Attaining such state-of-the-art results under challenging and realistic assumptions constitutes an important empirical advance in IOC/IRL, an area that has been of great interest to the machine learning community in the past few years. We will make the suggested clarifications in the final version.

------- R1 -------

> It is not clear whether it is necessary to include policy learning or one can just replace linear cost of previous works with a non-linear cost? How well will such an approach work?

We already compare to this directly (see Fig. 2 and Table 1), and show that our method succeeds whereas the suggested approach fails on all tasks except 2D navigation.

> It is not clear which parts of the systems are necessary.

In addition to the main experiment mentioned above, we show three ablations in Appendix B involving the importance weights and the policy optimization objective. We have since run an additional ablation on regularization, which we would be happy to add to the paper.

> it is also not clear what is the benefit of learning both the cost and policy.

In all tasks in which the prior methods fail, the policy learned by our method succeeds. Due to the non-convexity of the problem, it is infeasible to learn a globally valid cost function for some complex tasks given limited demonstration data. Our IOC method can instead find a local solution in which the learned policy converges to the demonstration distribution, even though the learned cost may not be a global solution. Note that the prior methods recover neither a successful cost nor a successful policy on such tasks, while our approach at least recovers a successful policy. Our experiments in Section 6.2 validate this explanation. We will discuss this further in the final version.

> ICML is not the best fit for this work.

Most prior IOC/IRL algorithms have appeared at ICML (e.g. Ng & Russell '00, Abbeel & Ng '04, Ratliff et al. '06, Levine & Koltun '12, Dvijotham et al. '13) or at other machine learning venues such as NIPS, AAAI, and AISTATS (e.g. Syed & Schapire '07, Ziebart et al. '08, Boularias et al. '11).

------- R2 -------

> The use of the term 'deep networks' should be supported with evidence that the method works with a deep network.

We initially chose the term "deep" because our cost functions involve significantly more parameters than those of any existing IOC method, but we agree with R2 and will update the title to remove the word.

> What new problem is being addressed that prevents (Levine and Abbeel 2014) from being used?

Levine & Abbeel '14 assume that a cost function for the task is provided, while we are concerned with learning the cost from demonstrations. We will clarify this.

> It is unclear why the demonstrated batch needs to be added to the background batch.

This is for stability. Without it, the objective optimized by mini-batch optimization methods can be unbounded, as the background samples can become infinitely costly. The optimization is particularly prone to this instability when the background samples are far from the demonstration distribution, as they are at initialization. We will make this section clearer.
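As a rough sketch of this point (using simplified notation here rather than the exact form in the paper): with a demonstration set $\mathcal{D}_{\text{demo}}$ of size $N$ and a background sample set $\mathcal{D}_{\text{samp}}$ of size $M$ drawn from a sampling distribution $q$, the sample-based objective takes the form

\[
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{\tau_i \in \mathcal{D}_{\text{demo}}} c_\theta(\tau_i)
\;+\; \log \frac{1}{M}\sum_{\tau_j \in \mathcal{D}_{\text{samp}}} \frac{\exp(-c_\theta(\tau_j))}{q(\tau_j)}.
\]

If $\mathcal{D}_{\text{samp}}$ contains only background samples, the optimizer can push $c_\theta(\tau_j) \to \infty$ on those samples while leaving the demonstration costs unchanged, driving the log term (and hence $\mathcal{L}$) to $-\infty$. If the demonstrations are appended to $\mathcal{D}_{\text{samp}}$, the log term is lower-bounded by $-\min_i c_\theta(\tau_i) - \log(M\, q(\tau_i))$, which the first term offsets, so the mini-batch objective remains bounded below.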
> In algorithm 2: Should sample batch be changed to background batch in line 4?

Yes. Thank you for catching that.

> Suggested citations Mnih et al., Monfort et al., line 550

We will add these citations.

------- R3 -------

> there is little description of the baselines in the experiment section. What are you comparing against and why are these prior work chosen?

A full description of these prior methods, Boularias et al. and Kalakrishnan et al., is in Section 2 and Section 4.1. We chose these methods because they are the only prior IOC methods that can handle unknown dynamics. We will make this clearer.

> I'm curious if interleaving the two optimization will result in some local optima problem.

Local optima can be a problem for any method that uses nonlinear function approximators (Ratliff et al. '09, Levine & Koltun '12). Interleaving the two optimizations actually makes it more likely to find a better solution with a fixed number of samples.

> why results of RelEnt IRL are n/a in Table 1?

The result for RelEnt IRL is in Table 1 under "reopt c_theta," because RelEnt IRL does not itself produce a policy, only a cost function. We will clarify this in the final version.

> It seems demo init has less variance but converges to higher cost than rand init? Any explanation on this?

The demo init is closer to the desired sampling distribution and thus has lower variance but more bias. We will add a discussion of this to the paper.
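To sketch the variance side of this (in the same simplified notation as above, which is our shorthand rather than the paper's exact form), the partition function is estimated by importance sampling:

\[
\hat{Z} \;=\; \frac{1}{M}\sum_{\tau_j \sim q} \frac{\exp(-c_\theta(\tau_j))}{q(\tau_j)},
\qquad
\mathrm{Var}\big[\hat{Z}\big] \;=\; \frac{1}{M}\left( \mathbb{E}_{q}\!\left[\frac{\exp(-2\,c_\theta(\tau))}{q(\tau)^2}\right] - Z^2 \right),
\]

which is the standard importance-sampling variance and shrinks as $q$ approaches the target $p(\tau) \propto \exp(-c_\theta(\tau))$. The demo-initialized sampler starts closer to this target, hence the lower variance; but with finite samples its batches concentrate near the demonstrations and under-represent the rest of the space, which is one way to see where the bias comes from.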