We would like to thank all reviewers for their valuable comments.

Reviewer 2:
Q: Paper would have benefited from a second example of smooth control learning.
A: We agree that having a second application domain would be beneficial.

Reviewer 3:
Q: Having to learn from a set of static demonstrations, without a real-time expert during training, poses serious problems for interactive imitation learning algorithms. This paper solves this problem in a specific setting (although it didn't emphasize this point).
A: We view our approach as an effective way of imposing regularization on a highly expressive policy class such as neural nets or decision trees. In the camera-planning setting, this regularizer encodes our domain knowledge that a good trajectory should be smooth. In this way we extend approaches such as SEARN and DAgger, which do assume access to a dynamic oracle/expert. The key question we ask is: how do we incorporate smoothness constraints that encourage the model to learn a smooth function, while still enjoying policy improvement guarantees similar to those of SEARN and DAgger? For the squared imitation loss, we show that it suffices to simulate oracle feedback, rather than requiring continuous access to dynamic oracle demonstrations. Note that our other results stand even if we assume oracle feedback as in SEARN and DAgger. Finally, we do make the critical assumption that the world is split into exogenous and internal states. The context x from the world is completely exogenous and is not influenced by the actions of our policy. This setting is useful for modeling our motivating application of camera tracking (x is the behavior of the players). Other possible settings include helicopter acrobatics (x is the wind turbulence) and smart grid management (x is the energy demand on the grid). We will revise the text to better highlight our modeling assumptions and contributions, and we thank the reviewer for highlighting this lack of clarity.

Q: In SEARN, the final policy does not degrade too much from the initial policy, which is the oracle. However, SIMILE uses an initial policy learned from the oracle-generated dataset, and only improvement between two consecutive iterations is shown. What about the error of the final policy compared to the oracle?
A: The adaptive learning rate makes a direct comparison between the final and initial policies much more unwieldy than in SEARN. We instead emphasize that SIMILE achieves monotonic policy improvement, which guarantees convergence. Note that, unlike the motivating applications of SEARN, the assumption of having access to a good initial oracle may not be valid in continuous domains such as camera control.

Q: In the proof of the Claim following Equation 14 in the appendix, why do s_t and s_t' have the same context x_t? I think this is not true for general sequential processes.
A: This is related to the modeling assumption that contexts x are exogenous. A state s consists of the input context x (current and past player positions) and past actions (past camera angles). At time t, two different policies pi and pi' yield two different states s_t and s_t', but both states correspond to the same input context x_t; only the resulting action sequences differ. In fact, the low-curvature requirement need only apply to states that correspond to the same input context x. This may not be true for general sequential processes.
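To make the exogenous-context assumption concrete, below is a minimal sketch (ours, not code from the paper) in which the state pairs a window of recent contexts with the policy's own past actions; the rollout helper and the two toy policies are hypothetical. Rolling two different policies over the same context stream produces states that differ only in their action histories, which is the property used in the claim above.

```python
import numpy as np

def rollout(policy, contexts, history=3):
    """Roll a policy over an exogenous context stream.

    The state at time t pairs a window of recent contexts x with the
    policy's own recent actions a; the contexts are never affected by
    the actions (the exogenous assumption).
    """
    actions = [0.0] * history                       # padded action history
    states = []
    for t in range(len(contexts)):
        x_window = tuple(contexts[max(0, t - history + 1): t + 1])
        s_t = (x_window, tuple(actions[-history:]))
        states.append(s_t)
        actions.append(policy(s_t))
    return states, actions[history:]

# Two different toy policies rolled out on the SAME context stream:
contexts = np.random.randn(100)                     # e.g. player positions over time
pi_smooth   = lambda s: 0.9 * s[1][-1] + 0.1 * s[0][-1]   # leans on its last action
pi_reactive = lambda s: s[0][-1]                          # follows the context directly

states_a, _ = rollout(pi_smooth, contexts)
states_b, _ = rollout(pi_reactive, contexts)

# s_t and s_t' differ in their action histories but share the context x_t,
# which is what the low-curvature condition is applied to.
for s_t, s_t_prime in zip(states_a, states_b):
    assert s_t[0] == s_t_prime[0]                   # identical exogenous context window
```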
Q: It would be nice to see some ablation analysis of the regularizer and the smoothed feedback.
A: In addition to testing different smooth feedback levels (Section 6), we evaluated the effect of varying the level of regularization in isolation, without smoothed feedback. Figure 7 in the anonymous CVPR submission (included in the supplement) displays the effect of different regularizers.

Q: Have you tried beta larger than 0.2, and will the advantage of adaptive beta still hold?
A: Yes; a very large beta may overshoot and worsen the combined policy after a few initial improvements.

Reviewer 6:
Q: Many policy gradient papers argue that stochastic policies are often superior to deterministic ones.
A: Among algorithms that combine policies across iterations, as typified by SEARN (and, earlier, CPI), we show that in continuous domains deterministic policy combination performs no worse than stochastic combination in expectation (Corollary 5.3), with the added benefit that the final policy is far less unwieldy (see the sketch below). In the broader context of general reinforcement learning, however, deterministic policies are not necessarily better than stochastic ones.

Q: The task used for testing looks small, with a one-dimensional input and output.
A: All theoretical results in our paper hold for multi-dimensional input contexts x and actions a. In our experiments, x has dimension 14. The case of actions a of dimension k > 1 can be analyzed by treating the k components separately. In the future, we can further validate SIMILE on larger multi-dimensional datasets.
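To illustrate the deterministic versus stochastic policy combination discussed in our reply to Reviewer 6's first question, here is a minimal sketch (not the paper's implementation); the function names and the toy 1-D policies are purely illustrative. In a continuous action space, deterministically interpolating the two policies' actions matches the stochastic mixture in expectation while yielding a single deterministic policy rather than a distribution over policies.

```python
import numpy as np

def stochastic_mixture(pi_old, pi_new, beta, rng=np.random.default_rng(0)):
    """SEARN-style combination: at each step, sample which policy acts."""
    def pi(s):
        return pi_new(s) if rng.random() < beta else pi_old(s)
    return pi

def deterministic_interpolation(pi_old, pi_new, beta):
    """Deterministic combination for continuous actions: interpolate the
    two policies' actions instead of sampling a policy."""
    def pi(s):
        return beta * pi_new(s) + (1.0 - beta) * pi_old(s)
    return pi

# Toy 1-D policies (illustrative only).
pi_old = lambda s: 0.5 * s
pi_new = lambda s: 1.0 * s
beta = 0.1

pi_det = deterministic_interpolation(pi_old, pi_new, beta)
pi_sto = stochastic_mixture(pi_old, pi_new, beta)

s = 2.0
print(pi_det(s))                                    # 1.1
print(np.mean([pi_sto(s) for _ in range(100000)]))  # ~1.1: same action in expectation
```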