Paper ID: 327
Title: Smooth Imitation Learning for Online Sequence Prediction

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper formalizes the task of smooth imitation learning, where the trained policy should not only closely mimic some expert policy but also satisfy a certain smoothness requirement. A principled online algorithm called SIMILE is introduced to solve this task, and several performance guarantees are derived. Empirical experiments demonstrate the effectiveness of the method in an automated camera control domain.

Clarity - Justification: The paper is very well written and has a good overall structure.

Significance - Justification: Imitation learning is a challenging and popular task. The authors formalize an extension of the typical imitation learning setting by considering a smooth policy class. This could potentially be beneficial for many robotic applications, an important application domain of imitation learning. Besides formalizing the problem, the authors give an effective algorithm and provide theoretical guarantees.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): A very nice paper. Clearly written and a significant contribution in my opinion. A few minor points:

The authors mention that one of the advantages of SIMILE is that it produces a deterministic policy rather than a stochastic one. I think the arguments for this are rather weak. In fact, an argument made in many policy gradient papers is that they learn a stochastic policy rather than a deterministic one; these papers argue that in partially observable environments stochastic policies are often superior to deterministic ones.

The task used for testing looks quite small, with a one-dimensional input and a one-dimensional output. One of the big challenges in imitation learning is generalization to states for which there are no expert samples in the training set. How well SIMILE deals with this remains largely unclear.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper tackles the problem of imitation learning with an added "smoothness" constraint on the execution of the given task. The primary motivation lies in the need to execute smooth movements while automatically recording with a camera. The algorithm is analyzed and proved to converge; furthermore, it benefits from an adaptive learning rate.

Clarity - Justification: The paper is overall clear and well written.

Significance - Justification: The work is deeply inspired by SEARN (Daumé III et al., 2009), but the contributions are significant. The motivation is strong given the ubiquitous need for smoothness in automatic camera control, and the theoretical analysis is sound.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper is overall clear and the contributions are significant. It would have benefited from a second example application of smooth control learning, though.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper contributes a reduction-style algorithm for "smooth" imitation learning in a continuous environment, where "smooth" means that the policy should be stable, i.e., have low curvature. Smoothness is ensured by a smooth regularizer and by using smoothed feedback as learning targets. The authors also prove a faster convergence rate by leveraging the smoothness conditions and an adaptive learning rate.
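To check that I read the algorithm correctly, here is a minimal sketch of one SIMILE-style iteration as I understand it. All function and variable names are mine, not the paper's, and the 1-D least-squares learner and the direction of the smoothing weight sigma are purely illustrative.

```python
import numpy as np

def rollout(policy, contexts):
    """Run the deterministic policy over a 1-D context sequence."""
    return np.array([policy(x) for x in contexts])

def fit_policy(contexts, targets):
    """Toy supervised learner: least-squares fit of a(x) = w*x + b."""
    X = np.stack([contexts, np.ones_like(contexts)], axis=1)
    w, b = np.linalg.lstsq(X, targets, rcond=None)[0]
    return lambda x: w * x + b

def simile_iteration(policy, contexts, expert_actions, beta, sigma):
    """One illustrative iteration: smooth the feedback, fit, then blend."""
    actions = rollout(policy, contexts)
    # Smoothed ("virtual") feedback: expert targets pulled toward the
    # current rollout so the regression targets stay smooth relative
    # to the trajectory the learner actually visits.
    targets = sigma * expert_actions + (1 - sigma) * actions
    new_policy = fit_policy(contexts, targets)
    # Deterministic blending with learning rate beta (SEARN mixes
    # policies stochastically; SIMILE interpolates outputs instead).
    return lambda x: beta * new_policy(x) + (1 - beta) * policy(x)
```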
Clarity - Justification: I find the paper mostly easy to follow. Some notations are a bit confusing, though.

- The loss function \ell(\pi(s)) does not indicate the ground truth. Since the target can be either the expert feedback or the virtual feedback, the notation should make this unambiguous.
- Section 4, "\ell_n is the imitation loss...": what is the difference between \ell_n and \ell?
- Section 1, "...super-linear training time": why is that?

Significance - Justification: The proposed algorithm is largely based on SEARN (Daumé III et al., 2009), but with new theoretical analysis. In particular, I find the adaptive learning rate an important contribution in this field.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The proposed smooth imitation learning is an interesting problem that may arise in many applications, e.g., helicopter control or car driving; to the best of my knowledge it has not been formally examined before. However, I would prefer a different perspective on the problem and solution. I think the true problem here is that we do not have a dynamic oracle/expert, one that shows the agent what to do (implicitly accounting for the smoothness requirement) when it is off track. SIMILE addresses this problem by approximating a dynamic oracle based on the prior knowledge that the policy needs to be smooth. Having to learn from a set of static demonstrations, without a real-time expert during training, poses serious problems for such interactive imitation learning algorithms. This paper makes a contribution by solving this problem in a specific setting (although it does not emphasize this point).

I have some questions regarding the technical details.

- In SEARN, the initial policy is the oracle policy, so it makes sense to show that the final policy does not degrade too much from the initial policy. However, SIMILE uses an initial policy learned from the oracle-generated dataset, and only the improvement between two consecutive iterations is shown. What about the error of the final policy compared to the oracle?
- In the proof of the Claim following Equation 14 in the appendix, why do s_t and s_t' have the same context x_t? I think this does not hold in general sequential processes; perhaps it holds in this particular problem?

Suggestions for the empirical evaluation:

- It would be more convincing to include at least one more experiment from another continuous domain, e.g., helicopter control.
- Two factors contribute to the smoothness: the regularizer and the smoothed feedback. It would be nice to see some ablation analysis here.
- In Figure 3, it seems that increasing \beta would continue to yield a higher learning rate. Have you tried \beta larger than 0.2, and does the advantage of adaptive \beta still hold? (A toy illustration of the adaptive-rate idea, as I understand it, is sketched after this review.)

Typos:

- PIthat -> that
- a form myopic -> a form of myopic
- The drawbacks of this approach are: (i) (ii) (ii) -> (i) (ii) (iii)
- due exploiting -> due to exploiting
- we see that that -> we see that
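To make my question about the adaptive learning rate concrete, here is a toy sketch of one plausible error-ratio rule for setting \beta each iteration. This is a hypothetical illustration of the idea that \beta should grow when the newly fit policy outperforms the current one; it is not the paper's actual formula.

```python
def adaptive_beta(loss_new, loss_old, eps=1e-12):
    """Hypothetical error-ratio rule (illustrative, not the paper's formula).

    Weights the newly fit policy by its relative advantage: the smaller
    its imitation loss compared to the current policy's, the larger beta.
    """
    return loss_old / (loss_new + loss_old + eps)

# A much-improved new policy gets a learning rate near 1, so the blend
# moves quickly toward it; with equal losses, beta stays at 0.5.
print(adaptive_beta(loss_new=0.1, loss_old=0.9))  # about 0.9
print(adaptive_beta(loss_new=0.5, loss_old=0.5))  # 0.5
```

=====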