We thank the reviewers for their useful comments and apologize for the unclear writing and the missing experimental details. These issues will certainly be fixed in the final version of the paper (see below).

Reviewer3:
- We apologize for the unclear description of the algorithm. We will implement the reviewer's suggestions, such as adding an algorithm box and restructuring Section 4 and the related work.
- We agree that the first sentence of the abstract can be misleading, as it is only a valid description of a subset of trajectory optimization methods. However, it applies to more methods than GPS, such as DDP, iLQG and AICO. We will clarify this in the abstract.
- We thank the reviewer for pointing out the Todorov09 paper and will include the reference. However, we believe that this method still requires the dynamics model to compute the Hamiltonian, even if the model is not linearized. As such, it cannot learn directly from trajectory data. We plan to compare to this method in future work. We chose to compare our algorithm to GPS because both bound the policy change in a similar way (with a KL divergence), which isolates our choice of learning a Q-Function instead of a model of the dynamics.
- The experiments already show a case where GPS fails. In the double-link swing-up task, MOTO and GPS perform similarly, but GPS fails to find successful swing-ups in the quad-link setting. When the dynamics are highly non-linear, GPS cannot cope with large initial exploration noise: the resulting state distribution is too wide and the linear approximation of the dynamics becomes poor. GPS thus gets stuck, as we are forced to use small exploration noise, as can be seen in the quad-link experiments.
- All ideas presented in Section 4 appear at least once in the experimental section, except for the forward propagation of the state distribution with importance weights. We kept a short presentation of it because we find the idea interesting, provided one can work around the degeneracy of the importance weights (IWs) in the later time-steps, as discussed in the section.

Reviewer2:
- The single initial demonstration in the table tennis task helps MOTO capture the basic template of a forehand strike, but it contains no correlation between, e.g., torque and ball position: the demonstration only sets the bias of the policy to the trajectory torques at each time-step, while the gain matrices are initialized to zero. This demonstration does help MOTO find better solutions, and with a sample complexity that, albeit still somewhat high, is reasonable enough to envision an application on a physical system (about 1000 sampled trajectories for a policy that returns the ball almost every time). Note that this initial demonstration was also used to initialize the forcing function of the DMP (circa 100 parameters), which leaves only the goal position and goal velocity (18 parameters) as open parameters to be optimized by the competing methods.

Reviewer1:
- In addition to the time-dependency, we generalized the MORE policy update to linear controllers (with a state-dependent mean) instead of a Gaussian with a fixed mean.
- As investigated in the experimental section on the double-link swing-up task, choosing successively smaller M (the number of samples per iteration) yields only moderate speed-ups once M gets small. This might indeed be explained by the degeneracy of the IWs when the step-size is not sufficiently small. One possible perspective is to adapt the step-size of each time-step by linking it to quantities such as the uncertainty in the learned Q-Function or the effective number of samples used to learn it; it is likely that the IW degeneracy occurs with varying intensity across time-steps and iterations, and that the best fixed step-size is too conservative for most of them (see the sketch after this list).
- We demonstrated our approach on the table tennis problem, which has 9 action and 21 state variables; we believe the dimensionality of this problem is reasonably high. Bayesian dimensionality reduction as in the MORE paper was not needed, since the number of actions (9) is typically much smaller than the number of parameters (20-100). However, dimensionality reduction offers interesting directions for future research.
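To make this perspective concrete, here is a minimal sketch of the kind of per-time-step diagnostic we have in mind. The function names, the bounds eps_min/eps_max and the linear mapping from effective sample size to step-size are purely illustrative and not part of the paper:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish's effective sample size of a set of importance weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def per_timestep_kl_bounds(weights_per_step, eps_min=0.05, eps_max=0.5):
    """Illustrative rule: shrink the KL step-size of a time-step when its
    importance weights are degenerate (low effective sample size)."""
    bounds = []
    for w in weights_per_step:
        ess_ratio = effective_sample_size(w) / len(w)  # in (0, 1]
        bounds.append(eps_min + (eps_max - eps_min) * ess_ratio)
    return bounds
```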
Perspectives on the finite horizon concerns: Given a clustering of the state space, we can generalize our algorithm to the infinite horizon case by associating to each cluster a local policy and a local quadratic approximation of the Q-Function used to update that policy. The clustering can be returned by an unsupervised learning algorithm. In the present paper, however, we restrict ourselves to the particular case where the clustering results from the time-dependence assumption together with the normality of the state distributions. This is both a fairly standard setting in trajectory optimization and one that still has interesting and unsolved applications, for which the episodic nature does not seem to be the biggest bottleneck.
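To illustrate how the time-dependence assumption instantiates this clustering view, below is a schematic sketch of the per-time-step structure (a time-indexed linear-Gaussian controller with a bias and gain matrix, as in our reply to Reviewer2, paired with a local quadratic Q-model). The class and field names are ours, chosen for illustration only:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class LocalController:
    """Time-dependent linear-Gaussian policy pi_t(a|s) = N(K_t s + k_t, Sigma_t)."""
    K: np.ndarray       # gain matrix (initialized to zero in the table tennis task)
    k: np.ndarray       # bias (initialized from the demonstration torques)
    Sigma: np.ndarray   # exploration covariance

    def sample(self, s, rng):
        return rng.multivariate_normal(self.K @ s + self.k, self.Sigma)

# In the finite-horizon setting of the paper, the "clusters" are simply the
# time-steps t = 0, ..., T-1: each holds its own local controller and its own
# local quadratic model of Q_t, fitted from the sampled trajectories and used
# for the KL-bounded policy update.
```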