Paper ID: 789
Title: Fast Parameter Inference in Nonlinear Dynamical Systems using Iterative Gradient Matching

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a nonlinear optimization framework for estimating unknown parameters in systems driven by nonlinear ODEs. The paper and methods are interesting; the two main issues are (1) the novelty compared to prior art, e.g. Gonzalez et al. (2013, 2014), and (2) the advantages over that prior art.

Clarity - Justification:
The difference between Gonzalez et al. (2013, 2014), as described in equations 10-12, and the new approach, described in (14), is difficult to follow. While the topic is sophisticated, it seems that the authors could do more to make the connection (and the differences) clear. See the detailed comments.

Significance - Justification:
It is not clear that the new method is a significant novel extension of Gonzalez et al., possibly because of the presentation. It is also not clear that there are significant performance improvements.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Clarity: Eqs. 10-12 appear to be the Fenchel dual to (8), as near as I can tell. It is hard to follow what is approximated and what is not. Is K_g approximated? Is its inverse approximated? Is P approximated? Equation (13) is the standard representer theorem, which comes from primal-dual relationships analogous to the relationship between (8) and (10-12). Are the b's in (13) different from the \alpha's in (10)? Clearly some of the same elements are present in both the Gonzalez approach and the proposed approach. However, the expressions in (14) and (10-12) are hard to compare. Is the novelty of the approach that you are incorporating gradient terms? How does the remaining part of objective (14) compare with (10)-(12)? Are the dimensionalities of the unknown parameters the same or different in the two approaches?
The formulation and methods are interesting, and I agree that taking advantage of (13) in the optimization is a great idea. However, the method is clearly close in spirit to Gonzalez et al., and yet it is hard for me to precisely differentiate the two because of the presentation in sections 2 and 3. Moreover, the performance of the new method (in terms of the quality of parameter estimation) appears to be generally similar to that of Gonzalez et al. I think statistical significance is a strange way to present results; instead, I would present relative errors of the estimates, or relative error differences. In terms of time, how does one factor in the training for the parameter \rho in (14)? Is that included in the timing? At what point in steps 1-3 is it done? What is the effect of \rho on the parameter estimates? Overall, I think the paper is interesting, and it would benefit from clearer explanations and a more intuitive presentation of the numerical results.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a computationally efficient method to estimate the parameters of nonlinear ordinary differential equations without relying on numerical integration. The method relies on kernel regression to model the states of the dynamical system, and uses a gradient-matching term to regularize the kernel-based state model. The gradient-matching regularizer penalizes discrepancies between the gradient of the kernel-based state model and the gradient given by the ODE. Parameter fitting is done in a two-stage procedure where the parameters of the kernel are fitted while keeping the parameters of the ODE fixed, and vice versa. The main contribution of the paper is a computationally efficient way to do the parameter estimation in this two-step procedure.

Clarity - Justification:
The paper is well written and rather easy to follow.
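For readers unfamiliar with gradient matching, the two-stage procedure summarized above can be sketched roughly as follows. This is a minimal illustration only, not the paper's method: the logistic ODE, RBF kernel, weights `rho` and `lam`, and the plain BFGS inner solves are all assumptions made for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: noisy observations of a logistic ODE
# dx/dt = theta * x * (1 - x), with true theta = 1.5 (for illustration).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 25)
x_true = 1.0 / (1.0 + 9.0 * np.exp(-1.5 * t))    # closed-form solution
y = x_true + 0.01 * rng.standard_normal(t.size)  # noisy observations

def rbf(t1, t2, ell=0.5):
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def rbf_grad(t1, t2, ell=0.5):
    # Derivative of the RBF kernel w.r.t. its first argument.
    d = t1[:, None] - t2[None, :]
    return -(d / ell ** 2) * np.exp(-0.5 * (d / ell) ** 2)

K = rbf(t, t)        # kernel state model: x(t) = K @ b
Kg = rbf_grad(t, t)  # its time derivative: dx/dt = Kg @ b
lam, rho = 1e-3, 1.0  # RKHS and gradient-matching weights (assumed)

def fit_states(theta, b0):
    """Stage 1: fit kernel weights b with the ODE parameter fixed."""
    def obj(b):
        x, dx = K @ b, Kg @ b
        f = theta * x * (1.0 - x)              # ODE right-hand side
        return (np.sum((y - x) ** 2)           # data fit
                + rho * np.sum((dx - f) ** 2)  # gradient matching
                + lam * b @ K @ b)             # RKHS penalty
    return minimize(obj, b0).x

def fit_theta(b, theta0):
    """Stage 2: fit the ODE parameter with the kernel weights fixed."""
    x, dx = K @ b, Kg @ b
    def obj(th):
        return np.sum((dx - th[0] * x * (1.0 - x)) ** 2)
    return minimize(obj, [theta0]).x[0]

# Alternate the two stages, as in the two-step procedure described above.
theta, b = 1.0, np.zeros(t.size)
for _ in range(5):
    b = fit_states(theta, b)
    theta = fit_theta(b, theta)
```

The point of the sketch is the structure, not the numbers: no ODE solver is ever called, and each stage only touches one block of parameters, which is where the computational savings come from.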
Significance - Justification:
My understanding is that the main contribution of the paper is the optimization procedure used to optimize over the two sets of parameters, i.e. the kernel parameters and the ODE parameters, which brings significant computational benefits. All the other components, i.e. the kernel regression and the gradient matching, have already been used in parameter estimation for ODEs. The paper contains a set of experiments on the estimation of the parameters of two dynamical systems. The experiments show that the method, in addition to being computationally efficient, outperforms a number of state-of-the-art methods for ODE parameter estimation in the quality of the estimated parameters. While the estimation of the parameters of nonlinear ODEs is an important issue for a very large number of problems and domains, I wonder whether this would be of interest to the ICML audience, especially since the paper does not bring something new from a learning perspective.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I do not have any other comments. As I said, the paper reads well, the method is well explained, and the experiments are well described. Some small details: what is the multi-layer kernel perceptron given in eq. (25)? Is there a reference for it? In lines 301-302, a tilde is missing from one of the B's.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a novel algorithm for parameter estimation in ODEs from noisy data. A loss function based on kernel regression is optimized with an EM-like algorithm. The authors show that the EM-like algorithm converges to a local optimum. The new method is compared to existing work.

Clarity - Justification:
The exposition is clearly written and easy to follow. I found the presentation of the experimental results unintuitive, and also a bit redundant.
I would have preferred a more comprehensive comparison with related work beyond Gonzalez et al. (2014).

Significance - Justification:
This is a novel approach, to the best of my knowledge. The progress is substantial but not groundbreaking. Showing the local convergence of the alternating update scheme is a nice addition to the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The manuscript is a solid piece of research that should be considered for publication. The derivation is clear and convincing, and the experimental evidence shows applicability to real-world problems. However, I think the paper might have made a much stronger point if it had also compared the proposed method to recently published GP-based approaches such as Calderhead et al. (2009) or Dondelinger et al. (2013). The authors only address this by noting that the 'focus of [the paper] has been on fast inference'. It should have been easy to provide a runtime comparison that makes this point.

=====