Spotlight in Workshop: Continuous Time Perspectives in Machine Learning
Should You Follow the Gradient Flow? Insights from Runge-Kutta Gradient Descent
Xiang Li · Antonio Orvieto
Recently, it has become popular in the machine learning community to model gradient-based optimization algorithms as ordinary differential equations (ODEs). Indeed, widely used optimizers such as SGD and momentum can be recovered from the corresponding ODE using first-order numerical integrators such as the explicit and symplectic Euler methods. In contrast, very little theoretical or experimental investigation has been carried out on the properties of higher-order integrators in optimization. In this paper, we analyze the properties of high-order Runge-Kutta (RK) integrators on gradient flows, in the context of both convex optimization and deep learning. We show that, while RK provides a close approximation to the gradient flow, it induces an increase in sharpness (the maximum Hessian eigenvalue) at the solution, a quantity believed to be negatively correlated with generalization. In addition, we show that, while high-order RK descent methods are stable for a broad range of stepsizes, convergence speed (in terms of training loss) is usually negatively affected by the method order.
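To make the comparison concrete, here is a minimal sketch of the two kinds of integrators the abstract contrasts, applied to the gradient flow dx/dt = -∇f(x) of a toy quadratic: one explicit Euler step is exactly a gradient-descent step, while a classical fourth-order Runge-Kutta (RK4) step follows the flow more closely at the cost of extra gradient evaluations. The quadratic objective, stepsize, and iteration count below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def grad_f(x, A):
    """Gradient of the toy quadratic f(x) = 0.5 * x^T A x."""
    return A @ x

def euler_step(x, h, A):
    """Explicit Euler on dx/dt = -grad f(x): exactly one gradient-descent
    step with stepsize h."""
    return x - h * grad_f(x, A)

def rk4_step(x, h, A):
    """Classical 4th-order Runge-Kutta step on the same gradient flow."""
    k1 = -grad_f(x, A)
    k2 = -grad_f(x + 0.5 * h * k1, A)
    k3 = -grad_f(x + 0.5 * h * k2, A)
    k4 = -grad_f(x + h * k3, A)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random positive-definite quadratic as a stand-in objective.
    M = rng.standard_normal((10, 10))
    A = M.T @ M + np.eye(10)
    x0 = rng.standard_normal(10)
    x_euler, x_rk4 = x0.copy(), x0.copy()
    h = 0.05
    for _ in range(100):
        x_euler = euler_step(x_euler, h, A)
        x_rk4 = rk4_step(x_rk4, h, A)
    # RK4 tracks the continuous gradient flow more accurately per step,
    # but uses four gradient evaluations instead of one.
    print("Euler (gradient descent) loss:", 0.5 * x_euler @ A @ x_euler)
    print("RK4 loss:                     ", 0.5 * x_rk4 @ A @ x_rk4)
```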