Spotlight in Workshop: Continuous Time Perspectives in Machine Learning
The Gap Between Continuous and Discrete Gradient Descent
Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich
While valuable insights can be obtained by analyzing gradient descent (GD) in its continuous form, we argue that a complete understanding of the mechanics behind GD's success may require considering the effects of a large step size in the discrete regime. To support this claim, we demonstrate how trajectories differ between small and large learning rates when GD is applied to a neural network, and observe that a large step size allows GD to escape from a local minimum. Furthermore, it has been widely observed in neural network training that a large step size is essential for obtaining superior models with stochastic gradient descent (SGD). In this work, through a novel set of experiments, we show that even though stochastic noise is beneficial, it is not enough to explain the success of SGD, and that a large learning rate is essential for obtaining the best performance even in stochastic settings. Finally, we prove that, on a certain class of functions, GD with a large step size follows a different trajectory than GD with a small step size, which can facilitate convergence to the global minimum.
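The escape effect described above can be illustrated on a toy one-dimensional problem. The sketch below is not taken from the paper: the objective (the pointwise minimum of a sharp quadratic around x = 1 and a flat quadratic around x = -1), the starting point, and the two step sizes are all assumptions chosen for illustration, and the names `f_sharp`, `f_flat`, `grad`, and `run_gd` are hypothetical. With the small step size, GD stays in the sharp basin, as a continuous-time gradient flow would. With the large step size, the iterates oscillate with growing amplitude inside the sharp basin, leave it, and then contract toward the flat global minimum.

```python
# Toy illustration (an assumption, not the paper's function class):
# f(x) = min(f_sharp(x), f_flat(x)) has a sharp local minimum at x = 1
# (value 0.5, curvature 10) and a flat global minimum at x = -1
# (value 0, curvature 1).

def f_sharp(x):
    return 5.0 * (x - 1.0) ** 2 + 0.5


def f_flat(x):
    return 0.5 * (x + 1.0) ** 2


def grad(x):
    """Gradient of min(f_sharp, f_flat): use whichever branch is lower."""
    if f_sharp(x) < f_flat(x):
        return 10.0 * (x - 1.0)
    return x + 1.0


def run_gd(x0, step_size, num_steps):
    """Plain gradient descent: x <- x - step_size * grad(x)."""
    x = x0
    for _ in range(num_steps):
        x = x - step_size * grad(x)
    return x


if __name__ == "__main__":
    x0 = 1.2  # start inside the sharp basin
    # Small step: per-step factor in the sharp basin is 1 - 0.01 * 10 = 0.9,
    # so the iterates contract toward the local minimum at x = 1.
    print("small step:", run_gd(x0, 0.01, 200))
    # Large step: factor in the sharp basin is 1 - 0.25 * 10 = -1.5 (unstable),
    # while in the flat basin it is 1 - 0.25 * 1 = 0.75 (stable).
    print("large step:", run_gd(x0, 0.25, 200))
```

The design choice here is that the large step size exceeds the stability threshold 2 / curvature for the sharp minimum but not for the flat one, so only the flat minimum can retain the iterates. This is a property of this particular construction and should not be read as the mechanism proved in the paper.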