
The Gap Between Continuous and Discrete Gradient Descent
Amirkeivan Mohtashami · Martin Jaggi · Sebastian Stich

While it is possible to obtain valuable insights by analyzing gradient descent (GD) in its continuous form, we argue that a complete understanding of the mechanics leading to GD's success may indeed require considering the effects of using a large step size in the discrete regime. To support this claim, we demonstrate the difference in trajectories for small and large learning rates when GD is applied to a neural network, observing the effect of an escape from a local minimum with a large step size. Furthermore, it has been widely observed in neural network training that when applying stochastic gradient descent (SGD), a large step size is essential for obtaining superior models. In this work, through a novel set of experiments, we show that even though stochastic noise is beneficial, it is not enough to explain the success of SGD, and a large learning rate is essential for obtaining the best performance even in stochastic settings. Finally, we prove, for a certain class of functions, that GD with a large step size follows a different trajectory than GD with a small step size, which can facilitate convergence to the global minimum.
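As a minimal illustration of the step-size effect the abstract describes (this sketch is not the paper's construction or its class of functions; the quadratics and constants below are chosen purely for exposition): on a quadratic with curvature L, gradient descent with step size lr contracts toward the minimum only when lr < 2/L, so a single "large" step size can converge on a flat minimum while being repelled from a sharp one.

```python
# Gradient descent x_{t+1} = x_t - lr * f'(x_t) on two quadratics:
#   f_sharp(x) = 5 x^2   (curvature 10)
#   f_flat(x)  = 0.5 x^2 (curvature 1)
# With lr = 0.25: on f_flat each step multiplies x by (1 - 0.25*1) = 0.75
# (convergence); on f_sharp it multiplies x by (1 - 0.25*10) = -1.5, so the
# iterates are driven away from the sharp minimum.

def gd(grad, x0, lr, steps):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

grad_sharp = lambda x: 10.0 * x  # derivative of 5 x^2
grad_flat = lambda x: 1.0 * x    # derivative of 0.5 x^2

lr = 0.25
x_flat = gd(grad_flat, 1.0, lr, 50)   # shrinks toward 0
x_sharp = gd(grad_sharp, 1.0, lr, 50) # grows without bound

print(abs(x_flat), abs(x_sharp))
```

The same mechanism, in higher dimensions and on genuinely nonconvex losses, is what allows a large step size to leave sharp basins that a small step size would settle into.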

Author Information

Amirkeivan Mohtashami (EPFL)
Martin Jaggi (EPFL)
Sebastian Stich (CISPA Helmholtz Center for Information Security gGmbH)
