Invited talk in the workshop "Continuous Time Perspectives in Machine Learning"
Continuous vs. Discrete Optimization of Deep Neural Networks
Nadav Cohen
Existing analyses of optimization in deep learning are either continuous, focusing on variants of gradient flow (GF), or discrete, directly treating variants of gradient descent (GD). GF is amenable to theoretical analysis, but it is stylized and disregards computational efficiency. The extent to which it represents GD is an open question in deep learning theory. My talk will present a recent study of this question. Viewing GD as an approximate numerical solution to the initial value problem of GF, I will show that the degree of approximation depends on the curvature around the GF trajectory, and that over deep neural networks (NNs) with homogeneous activations, GF trajectories enjoy favorable curvature, suggesting they are well approximated by GD. I will then use this finding to translate an analysis of GF over deep linear NNs into a guarantee that GD efficiently converges to a global minimum almost surely under random initialization. Finally, I will present experiments suggesting that over simple deep NNs, GD with conventional step size is indeed close to GF. An underlying theme of the talk will be the potential of GF (or modifications thereof) to unravel mysteries behind deep learning.
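For reference, the standard relation between the two viewpoints can be sketched as follows (the notation L for the training loss, theta for the parameters, and eta for the step size is introduced here and is not part of the abstract): GF is the initial value problem of a gradient ODE, and GD is its forward-Euler discretization, which is what makes GD an approximate numerical solution to GF.

```latex
% Gradient flow (GF): initial value problem over the training loss L
\[
  \frac{d\theta(t)}{dt} \;=\; -\nabla L\bigl(\theta(t)\bigr),
  \qquad \theta(0) = \theta_0 .
\]
% Gradient descent (GD) with step size \eta: forward-Euler discretization
% of the ODE above, so that \theta_k approximates \theta(k\eta)
\[
  \theta_{k+1} \;=\; \theta_k \;-\; \eta \, \nabla L(\theta_k),
  \qquad k = 0, 1, 2, \ldots
\]
```

The gap between theta_k and theta(k eta) is the "degree of approximation" referred to above; classical numerical-analysis bounds tie it to the curvature (sharpness of the loss) encountered along the GF trajectory.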