Taming Stochastic Gradient Descent: Almost Sure Convergence and Saddle-Point Avoidance under $(L_{0},L_{1})$-Smoothness
Vassilis Apidopoulos ⋅ Iosif Lytras ⋅ Panayotis Mertikopoulos
Abstract
Many optimization problems in machine learning and data science—from deep neural networks to Bayesian inference and beyond—fall outside the standard Lipschitz smoothness framework that underpins the convergence theory of stochastic gradient descent (SGD). Motivated by this theory-practice disconnect, we examine the almost sure convergence of the trajectories of SGD in non-convex landscapes under a generalized $(L_0,L_1)$-smoothness condition that allows for gradients with superlinear (even exponential) growth. We begin by proposing a taming scheme for SGD that achieves almost sure convergence under a generalized ABC-type condition on the gradient noise. Subsequently, to relax this requirement, we introduce a more flexible, dissipative taming scheme which converges almost surely under weaker moment conditions on the stochastic gradients entering the process. For both taming schemes, we show that the generated trajectories avoid strict saddle points (and/or manifolds thereof) with probability 1; hence, generically, both methods only converge to local minimizers.
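For reference, one standard formulation of $(L_0,L_1)$-smoothness in the literature (due to Zhang et al., 2020; the precise variant adopted in the paper may differ) requires, for a twice differentiable objective $f$,
\[
\|\nabla^2 f(x)\| \;\le\; L_0 + L_1\,\|\nabla f(x)\| \quad \text{for all } x,
\]
which recovers classical $L$-smoothness when $L_1 = 0$ while permitting gradients that grow superlinearly, or even exponentially, along a trajectory.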