From Optimization to Generalization under Heavy-Tailed Data: The Role of Gradient Clipping
Aleksandr Shestakov ⋅ Martin Takac ⋅ Eduard Gorbunov
Abstract
Gradient clipping is widely used to stabilize stochastic gradient methods and is often theoretically motivated by heavy-tailed gradient noise, where even second moments may be infinite. This seemingly contradicts empirical risk minimization, where all moments are finite for a fixed dataset. We resolve this paradox by explicitly separating data sampling from optimization randomness: although moments are finite conditional on the dataset, heavy-tailed data induce dataset-dependent noise whose second moment typically grows with the dataset size $N$. In particular, when $\|\nabla f(x_\star,\xi)\|$ has tail index $\alpha \in (1,2)$, the quantity $\frac{1}{N}\sum_{i=1}^N\|\nabla f(x_\star,\xi_i)\|^2$ scales as $N^{\frac{2}{\alpha}-1}$, leading to convergence guarantees for standard SGD that deteriorate as $N$ increases. In contrast, we show that stochastic gradient descent with clipping avoids this growth and admits finite-sum convergence guarantees under heavy-tailed data for a broad class of step-size and clipping schedules. We further derive generalization bounds for strongly convex, smooth objectives and show that the tail behavior of the gradients at the population minimizer is the key quantity linking optimization and generalization under heavy-tailed data.
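As a minimal numerical sketch of the claimed scaling (not taken from the paper), the snippet below uses Pareto samples as stand-ins for the per-sample gradient norms $\|\nabla f(x_\star,\xi_i)\|$, with an assumed tail index $\alpha = 1.5$ and an illustrative clipping threshold; it checks that the empirical average of the squared norms grows roughly like $N^{2/\alpha - 1}$, whereas the averaged squared clipped norms remain bounded. The distribution, the value of $\alpha$, and the clipping level are assumptions made only for this illustration.

```python
import numpy as np

# Illustrative sketch (assumptions: Pareto-distributed gradient norms,
# alpha = 1.5, clipping threshold lambda = 10). For tail index alpha in (1, 2),
# (1/N) * sum_i |g_i|^2 grows roughly like N^(2/alpha - 1), while clipping
# keeps the averaged squared (clipped) norms of order one.
rng = np.random.default_rng(0)
alpha = 1.5        # assumed tail index of |grad f(x_*, xi)|
clip_level = 10.0  # illustrative clipping threshold

for N in [10**3, 10**4, 10**5, 10**6]:
    # Pareto samples with tail index alpha stand in for per-sample gradient norms.
    g = 1.0 + rng.pareto(alpha, size=N)
    avg_sq = np.mean(g**2)                                   # grows with N
    avg_sq_clipped = np.mean(np.minimum(g, clip_level)**2)   # stays bounded
    print(f"N={N:>8d}  avg |g|^2 = {avg_sq:12.2f}  "
          f"avg min(|g|, lambda)^2 = {avg_sq_clipped:8.2f}  "
          f"N^(2/alpha - 1) = {N**(2.0/alpha - 1.0):10.2f}")
```

Individual runs fluctuate heavily (the averages are themselves heavy-tailed), but the growth trend of the unclipped averages with $N$, in contrast with the clipped ones, is visible across the printed rows.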