General Analysis of LMO-based Optimizers: Beyond Bounded Variance
Egor Shulgin ⋅ Mohamed Awad ⋅ Peter Richtarik ⋅ Eduard Gorbunov
Abstract
We study a broad family of momentum *Linear Minimization Oracle* (LMO) methods that includes normalized SGD with momentum, sign-based (Adam-like) directions, and Muon (spectral) updates. Our focus is on subsampling regimes where the classical uniformly bounded variance assumption can fail even for finite-sum objectives on unbounded domains. To obtain subsampling-faithful guarantees, we analyze this LMO family under *expected smoothness* (ABC condition), which captures common sampling schemes. We establish a unified nonconvex convergence theory via a new self-bounding closure that handles the history coupling induced by momentum under the ABC condition. Our bounds recover known bounded-variance results as a special case and simplify in strong-growth regimes. Specializing to $\tau$-nice sampling, we derive explicit batch-size scaling laws, predicting that the optimal momentum must increase with the batch size to maximize sample efficiency. We further identify a theoretical **optimal batch size** that minimizes total sample complexity. Experiments on linear and matrix regression corroborate these predictions, showing a distinct diagonal shift in the optimal momentum-batch landscape that matches our theoretical scaling.
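To make the method family concrete, below is a minimal sketch of a generic momentum-LMO step. It assumes the common exponential-moving-average momentum $m_t = \beta m_{t-1} + (1-\beta) g_t$ and the update $x_{t+1} = x_t - \eta\,\mathrm{lmo}(m_t)$ over a norm ball; the function name `lmo_step`, the parameters `lr`, `beta`, and `geometry`, and the exact parameterization are illustrative and may differ from the paper's formulation.

```python
import numpy as np

def lmo_step(params, momentum, grad, lr=0.02, beta=0.9, geometry="spectral"):
    """One step of a generic momentum-LMO update (illustrative sketch).

    The momentum buffer is updated as an exponential moving average of the
    stochastic gradient, and the parameters move along the LMO direction of
    the chosen norm ball:
      - "euclidean": normalized-SGD-with-momentum direction m / ||m||
      - "linf":      sign-based (Adam-like) direction sign(m)
      - "spectral":  Muon-like direction U V^T from the SVD of m (matrix params)
    """
    momentum = beta * momentum + (1.0 - beta) * grad
    if geometry == "euclidean":
        direction = momentum / (np.linalg.norm(momentum) + 1e-12)
    elif geometry == "linf":
        direction = np.sign(momentum)
    elif geometry == "spectral":
        # Orthogonalize the momentum matrix: argmax over the spectral-norm ball.
        U, _, Vt = np.linalg.svd(momentum, full_matrices=False)
        direction = U @ Vt
    else:
        raise ValueError(f"unknown geometry: {geometry}")
    params = params - lr * direction
    return params, momentum
```

Swapping the `geometry` argument switches between the three instances named in the abstract while keeping the same momentum and step-size machinery, which is what allows a single convergence analysis to cover all of them.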