PASO: Step Parallel Stochastic Optimization
Jianrong Lu ⋅ Zhuoya Gu ⋅ Haobo Li ⋅ Zhiyu Zhu ⋅ Yechao Zhang ⋅ Jianhai Chen ⋅ Minghui Yang ⋅ Junwei Liu ⋅ Jian Wang ⋅ Qinming He ⋅ Hui Liu ⋅ Junhui Hou
Abstract
This paper addresses the fundamental challenge of accelerating inherently autoregressive gradient descent (GD) optimizers, such as SGD and Adam, from a dynamical-systems perspective. Specifically, we introduce a unified framework that recasts the autoregressive GD process as solving a system of triangular nonlinear equations (TNEs), thereby enabling \textit{step-parallel} training, in which gradients for different GD steps are computed concurrently without sequential dependencies. Within this generic framework, we establish that: (1) the TNE system admits a unique solution that coincides exactly with the autoregressive GD trajectory; and (2) solving the TNE system is guaranteed to converge to the GD trajectory in at most the same number of iterations. Building on these insights, we present \textit{PASO}, the first step-parallel optimizer for accelerating a broad class of GD-based optimizers, including SGD and Adam. Extensive experiments (\textit{e.g.}, on Llama-3.2-1B and diffusion models) validate that PASO achieves up to a \textbf{21}$\times$ reduction in GD steps and a \textbf{4.5}$\times$ wall-clock speedup, with no loss in model quality. Source code is available at: \url{https://anonymous.4open.science/r/PASO-0AF9}.
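To make the recasting concrete, consider a vanilla GD update $\theta_t = \theta_{t-1} - \eta\, g(\theta_{t-1})$ for $t = 1, \dots, T$, where $g$ denotes the (stochastic) gradient and $\theta_0$ is the fixed initialization. The residual notation $F_t$ and the Jacobi-style sweep below are an illustrative sketch of the step-parallel idea under this simplification, not necessarily PASO's exact formulation. The GD trajectory $(\theta_1, \dots, \theta_T)$ is the unique root of the system
\[
F_t(\theta_1, \dots, \theta_T) \;=\; \theta_t - \theta_{t-1} + \eta\, g(\theta_{t-1}) \;=\; 0, \qquad t = 1, \dots, T,
\]
which is (block) lower triangular because each $F_t$ involves only $\theta_{t-1}$ and $\theta_t$. A parallel fixed-point sweep
\[
\theta_t^{(k+1)} \;=\; \theta_{t-1}^{(k)} - \eta\, g\!\bigl(\theta_{t-1}^{(k)}\bigr), \qquad t = 1, \dots, T \ \text{concurrently},
\]
evaluates all $T$ gradients at once. Since $\theta_0$ is fixed, induction gives $\theta_t^{(k)} = \theta_t$ for all $t \le k$, so at most $T$ sweeps recover the exact trajectory, consistent with claim (2); any wall-clock gain comes from the sweep converging for all steps in far fewer than $T$ iterations.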