SoftSignum: Smooth Your Signum For Better Heterogeneity Handling
Abstract
Sign-based optimization methods, such as SignSGD and Signum, have become essential for modern Deep Learning due to their (1) strong performance, (2) low memory footprint, and (3) communication efficiency. Despite their success, these methods suffer from distinct limitations in the terminal phase of training: they decouple updates from gradient magnitudes and fail to account for parameter heterogeneity, often leading to oscillation rather than convergence. While switching to SGD is a potential remedy, a naive "hard" switch is of limited use due to learning-rate mismatches, suboptimal momentum buffers, and its implicit assumption of uniform parameter dynamics. In this work, we propose SoftSignum, a novel optimization method that implements a principled, smooth transition from sign-based updates to SGD, adapting to individual parameter sensitivities. We provide a generalized theoretical framework guaranteeing convergence in stochastic non-convex settings relevant to Deep Learning and demonstrate empirically that SoftSignum effectively handles parameter heterogeneity, yielding superior convergence across diverse tasks, including LLM pretraining, compared to standard sign-based baselines.
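To illustrate the idea, one plausible form of such a smooth transition is a per-parameter convex combination of the sign update and the raw momentum update; the symbols below (mixing weight $\alpha_{t,i}$, momentum $m_{t,i}$, learning rate $\eta_t$) are illustrative assumptions and not the paper's exact formulation:

$$
\theta_{t+1,i} \;=\; \theta_{t,i} \;-\; \eta_t \Big[\,(1-\alpha_{t,i})\,\mathrm{sign}(m_{t,i}) \;+\; \alpha_{t,i}\, m_{t,i}\,\Big], \qquad \alpha_{t,i} \in [0,1],
$$

where $\alpha_{t,i}$ grows from $0$ (pure Signum) toward $1$ (momentum SGD) over training, potentially at different rates for different parameters to reflect their individual sensitivities.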