BAS: Bridging Adam and SignSGD for Memory-Efficient LLM Training
Abstract
We propose Block Adaptive Signum (BAS), which bridges Adam and SignSGD via block-wise scaling of sign updates. By discarding element-wise second moments, BAS reduces memory overhead relative to AdamW without sacrificing performance. Crucially, BAS mimics Adam’s dynamics closely enough to inherit its hyperparameters directly, matching AdamW’s performance without the hyperparameter re-tuning that is a common fragility of prior low-memory optimizers. This structural alignment makes BAS particularly well suited to fine-tuning Adam-pretrained models. Furthermore, we exploit the inherent robustness of sign-based updates to store the first moment in FP8 without performance degradation, shrinking the optimizer-state footprint to 12.5\% of AdamW’s. We prove convergence under standard assumptions and introduce a communication-efficient variant enabled by the sign-based update. Across extensive evaluations, including pre-training a 1.5B-parameter model on 100B tokens and supervised fine-tuning of models up to 32B parameters, we demonstrate that BAS achieves performance on par with AdamW.
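To make the update rule concrete, below is a minimal, hypothetical PyTorch sketch of a BAS-style step, not the paper’s reference implementation. It assumes, beyond what the abstract states, that blocks are contiguous chunks of the flattened parameter and that the per-block scale is the RMS of the first moment within the block (one plausible stand-in for Adam’s element-wise second-moment normalization; the paper’s exact blocking and scaling rules are defined in its method section). Names such as `bas_step` and `block_size` are illustrative.

```python
import torch

def bas_step(param: torch.Tensor,
             grad: torch.Tensor,
             momentum: torch.Tensor,
             lr: float = 1e-3,
             beta: float = 0.9,
             block_size: int = 256) -> None:
    """One in-place BAS-style update (hypothetical sketch).

    Assumes param.numel() is divisible by block_size for simplicity.
    """
    # Update the first moment -- the only element-wise optimizer state kept.
    momentum.mul_(beta).add_(grad, alpha=1.0 - beta)

    # View the flat momentum as (num_blocks, block_size) contiguous blocks.
    m_blocks = momentum.view(-1, block_size)

    # Assumed block-wise scale: RMS of the momentum within each block,
    # replacing Adam's per-element second-moment normalization.
    scale = m_blocks.pow(2).mean(dim=1, keepdim=True).sqrt()

    # Sign update, scaled uniformly within each block.
    update = torch.sign(m_blocks) * scale
    param.view(-1, block_size).sub_(update, alpha=lr)
```

Because the only element-wise state in this sketch is the momentum, storing it between steps in a low-precision format (e.g., PyTorch’s `torch.float8_e4m3fn`) and dequantizing inside the step would realize the FP8 saving the abstract describes.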