M+Adam: Low-Precision Training via Mantissa–Exponent Optimization
Xiaoyuan Liang ⋅ Sebastian Loeschcke ⋅ Mads Toftrup ⋅ Anima Anandkumar
Abstract
Low-precision formats such as BF16 and FP8 can greatly improve training efficiency, but fully low-precision training often degrades accuracy under standard optimizers. We identify a key cause: additive updates can vanish under coarse mantissa resolution. We introduce M+Adam, an optimizer for stable low-precision training that operates on a mantissa--exponent decomposition of the weights and carries out Adam and Madam updates in parallel. Madam is a multiplicative analogue of Adam: instead of adding updates to the weights, it scales them, which makes it naturally suited to updating exponents. Building on this idea, M+Adam applies additive updates to the mantissa and multiplicative updates to the exponent in parallel. We show that purely additive and purely multiplicative updates exhibit complementary failure modes under quantization, and that combining both overcomes them. We establish a monotone descent guarantee for our method under standard smoothness assumptions. In a challenging setting where both weights and compute are in FP8, M+Adam substantially outperforms AdamW (e.g., by 10.51\% perplexity at 350M). Moreover, M+Adam enables stable BF16 training without stochastic rounding and consistently outperforms AdamW across 60M--350M models and $1$--$8\times$ Chinchilla budgets.
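To make the mantissa--exponent idea concrete, below is a minimal conceptual sketch of one such update step. It is not the paper's exact M+Adam algorithm: the function name, hyperparameters, and the simplified RMS-normalized and sign-based rules are illustrative assumptions; the point is only that an additive step can be applied to the mantissa while a multiplicative step on the weight magnitude becomes an additive shift of the exponent.

```python
# Conceptual sketch only (assumed names and update rules, not the paper's M+Adam).
import numpy as np

def mantissa_exponent_step(w, g, state, lr_add=1e-3, lr_mul=1e-2, eps=1e-8):
    """One toy step: additive update on the mantissa, multiplicative update
    on the weight magnitude carried out additively in the exponent (log2) domain."""
    mantissa, exponent = np.frexp(w)            # w = mantissa * 2**exponent
    exponent = exponent.astype(np.float64)

    # Adam-like additive step on the mantissa (second moment only, for brevity).
    state["v"] = 0.999 * state["v"] + 0.001 * g**2
    mantissa = mantissa - lr_add * g / (np.sqrt(state["v"]) + eps)

    # Madam-like multiplicative step on |w|: a multiplicative factor 2**(-delta)
    # is an additive shift of the exponent.
    exponent = exponent - lr_mul * np.sign(w) * np.sign(g)

    return mantissa * np.exp2(exponent)

# Usage on a small random weight vector.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
g = rng.normal(size=4)
state = {"v": np.zeros_like(w)}
w = mantissa_exponent_step(w, g, state)
print(w)
```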