Poster in Workshop: 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)
u-μP: The Unit-Scaled Maximal Update Parametrization
Charlie Blake · Constantin Eichenberg · Josef Dean · Lukas Balles · Luke Prince · Björn Deiseroth · Andres Felipe Cruz Salinas · Carlo Luschi · Samuel Weinbach · Douglas Orr
The recent Maximal Update Parametrization (µP) enables the hyperparameters for small models to transfer directly to large ones, substantially reducing the cost of training by avoiding expensive sweeps at scale. We present a new scheme, u-µP, which improves upon µP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: µP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that the starting scale of these activations (along with weights and gradients) is one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-µP models reaching a lower loss than comparable µP models and working out of the box in FP8.
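To make the scale-preservation idea concrete, here is a minimal NumPy sketch (an illustration of the general principle, not the authors' implementation): if weights and inputs start with unit variance and the matmul output is divided by the square root of the fan-in, the activation scale stays near one regardless of width, which is the property that makes low-precision formats such as FP8 usable from the start.

```python
import numpy as np

def unit_scaled_linear(x, w):
    """Linear layer whose output is divided by sqrt(fan_in), so
    unit-variance inputs and unit-variance weights give roughly
    unit-variance outputs (illustrative sketch only)."""
    fan_in = w.shape[0]
    return (x @ w) / np.sqrt(fan_in)

rng = np.random.default_rng(0)
for width in (256, 1024, 4096):
    x = rng.standard_normal((128, width))    # unit-variance activations
    w = rng.standard_normal((width, width))  # unit-variance weights, no scaling baked in
    y = unit_scaled_linear(x, w)
    print(width, round(float(y.std()), 3))   # ~1.0 at every width
```

The printed standard deviations stay close to one as the width grows, mirroring the claim that activation scale is independent of model size at initialization.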