Poster in Workshop: High-dimensional Learning Dynamics Workshop: The Emergence of Structure and Reasoning
u-μP: The Unit-Scaled Maximal Update Parametrization
Charlie Blake · Constantin Eichenberg · Josef Dean · Lukas Balles · Luke Prince · Björn Deiseroth · Andres Felipe Cruz Salinas · Carlo Luschi · Samuel Weinbach · Douglas Orr
The recent Maximal Update Parametrization (µP) enables the hyperparameters for small models to transfer directly to large ones, substantially reducing the cost of training by avoiding expensive sweeps at scale. We present a new scheme, u-µP, which improves upon µP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: µP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that the starting-scale of these activations is one (along with weights and gradients). This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-µP models reaching a lower loss than comparable µP models and working out-of-the-box in FP8.
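To make the "starting-scale of one" idea concrete, here is a minimal sketch of a unit-scaled linear operation: weights are drawn with unit variance and the matmul output is rescaled by 1/sqrt(fan_in), so activation scale starts at roughly one at every width. This is an illustrative toy in NumPy, not the authors' implementation; the function name, widths, and batch size are all assumptions for demonstration.

```python
import numpy as np

def unit_scaled_linear(x, w):
    # Hypothetical sketch of Unit Scaling: weights use unit-variance init
    # (no fan-in factor baked in) and the output is rescaled by
    # 1/sqrt(fan_in), keeping activation scale ~1 independent of width.
    fan_in = w.shape[0]
    return (x @ w) / np.sqrt(fan_in)

rng = np.random.default_rng(0)
for width in (256, 1024, 4096):                 # widths chosen for illustration
    x = rng.standard_normal((32, width))        # unit-scale input activations
    w = rng.standard_normal((width, width))     # unit-variance weights
    y = unit_scaled_linear(x, w)
    print(width, round(float(y.std()), 3))      # stays near 1.0 at every width
```

Because the output scale is width-independent and starts at one, the same property µP provides across model sizes holds from initialization, which is what makes low-precision formats such as FP8 viable without per-tensor rescaling sweeps.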