

Poster
in
Workshop: High-dimensional Learning Dynamics: The Emergence of Structure and Reasoning

u-μP: The Unit-Scaled Maximal Update Parametrization

Charlie Blake · Constantin Eichenberg · Josef Dean · Lukas Balles · Luke Prince · Björn Deiseroth · Andres Felipe Cruz Salinas · Carlo Luschi · Samuel Weinbach · Douglas Orr


Abstract:

The recent Maximal Update Parametrization (µP) enables the hyperparameters for small models to transfer directly to large ones, substantially reducing the cost of training by avoiding expensive sweeps at scale. We present a new scheme, u-µP, which improves upon µP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: µP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that the starting scale of these activations is one (along with weights and gradients). This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-µP models reaching a lower loss than comparable µP models and working out-of-the-box in FP8.
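The core Unit Scaling idea described in the abstract can be illustrated with a minimal sketch (an assumption-laden toy example, not the authors' implementation): initialize weights with unit variance and divide the matmul output by the square root of the fan-in, so activations start at scale one regardless of model width.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_scaled_linear(x, w):
    # Rescale the output by 1/sqrt(fan_in) so that, for unit-scale
    # inputs and unit-variance weights, the output also has scale ~1.
    fan_in = w.shape[0]
    return (x @ w) / np.sqrt(fan_in)

d_in, d_out = 1024, 4096
x = rng.standard_normal((32, d_in))     # unit-scale input activations
w = rng.standard_normal((d_in, d_out))  # unit-variance weights
y = unit_scaled_linear(x, w)
print(round(float(y.std()), 2))         # close to 1.0 for any width
```

Because the output scale stays near one at any width, activations fit comfortably in low-precision formats such as FP8, which is the property the abstract highlights.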
