Expo Talk Panel
Distillation Scaling Laws
Dan Busbridge · Jason Ramapuram · Russ Webb · Floris Weers
West Ballroom A
Sun 13 Jul, 5 p.m. – 6 p.m. PDT
Abstract:
Smaller models are cheaper to serve, run faster, use less battery, produce less heat, have a lower inference carbon footprint, and are easier for academics to study. Historically, however, small, capable models have been expensive to train. Knowledge distillation can reduce pretraining costs, yet it remains poorly understood. We find a distillation scaling law that enables efficient pretraining strategies, bringing our understanding of distillation closer to our understanding of supervised pretraining.
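As background, knowledge distillation typically trains a small student model to match a larger teacher's output distribution. The sketch below shows the standard temperature-scaled distillation objective (a mix of soft-target KL divergence and hard-label cross-entropy); it illustrates the generic technique only, not the specific formulation studied in this work, and all names and values are illustrative.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic KD objective: blend of KL divergence between softened
    teacher and student distributions, and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student), scaled by T^2 to keep gradients comparable
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    # Hard-label cross-entropy at temperature 1
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * temperature**2 * kl + (1.0 - alpha) * ce)

# Toy usage: batch of 4 examples over a 10-way output space
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 10))
teacher = rng.normal(size=(4, 10))
labels = rng.integers(0, 10, size=4)
print(distillation_loss(student, teacher, labels))
```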