
Apple

Expo Talk Panel

Distillation Scaling Laws

Dan Busbridge · Jason Ramapuram · Russ Webb · Floris Weers

West Ballroom A
Sun 13 Jul 5 p.m. PDT — 6 p.m. PDT

Abstract:

Smaller models are cheaper to serve, faster, use less battery, produce less heat, have a lower inference carbon footprint, and are easier for academics to study. Historically, however, small, capable models have been expensive to train. Knowledge distillation can reduce pretraining costs, yet it remains poorly understood. We find a distillation scaling law that enables efficient pretraining strategies, bringing our understanding of distillation closer to our understanding of supervised pretraining.
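
As background for the setting the talk studies (this is a minimal illustrative sketch, not the authors' code or the scaling law itself), knowledge distillation trains a small student to match a larger teacher's output distribution, typically by mixing a temperature-scaled KL term against the teacher with the usual cross-entropy on the labels. The weighting `alpha` and temperature `T` below are assumed hyperparameters for illustration only.

```python
# Minimal knowledge-distillation loss sketch (PyTorch).
# alpha and T are illustrative hyperparameters, not values from the talk.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled teacher and
    # student distributions, rescaled by T^2 to keep gradients comparable
    # to the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example: batch of 4 examples, 10-class output.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```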
