

Apple

Expo Talk Panel

Distillation Scaling Laws

Dan Busbridge · Jason Ramapuram · Russ Webb · Floris Weers


Abstract:

Smaller models are cheaper to serve and faster to run; they use less battery, produce less heat, have a lower inference carbon footprint, and are easier for academics to study. Historically, however, small, capable models have been expensive to train. Knowledge distillation can reduce pretraining costs, yet it is poorly understood. We find a distillation scaling law that enables efficient pretraining strategies, bringing our understanding of distillation closer to our understanding of supervised pretraining.
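
For context, supervised pretraining scaling laws of the Chinchilla form predict loss from model parameters N and training tokens D, for example

    L(N, D) = E + A / N^alpha + B / D^beta,

where E is the irreducible loss and A, B, alpha, beta are fitted constants. This illustrative form is drawn from the supervised scaling-law literature, not from this work; by analogy, a distillation scaling law would additionally account for the teacher, with the specific functional form and fitted coefficients given in the paper rather than reproduced here.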
