Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning
Hien Dang ⋅ Pratik Patil ⋅ Alessandro Rinaldo
Abstract
Self-distillation (SD), retraining a student on a mixture of ground-truth labels and a teacher’s own predictions using the same architecture and training data, often improves generalization empirically, but it is unclear when improvement is guaranteed. We study SD for ridge regression with an unconstrained mixing weight $\xi \in \mathbb{R}$. Conditional on the training data, and without any distributional assumptions, we prove that for any squared prediction risk $R$ (including out-of-distribution risk), the optimally mixed student strictly improves upon the ridge teacher at every regularization level $\lambda$ at which the teacher risk is not stationary ($R'(\lambda) \neq 0$). We also characterize the optimal mixing weight $\xi^\star$ in terms of the risk derivative $R'$, showing that, surprisingly, it can be negative. To further quantify SD risk improvements, we derive exact risk asymptotics in the proportional asymptotics regime for general anisotropic covariance and deterministic signals. Finally, we propose a consistent one-shot tuning method to estimate $\xi^\star$ without retraining, sample splitting, or grid search. Experiments on real-world tasks and pre-trained neural network features validate our theory and tuning method.
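A minimal sketch of the self-distillation setup described in the abstract, assuming a standard ridge estimator and a convention in which $\xi$ weights the teacher's in-sample predictions and $1-\xi$ weights the observed labels; the paper's precise formulation and tuning procedure may differ.

```python
# Hypothetical sketch: unconstrained self-distillation for ridge regression.
# The mixing convention below is an assumption, not necessarily the paper's.
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimator: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def self_distilled_student(X, y, lam, xi):
    """Refit ridge on a mixture of labels and the teacher's own predictions.

    xi is the unconstrained (possibly negative) weight on the teacher's
    in-sample predictions; 1 - xi is the weight on the ground-truth labels.
    Same data, same architecture, same regularization level.
    """
    beta_teacher = ridge_fit(X, y, lam)
    y_teacher = X @ beta_teacher                 # teacher's own predictions
    y_mixed = xi * y_teacher + (1.0 - xi) * y    # mixed training targets
    return ridge_fit(X, y_mixed, lam)

# Toy usage: compare teacher and student test risk over a few values of xi.
rng = np.random.default_rng(0)
n, d, lam = 200, 50, 1.0
beta_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ beta_star + rng.normal(size=n)
X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta_star

teacher_risk = np.mean((X_test @ ridge_fit(X, y, lam) - y_test) ** 2)
for xi in (-0.5, 0.0, 0.5, 1.0):
    student_risk = np.mean(
        (X_test @ self_distilled_student(X, y, lam, xi) - y_test) ** 2
    )
    print(f"xi={xi:+.1f}: student risk {student_risk:.4f} (teacher {teacher_risk:.4f})")
```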