The Role of Feedback Alignment in Self-Distillation
Abstract
A language model often performs better when its prompt includes helpful context, such as feedback on a previous attempt. Self-distillation trains the model to reach this performance without the context at inference time. It does so by matching the model's output distribution across two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns depends on what context the self-teacher receives, yet the design of that context (e.g., its structure, granularity, and alignment with the student's reasoning) remains largely unexplored. We study feedback design by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, beating GRPO by 16.11 points in Avg@12 on the evaluation set and reference-solution-conditioned self-distillation by 5.27 points. Per-token advantage analysis explains why: step-aligned feedback concentrates distributional shifts at the tokens where reasoning fails, acting as implicit process supervision. In contrast, the reference solution spreads signal across all tokens, because an alternative derivation diverges from the solver's trace even at correct steps. This suggests that structural alignment between feedback and the solver's reasoning is one of the main drivers of self-distillation effectiveness.
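To make the per-token view concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the student and self-teacher are scored on the same sampled sequence, and it uses two illustrative quantities that the abstract does not define: the log-ratio log p_teacher(y_t) - log p_student(y_t) as a stand-in for a per-token advantage, and KL(teacher || student) as a stand-in for the distribution-matching signal (the KL direction is an assumption). The function names and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_log_ratio(student_logits, teacher_logits, tokens):
    """Return log p_teacher(y_t) - log p_student(y_t) for each position t.

    student_logits[t] and teacher_logits[t] are vocabulary logits at
    position t, both scored on the same sampled sequence `tokens`
    (an assumed setup, not taken from the paper).
    """
    ratios = []
    for s, t, y in zip(student_logits, teacher_logits, tokens):
        ratios.append(log_softmax(t)[y] - log_softmax(s)[y])
    return ratios

def per_token_kl(student_logits, teacher_logits):
    """Return KL(teacher || student) at each position (direction assumed)."""
    kls = []
    for s, t in zip(student_logits, teacher_logits):
        ls, lt = log_softmax(s), log_softmax(t)
        kls.append(sum(math.exp(a) * (a - b) for a, b in zip(lt, ls)))
    return kls

# Toy data: 3 positions, vocabulary of size 3.
# The self-teacher disagrees with the student only at position 1,
# mimicking feedback that targets a single faulty reasoning step.
student = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
tokens = [0, 0, 1]  # tokens the student actually sampled

ratios = per_token_log_ratio(student, teacher, tokens)
kls = per_token_kl(student, teacher)

for t, (r, k) in enumerate(zip(ratios, kls)):
    print(f"position {t}: log-ratio={r:+.3f}  KL={k:.3f}")
```

In this toy case both quantities are zero at positions 0 and 2 and nonzero only at position 1, which is the concentrated pattern the abstract attributes to step-aligned critique. A teacher conditioned on an alternative derivation would, under the same sketch, tend to show nonzero values at many positions.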