Privileged Information Distillation for Language Models
Abstract
Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. We introduce π-Distill, a joint teacher–student framework that trains a PI-conditioned teacher and an unconditioned student simultaneously within a single shared-parameter model, enabling the teacher to learn how to use PI while mitigating distribution shift during transfer. We show that π-Distill effectively distills frontier agents using action-only privileged information, matching or outperforming industry-standard pipelines that assume access to full Chain-of-Thought supervision, across multiple agentic benchmarks, models, and forms of PI. We complement these results with extensive analyses that characterize which factors enable effective learning with PI.
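To make the joint training concrete, the sketch below shows one way a single shared-parameter model could be optimized as both a PI-conditioned teacher and an unconditioned student in the same update. This is a minimal illustration under assumptions, not the paper's implementation: the HuggingFace-style `model(...).logits` interface, the tensors `pi_tokens` and `traj_tokens`, and the mixing weight `alpha` are all hypothetical names introduced here.

```python
# Minimal sketch (assumed, not the paper's code): one joint update in which
# the same parameters serve as the PI-conditioned teacher and the
# unconditioned student.
import torch
import torch.nn.functional as F

def joint_pi_distill_step(model, pi_tokens, traj_tokens, alpha=0.5):
    """model: a causal LM with a HuggingFace-style `.logits` output (assumed).
    pi_tokens: privileged-information prefix, shape (batch, P).
    traj_tokens: action trajectory to imitate, shape (batch, T).
    alpha: hypothetical weight balancing the two roles."""
    vocab = model.config.vocab_size
    targets = traj_tokens[:, 1:].reshape(-1)  # next-token targets, shared by both roles

    # Teacher pass: predict the trajectory *given* the privileged prefix.
    teacher_in = torch.cat([pi_tokens, traj_tokens], dim=1)
    teacher_logits = model(teacher_in).logits[:, pi_tokens.size(1):-1]
    teacher_loss = F.cross_entropy(teacher_logits.reshape(-1, vocab), targets)

    # Student pass: predict the same trajectory with no privileged context.
    student_logits = model(traj_tokens).logits[:, :-1]
    student_loss = F.cross_entropy(student_logits.reshape(-1, vocab), targets)

    # Because the parameters are shared, gradients from both roles arrive in
    # the same step; this coupling is what the abstract credits with
    # mitigating distribution shift when transferring to the PI-free student.
    loss = alpha * teacher_loss + (1.0 - alpha) * student_loss
    loss.backward()
    return loss.item()
```

A simple alternating schedule (teacher batch, then student batch) would also fit the abstract's description; the single blended loss above is just one plausible instantiation of training both roles "simultaneously".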