A Few Teacher Steps Go a Long Way: Cost-Efficient On-Policy Data Augmentation for Agent Post-Training
Abstract
Modern training pipelines for language model agents begin with a supervised fine-tuning stage in which a small student imitates a costly teacher. Recent work mitigates the covariate shift of pure imitation learning by collecting teacher feedback at states the student itself reaches, with a prevailing trend toward elaborate filters on the teacher's responses. We frame this design choice as a budget-allocation problem and compare three constructions of supervised training data: short unfiltered teacher continuations at learner-induced states; full teacher trajectories filtered for success (the rejection-sampling step in recent on-policy expert-correction work); and those filtered trajectories further restricted to tasks the student cannot solve on its own. Across three agentic benchmarks (HotpotQA, ALFWorld, and Terminal-Bench-Dev), short unfiltered teacher continuations outperform pure behavioral cloning at matched supervision budgets, and on HotpotQA they also match or exceed the filtered alternatives. On Terminal-Bench-Dev, this augmentation, using one-tenth of the SFT corpus and no reinforcement-learning stage, matches the OpenThoughts-Agent baseline, which uses the full corpus together with reinforcement learning. The same supervised-learning checkpoints further yield faster early-stage gains under subsequent reinforcement learning. Taken together, our findings suggest that spending teacher budget on broader learner-state coverage can be more effective than spending it on longer or more heavily filtered teacher completions.
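As a rough illustration of the three data constructions compared above, the following Python sketch builds each training set from toy records. Every name here (`Record`, `short_unfiltered`, `success_filtered`, `hard_task_filtered`, the truncation length `k`, and the success flags) is hypothetical and chosen for illustration, not taken from our implementation. The sketch also assumes, for simplicity, that a single teacher trajectory per task starts at a learner-induced state and that its first `k` steps serve as the short continuation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Step = Tuple[str, str]  # (observation, teacher action): one supervised example


@dataclass
class Record:
    """Toy per-task record; all fields are hypothetical stand-ins."""
    task_id: str
    teacher_traj: List[Step]  # teacher steps starting at a learner-induced state
    teacher_success: bool     # did the teacher's trajectory solve the task?
    student_success: bool     # does the student solve the task on its own?


def short_unfiltered(records: List[Record], k: int) -> List[Step]:
    """Keep only the first k teacher steps per record, with no filtering."""
    return [step for r in records for step in r.teacher_traj[:k]]


def success_filtered(records: List[Record]) -> List[Step]:
    """Keep full teacher trajectories, but only those that succeeded."""
    return [step for r in records if r.teacher_success
            for step in r.teacher_traj]


def hard_task_filtered(records: List[Record]) -> List[Step]:
    """Success-filtered trajectories, restricted to tasks the student fails."""
    return [step for r in records
            if r.teacher_success and not r.student_success
            for step in r.teacher_traj]


# Fabricated toy data, purely to exercise the functions above.
records = [
    Record("t1", [("s1", "a1"), ("s2", "a2"), ("s3", "a3")], True, True),
    Record("t2", [("s4", "a4"), ("s5", "a5")], False, False),
    Record("t3", [("s6", "a6"), ("s7", "a7"), ("s8", "a8"), ("s9", "a9")],
           True, False),
]

short_set = short_unfiltered(records, k=1)
filtered_set = success_filtered(records)
hard_set = hard_task_filtered(records)
```

In this toy setting, the short unfiltered set touches all three tasks using three teacher steps, while the success-filtered set spends seven teacher steps on two tasks. This is only meant to make the budget trade-off concrete; it says nothing about downstream accuracy.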