Which Reasoning Traces Are Worth Generating Further? Data Curation for Training Reasoning Models
Abstract
Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective way to endow Large Language Models (LLMs) with strong reasoning abilities. However, curating a high-quality SFT dataset requires generating a large pool of long Chains of Thought (CoTs) and filtering the generated data for diversity and difficulty. Both stages rely heavily on strong reasoning models, making data curation prohibitively expensive. In this work, we show that diverse and difficult reasoning examples can be identified very early during their generation. We show that after generating as few as 100 out of 34k tokens of a reasoning trace, challenging examples can be reliably identified based on their loss at highly perturbed checkpoints of the pretrained model. We then prove that examples with similar loss trajectories, i.e., similar loss values at a few noisy, perturbed checkpoints of the pretrained model, have similar gradients. A diverse subset can then be found by sampling from clusters of loss trajectories obtained after generating 1k tokens. Our extensive experiments on fine-tuning Qwen2.5-7B and Llama3 on the m23K medical reasoning and OpenThoughts datasets confirm the effectiveness of our approach: it outperforms existing baselines by up to 2\% while generating as few as 9\% of the reasoning-trace tokens.
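To make the selection recipe concrete, below is a minimal, self-contained sketch (not the authors' released code) of the two steps the abstract describes: scoring each partially generated trace by its loss under a few noise-perturbed copies of the pretrained checkpoint, and clustering the resulting loss trajectories to sample a diverse subset. The toy model, noise scales, k-means clustering, and one-per-cluster sampling rule are all illustrative assumptions; in the paper's setting, `model` would be the pretrained LLM and `ids` the first ~100 to 1k generated tokens of each trace.

```python
import copy

import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


def perturb(model: nn.Module, sigma: float) -> nn.Module:
    """Return a deep copy of `model` with Gaussian noise added to every weight."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy


@torch.no_grad()
def loss_trajectory(model, input_ids, labels, sigmas=(0.01, 0.02, 0.05)):
    """Per-example mean token loss at several perturbed copies of the checkpoint."""
    losses = []
    for sigma in sigmas:
        logits = perturb(model, sigma)(input_ids)          # (batch, seq, vocab)
        per_token = nn.functional.cross_entropy(
            logits.flatten(0, 1), labels.flatten(), reduction="none"
        ).view(labels.shape)
        losses.append(per_token.mean(dim=1))               # mean over the prefix
    return torch.stack(losses, dim=1)                      # (batch, len(sigmas))


torch.manual_seed(0)
vocab, prefix_len, pool = 100, 64, 256                     # toy sizes, not the paper's
model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
ids = torch.randint(0, vocab, (pool, prefix_len))          # stand-in token prefixes

# Labels would normally be the next-token targets of the partial trace;
# reusing `ids` here just keeps the toy example self-contained.
traj = loss_trajectory(model, ids, ids).numpy()            # (pool, n_checkpoints)

# Cluster the loss trajectories and keep one example per cluster,
# yielding a subset that is diverse in trajectory space.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(traj)
selected = [int(np.where(km.labels_ == c)[0][0]) for c in range(8)]
print("selected example indices:", selected)
```

Because each trajectory is only a handful of scalars, scoring and clustering the full pool is cheap; the expensive part that this avoids is decoding the remaining tens of thousands of trace tokens for examples that will be filtered out anyway.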