The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Reasoning
Abstract
Knowledge distillation from powerful reasoning models underpins the development of Small Language Models (SLMs). A prevailing assumption in this paradigm is that training data of higher perceived quality, typically defined by rigorous logic and higher reward scores, monotonically improves downstream performance. In this paper, we identify a counterintuitive \textbf{Quality-Utility Paradox} across diverse model families (Qwen2.5, LLaMA-3, DeepSeek): data refined by a superior Synthesis Oracle consistently underperforms the SLM's self-generated rejection sampling fine-tuning (RFT) data, despite achieving higher reward scores. We argue that Oracle models introduce an intrinsic representation bias that shifts the training data into a distribution incompatible with the target SLM, forcing the SLM to spend its limited capacity on stylistic imitation rather than logical reasoning. We therefore employ a \textbf{Style-Aligned Refinement} strategy that corrects logical errors while strictly preserving the SLM's native syntax. Our experiments demonstrate that maintaining native syntax effectively mitigates syntactic adaptation costs, enabling distilled models to match or even surpass self-generated baselines. These findings underscore the necessity of syntactic alignment and advocate for model-aware reward designs that prioritize distributional compatibility alongside logical rigor. Our datasets and code will be made publicly available.
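For concreteness, the sketch below illustrates the self-generated RFT baseline the abstract refers to: candidate reasoning traces are sampled from the target SLM itself, and only traces reaching a verified-correct answer are kept, so the retained data stays inside the SLM's native syntactic distribution. This is a minimal sketch; the `generate` and `is_correct` callables and the `Problem` type are hypothetical placeholders, not the paper's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Problem:
    prompt: str   # the question posed to the SLM
    answer: str   # gold final answer used for verification

def build_rft_dataset(
    generate: Callable[[str], str],          # samples one reasoning trace from the SLM
    is_correct: Callable[[str, str], bool],  # checks a trace's answer against the gold one
    problems: List[Problem],
    k: int = 8,                              # candidates sampled per problem
) -> List[Tuple[str, str]]:
    """Rejection sampling: keep only self-generated traces with a correct answer."""
    dataset: List[Tuple[str, str]] = []
    for p in problems:
        # Sample k candidates from the target SLM itself, so every kept
        # trace remains in the SLM's native syntactic distribution.
        candidates = [generate(p.prompt) for _ in range(k)]
        # Rejection step: discard traces whose final answer is wrong.
        dataset.extend((p.prompt, c) for c in candidates if is_correct(c, p.answer))
    return dataset
```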