Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Abstract
Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. Surprisingly, we find that most models that succeed in math fail to transfer their gains to other domains. To study this phenomenon rigorously, we conduct controlled experiments on math-only data with two widely used methods, Reinforcement Learning (RL) and Supervised Fine-tuning (SFT), together with detailed ablations. Beyond the observation that RL-tuned models transfer better than SFT-tuned models, we identify on-policy fine-tuning as the key mechanism underlying cross-domain transfer, regardless of whether the training signal comes from RL or supervised learning. Latent-space representation and token-space distribution shift analyses reveal that off-policy SFT induces substantial representation and output drift, while on-policy RL preserves general-domain structure. Our results suggest a need to rethink current post-training recipes, in particular the reliance on off-policy, SFT-distilled data to advance reasoning models.
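The two drift analyses the abstract refers to can be illustrated with a small sketch. The Python snippet below is a minimal illustration, not the paper's exact protocol: it assumes the Hugging Face transformers library, uses placeholder checkpoint names for a base model and a reasoning-tuned variant sharing the same tokenizer, and measures latent-space drift as the cosine distance between last-layer hidden states and token-space drift as the per-position KL divergence between next-token distributions on a general-domain prompt.

```python
# Minimal sketch of the two drift measurements (assumptions: transformers
# installed, BASE and TUNED are placeholder checkpoint ids that share a
# tokenizer; this is illustrative, not the paper's exact analysis pipeline).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "base-model-id", "reasoning-tuned-model-id"  # hypothetical ids

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, output_hidden_states=True)

# A general-domain (non-math) query, to probe whether math tuning disturbed
# representations outside the training distribution.
inputs = tok("Explain why the sky appears blue.", return_tensors="pt")

with torch.no_grad():
    out_b = base(**inputs)
    out_t = tuned(**inputs)

# Latent-space drift: cosine distance between last-layer hidden states,
# averaged over token positions (higher = more representation drift).
h_b, h_t = out_b.hidden_states[-1][0], out_t.hidden_states[-1][0]
rep_drift = 1 - F.cosine_similarity(h_b, h_t, dim=-1).mean()

# Token-space drift: KL(base || tuned) between next-token distributions,
# summed over the vocabulary and averaged over positions.
logp_b = F.log_softmax(out_b.logits[0], dim=-1)
logp_t = F.log_softmax(out_t.logits[0], dim=-1)
kl = F.kl_div(logp_t, logp_b, log_target=True, reduction="none").sum(-1).mean()

print(f"representation drift: {rep_drift:.4f}, token-space KL: {kl:.4f}")
```

Under the abstract's claim, a sketch like this would show larger drift values for an off-policy SFT-distilled checkpoint than for an on-policy RL-tuned one on such general-domain prompts.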