Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Abstract
Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. Surprisingly, we find that most models that succeed in math fail to transfer their gains to other domains. To study this phenomenon rigorously, we conduct controlled experiments on math-only data with two widely used methods, Reinforcement Learning (RL) and Supervised Fine-tuning (SFT), together with detailed ablations. Beyond the observation that RL-tuned models transfer better than SFT-tuned models, we identify on-policy fine-tuning as the key mechanism underlying cross-domain transfer, regardless of whether the training signal comes from RL or supervised learning. Latent-space representation and token-space distribution shift analyses reveal that off-policy SFT induces substantial representation and output drift, while on-policy RL preserves general-domain structure. Our results suggest a need to rethink current post-training recipes, in particular the reliance on off-policy, SFT-distilled data to advance reasoning models.
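The two drift analyses the abstract refers to can be illustrated with a small sketch. The Python snippet below is a minimal illustration, not the paper's exact protocol: it assumes the Hugging Face transformers library, uses placeholder checkpoint names for a base model and a reasoning-tuned variant sharing the same tokenizer, and measures latent-space drift as the cosine distance between last-layer hidden states and token-space drift as the per-position KL divergence between next-token distributions on a general-domain prompt.

```python
# Minimal sketch of the two drift measurements (assumptions: transformers
# installed, BASE and TUNED are placeholder checkpoint ids that share a
# tokenizer; this is illustrative, not the paper's exact analysis pipeline).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "base-model-id", "reasoning-tuned-model-id"  # hypothetical ids

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, output_hidden_states=True)

# A general-domain (non-math) query, to probe whether math tuning disturbed
# representations outside the training distribution.
inputs = tok("Explain why the sky appears blue.", return_tensors="pt")

with torch.no_grad():
    out_b = base(**inputs)
    out_t = tuned(**inputs)

# Latent-space drift: cosine distance between last-layer hidden states,
# averaged over token positions (higher = more representation drift).
h_b, h_t = out_b.hidden_states[-1][0], out_t.hidden_states[-1][0]
rep_drift = 1 - F.cosine_similarity(h_b, h_t, dim=-1).mean()

# Token-space drift: KL(base || tuned) between next-token distributions,
# summed over the vocabulary and averaged over positions.
logp_b = F.log_softmax(out_b.logits[0], dim=-1)
logp_t = F.log_softmax(out_t.logits[0], dim=-1)
kl = F.kl_div(logp_t, logp_b, log_target=True, reduction="none").sum(-1).mean()

print(f"representation drift: {rep_drift:.4f}, token-space KL: {kl:.4f}")
```

Under the abstract's claim, a sketch like this would show larger drift values for an off-policy SFT-distilled checkpoint than for an on-policy RL-tuned one on such general-domain prompts.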