From Reasoning Traces to Reusable Modules: Reinforcement Learning for Compositional Generalization in Language Model Reasoning
Abstract
Reinforcement learning (RL) has emerged as a key mechanism for transforming large language models (LLMs) into robust reasoners. While supervised fine-tuning (SFT) often confines models to the distribution of observed reasoning traces, RL post-training significantly improves performance on out-of-distribution (OOD) tasks that require unfamiliar recombinations of familiar steps. We argue that this improvement is driven by compositional generalization, which we formalize through a Hierarchical Latent Selection Model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, comprising both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). We show theoretically that the exploratory nature of RL provides sufficient coverage to identify this latent structure and enable compositional generalization, and we validate the theory with controlled experiments. Our results demonstrate that RL can extract atomic modules from compound traces and recombine them to solve new configurations. Moreover, we find that training on compound traces can yield stronger generalization than training on isolated atomic modules. Finally, we investigate the interplay between SFT and RL and identify an effective protocol in which SFT ensures coverage of all atomic modules, while RL explores novel compositions beyond the SFT support.
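As a minimal illustration of the generative process sketched in the abstract, the following toy Python instantiation samples a compound trace from a cascade of discrete latent selectors; the skill set SKILLS, the sample_trace helper, and the integer domain are illustrative assumptions, not the formal model developed in the paper.

    import random

    # Hypothetical atomic skills: each is a local operation on an integer.
    SKILLS = {
        "inc": lambda x: x + 1,   # increment
        "dbl": lambda x: 2 * x,   # double
        "neg": lambda x: -x,      # negate
    }

    def sample_trace(x0, depth=3, rng=random):
        """Sample a compound reasoning trace from the toy generative model.

        At every step, two discrete latent selection variables fire: one
        chooses a skill (local operation) and one chooses a route, i.e.
        which earlier intermediate value the skill reads. The trace
        records the latents alongside the values they produce.
        """
        values = [x0]   # intermediate results; index 0 is the problem input
        trace = []
        for _ in range(depth):
            skill = rng.choice(list(SKILLS))    # latent skill selector
            src = rng.randrange(len(values))    # latent routing selector
            out = SKILLS[skill](values[src])
            trace.append((skill, src, out))
            values.append(out)
        return trace

    if __name__ == "__main__":
        # A depth-4 compound trace over input 3, e.g. [('dbl', 0, 6), ...]
        for step in sample_trace(x0=3, depth=4):
            print(step)

Under this view, an OOD configuration is simply a (skill, route) sequence never seen during training, and a learner generalizes compositionally exactly when it has identified the individual selectors rather than memorizing whole traces.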