Batched Contextual Reinforcement
Abstract
Chain-of-Thought (CoT) reasoning has substantially improved the performance of Large Language Models (LLMs), but it incurs high computational cost through excessive token consumption. Existing approaches to reducing inference latency, such as explicit length penalties, often degrade reasoning quality by truncating necessary logical steps. In this work, we introduce a novel, SFT-free reinforcement learning framework that induces emergent token efficiency without explicit length constraints.

We propose Batched Contextual Reinforcement (BCR), a training paradigm in which the model is prompted to solve multiple reasoning tasks within a single context window and is rewarded by independent instance-level accuracy. This formulation introduces an implicit information bottleneck: to maximize cumulative reward within the context capacity, the model must eliminate syntactic redundancy and focus attention on the semantic core of each reasoning path.

Empirically, our method yields a marked shift in the efficiency-accuracy Pareto frontier. Using a 1.5B-parameter model, JustRL-Deepseek-1.5B, we achieve a 39.8--62.6% reduction in token usage across five mathematical reasoning benchmarks while maintaining or improving accuracy on four of them. Most notably, on AMC23 and Minerva we observe a ``free lunch'' phenomenon: accuracy improves by +2.5% and +5.1%, respectively, despite using approximately half the tokens. Extensive ablation studies confirm that batched training acts as a superior form of implicit regularization that reduces hallucinations and sharpens attention. Our findings indicate that LLMs possess latent, high-density reasoning modes that can be unlocked through purely structural incentives in RL.
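The reward structure described above can be illustrated with a minimal sketch. All function and variable names here are hypothetical (the paper's abstract does not specify an implementation): the key property is that each instance in the shared context contributes its own accuracy reward, with no explicit length term, so efficiency pressure arises only from the fixed context budget.

```python
# Hypothetical sketch of BCR's reward signal (illustrative names, not the
# paper's code): several problems share one context window, and each solved
# instance contributes an independent accuracy reward.

def batched_reward(completions, gold_answers, extract_answer):
    """Sum of independent instance-level accuracy rewards for one batch.

    completions:    per-instance answer strings parsed from the shared context
    gold_answers:   reference answers for the batched problems
    extract_answer: parser mapping a raw completion to a canonical answer
    """
    rewards = [
        1.0 if extract_answer(c) == a else 0.0
        for c, a in zip(completions, gold_answers)
    ]
    # No explicit length penalty: fitting more correct solutions into the
    # fixed context capacity is the only incentive toward brevity.
    return sum(rewards)
```

For example, with a trivial whitespace-stripping parser, a batch with two of three instances answered correctly yields a reward of 2.0.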