PRISM: Demystifying Retention and Interaction in Mid-Training
Bharat Runwal ⋅ Ashish Agrawal ⋅ Anurag Roy ⋅ Rameswar Panda
Abstract
Mid-training is increasingly used to improve the reasoning capabilities of large language models (LLMs), yet its design choices and its interaction with evaluation and reinforcement learning (RL) remain poorly understood. Prior work often focuses on narrow domain gains, overlooking retention of general abilities, long-context performance, and RL compatibility. We present $\textbf{PRISM}$ (Demystifying Retention and Interaction in Mid-Training), a holistic empirical study that analyzes mid-training design choices, identifies what to evaluate, and examines how domain mixtures and training stages interact across model families. Experiments on Granite-3.3 8B, LLaMA-3.1 8B, and Mistral-7B/24B base models show that a relatively small, high-quality mid-training phase of $\textbf{$\sim$27B}$ tokens acts as a critical stabilizing stage for reasoning. Across models, PRISM yields consistent gains of $\textbf{$\sim$6–10}$ points on coding benchmarks and $\textbf{$\sim$17–30}$ points on mathematical reasoning benchmarks while preserving general performance. RL applied on top of PRISM-mid-trained models produces stable, monotonic improvements, adding a further $\textbf{$\sim$3–8}$ points on coding and math tasks such as LiveCodeBench, Codeforces, AIME, and MATH500, and $\textbf{$\sim$17–20}$ points on science (GPQA-Diamond), whereas RL applied directly to base models is substantially less effective. Our results demonstrate that retention-aware mid-training is a necessary intermediate step for reliable reasoning enhancement and RL scaling, and they provide practical guidance for designing robust mid-training pipelines for modern LLMs.