Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
Abstract
On-policy reinforcement learning methods such as GRPO suffer from \emph{mode collapse}: once a high-reward solution is discovered, they concentrate probability mass on it and stop exploring alternative strategies, sharply reducing solution diversity. We show this stems from the mode-seeking behavior of reverse KL minimization, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (\textbf{D}istribution-\textbf{M}atching \textbf{P}olicy \textbf{O}ptimization), which prevents mode collapse through a principled approximation of forward KL minimization. DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy with this target. This yields mode-covering behavior without requiring samples from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, making it an ideal testbed for evaluating exploration. DMPO achieves a Quality Ratio of 43.9\% on text-based NP-Bench (vs. 40.1\% for GRPO) and 43.1\% on vision-based NP-Bench (vs. 38.4\%), relative improvements of 9\% and 12\%, respectively. These gains generalize to mathematical reasoning (+2.0\%) and out-of-domain tasks (+2.3\%), showing that diversity-preserving training enhances general reasoning across modalities. Our work establishes distribution matching as a practical, principled way to prevent mode collapse in on-policy RL, with consistent quality gains that reflect sustained exploration across diverse reasoning tasks.
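As a concrete reading of the group-level matching step described above (a minimal sketch in our own notation; the exact normalization, any temperature, and any baseline correction are details of the full method, not fixed by this summary), suppose $G$ trajectories $\tau_1,\dots,\tau_G$ are sampled for a prompt $x$ with non-negative rewards $r_1,\dots,r_G$. The reward-proportional target is $q_i = r_i / \sum_{j=1}^{G} r_j$, and distribution matching minimizes the forward KL divergence from $q$ to the policy's renormalized distribution over the sampled group:
\begin{equation*}
\mathcal{L}_{\mathrm{DMPO}}(\theta) \;=\; \mathrm{KL}\big(q \,\|\, \tilde{\pi}_\theta\big) \;=\; \sum_{i=1}^{G} q_i \log \frac{q_i}{\tilde{\pi}_\theta(\tau_i \mid x)}, \qquad \tilde{\pi}_\theta(\tau_i \mid x) \;=\; \frac{\pi_\theta(\tau_i \mid x)}{\sum_{j=1}^{G} \pi_\theta(\tau_j \mid x)}.
\end{equation*}
Because $q$ places mass on every rewarded trajectory in the group, minimizing this forward KL penalizes the policy for ignoring any of them, giving the mode-covering behavior contrasted above with reverse-KL mode seeking.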