Escaping the Mode: Multi-Answer Reinforcement Learning in LMs
Abstract
Large language models (LMs) are typically post-trained via reinforcement learning (RL) to produce a single best answer per query, implicitly optimizing for modal correctness. While effective for benchmark accuracy, this approach is suboptimal for many applications of interest, such as medical diagnosis, which would benefit from models that generate a set of plausible answers, ideally paired with uncertainty estimates. This paper describes a multi-answer RL approach that enables LMs to do this: we modify the RL objective to train models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. We instantiate this approach as Multi-Answer Reinforcement Learning with Verifiable Rewards (Multi-RLVR), which generalizes ordinary RLVR to the multi-answer case with a set-level reward. We further extend it to Multi-Answer Reinforcement Learning with Calibrated Rewards (Multi-RLCR), which adds a set-level Brier-score-based calibration objective so that LMs output a calibrated uncertainty estimate for each answer in the set. Multi-answer training promotes explicit representation of alternative hypotheses rather than repeated generation of the dominant mode. Across question-answering and medical diagnostic benchmarks, we observe improved diversity, recall, and set-level calibration compared to single-answer-trained baselines. Models trained with our approach are also more token-efficient, requiring fewer tokens to generate multiple answers than competing approaches. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling.
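To make the set-level objectives concrete, the sketch below shows one plausible form the rewards could take; the abstract does not give the exact formulas, so the specific choices here (exact-match verification, a binary set reward, and subtracting a mean per-answer Brier penalty) are illustrative assumptions, and the function names `multi_rlvr_reward` and `multi_rlcr_reward` are hypothetical:

```python
def multi_rlvr_reward(answers, gold):
    """Assumed set-level verifiable reward: 1.0 if the gold answer
    appears anywhere in the generated answer set, else 0.0."""
    return 1.0 if gold in answers else 0.0


def multi_rlcr_reward(answers_with_conf, gold):
    """Assumed calibrated variant: the set-level reward minus a
    set-level Brier penalty on the per-answer confidences.

    answers_with_conf: list of (answer, confidence) pairs, with
    confidence in [0, 1] for each answer being correct."""
    answers = [a for a, _ in answers_with_conf]
    correctness = multi_rlvr_reward(answers, gold)
    # Brier score: squared gap between each stated confidence and
    # the 0/1 correctness of that answer, averaged over the set.
    brier = sum(
        (p - float(a == gold)) ** 2 for a, p in answers_with_conf
    ) / len(answers_with_conf)
    return correctness - brier
```

Under this form, a set containing the gold answer with a well-calibrated confidence earns close to the full reward, while overconfident wrong answers or underconfident correct ones are penalized through the Brier term.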