When to Think, When to Speak: Learning Disclosure Policies for Large Language Model Reasoning
Abstract
Standard Chain-of-Thought (CoT) reasoning trades responsiveness for reliability: in a single user-visible token stream, more deliberation delays meaningful output, imposing a ``silence tax.'' We introduce \emph{Side-by-Side (SxS) Interleaved Reasoning}, a training framework that makes \emph{disclosure timing} a controllable decision within standard autoregressive generation. By interleaving \emph{supported} partial answers with continued private reasoning in the same context, SxS avoids monolithic reasoning preambles. We treat disclosure as a policy learning problem and train models via a multi-stage pipeline: supervised fine-tuning (SFT) on entailment-aligned interleaved trajectories, followed by reinforcement learning (RL) to recover reasoning performance and optimize for accuracy. Without architectural changes, SxS improves the accuracy--latency trade-off across two Qwen3 architectures and scales (the MoE \textbf{Qwen3-30B-A3B} and the dense \textbf{Qwen3-4B}) on both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, reducing \emph{substantive content latency} by 18\% and improving a proxy for perceived wait time by 49\%, yielding more responsive interactions without compromising answer quality.
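For concreteness, the Python sketch below illustrates one way an SxS-style interleaved trajectory could be serialized for SFT: private reasoning segments alternate with disclosed partial answers in a single token stream, and a partial answer is disclosed only when the reasoning so far supports it. The \texttt{<think>} delimiters, the \texttt{Segment} structure, and the substring entailment stand-in are our illustrative assumptions, not the paper's implementation.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    private: bool  # True = hidden reasoning, False = user-visible disclosure

def render_trajectory(segments: list[Segment]) -> str:
    """Serialize an interleaved trajectory into one token stream.

    Private reasoning is wrapped in (assumed) <think> tags; disclosures
    are emitted in the clear, so the first supported partial answer
    appears early instead of after a monolithic reasoning preamble.
    """
    parts = []
    for seg in segments:
        if seg.private:
            parts.append("<think>" + seg.text + "</think>")
        else:
            parts.append(seg.text)
    return "".join(parts)

def is_supported(claim: str, reasoning_so_far: str) -> bool:
    """Stand-in for the entailment alignment used to build SFT data:
    disclose a partial answer only if the reasoning so far entails it.
    A real pipeline would call an NLI model or verifier here."""
    return claim in reasoning_so_far  # illustrative heuristic only

# Example: a disclosure interleaved between two private reasoning segments.
traj = [
    Segment("Factor the quadratic; the roots are 3 and 5.", private=True),
    Segment("Partial answer: the roots are x = 3 and x = 5.", private=False),
    Segment("Check both roots in the original equation.", private=True),
    Segment("Final answer: x in {3, 5}.", private=False),
]
print(render_trajectory(traj))
\end{verbatim}
A disclosure policy over such trajectories then reduces to deciding, at each segment boundary, whether the next segment is emitted privately or in the clear.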