D-FUSEr: Diverse Failure, Unified Success via Error-Distribution Shaping in LLM Reasoning
Abstract
Test-time scaling methods such as majority-vote aggregation and iterative refinement (e.g., self-reflection or multi-agent inference) improve reasoning performance by leveraging multiple solution samples. However, their efficacy depends not only on per-sample accuracy but, critically, on how errors are distributed across samples. When errors concentrate, (a) aggregation accuracy degrades, since the majority vote may select a shared mistake, and (b) confidence in common mistakes may suppress exploration during iterative refinement. We argue that improving correctness alone is insufficient to mitigate these issues; instead, we propose to explicitly shape error distributions to improve aggregation. First, we introduce a theoretically grounded \textbf{diverse failure reward} that incentivizes calibrated disagreement among model errors. We prove that this reward directly optimizes majority-vote accuracy: policies achieving higher reward attain higher expected majority-vote performance, and vice versa. We further show that this theoretical property extends to iterative refinement. Second, we introduce \textbf{anti-votes}, in which the model predicts the most common mistake alongside its solution, allowing probability mass on dominant errors to be explicitly reweighted. We identify conditions under which anti-votes are guaranteed to improve majority-vote accuracy. Empirically, across three model families of varying sizes and four benchmarks, we show that both approaches substantially improve majority-vote and iterative-refinement performance without degrading single-sample accuracy.
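As a toy illustration of the abstract's central claim (this sketch is not from the paper; the two answer distributions are hypothetical), the following Python simulation compares majority-vote accuracy for two policies with identical single-sample accuracy (40%) whose errors are either concentrated on one shared mistake or spread across distinct mistakes:

```python
import random
from collections import Counter

random.seed(0)

def majority_vote_accuracy(answer_dist, n_samples=5, n_trials=10_000):
    """Estimate majority-vote accuracy when n_samples i.i.d. answers are
    drawn from answer_dist (a dict mapping answer -> probability); the
    correct answer is labeled 'correct'. Ties break by first occurrence."""
    answers, probs = zip(*answer_dist.items())
    hits = 0
    for _ in range(n_trials):
        draws = random.choices(answers, weights=probs, k=n_samples)
        winner, _ = Counter(draws).most_common(1)[0]
        hits += (winner == "correct")
    return hits / n_trials

# Both hypothetical policies answer correctly 40% of the time, but their
# errors are distributed very differently across samples.
concentrated = {"correct": 0.4, "shared_mistake": 0.6}
diverse = {"correct": 0.4, "err_a": 0.2, "err_b": 0.2, "err_c": 0.2}

print(majority_vote_accuracy(concentrated))  # below single-sample accuracy
print(majority_vote_accuracy(diverse))       # above single-sample accuracy
```

With concentrated errors the shared mistake frequently wins the vote (analytically, P(Bin(5, 0.4) >= 3) ≈ 0.32 < 0.40), whereas diverse errors split their mass so the correct answer is more often the plurality, which is the effect the diverse failure reward is designed to exploit.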