Collaborative Disagreement Resolution for Scalable Oversight
Abstract
Debate, in which AI agents argue opposing positions before a judge, has emerged as a key approach to scalable oversight. However, debate faces a fundamental tension: models are incentivized to persuade the judge, which may not always align with epistemic honesty. In this work, we propose an alternative paradigm: Disagreement Resolution, which reframes the interaction mechanism from adversarial debate to collaborative truth-seeking. Drawing on principles from human mediation and conflict resolution, where mediators facilitate dialogue to help disputing parties reach consensus rather than adjudicating between them, we design an automated pipeline that adapts these strategies to AI oversight. Unlike standard debate, where models argue for fixed positions, our pipeline directs models to collaboratively identify points of disagreement, examine the evidence for conflicting claims, and either converge toward consensus or isolate the specific ``crux'' of their disagreement. We find that Disagreement Resolution consistently helps non-expert models identify the truth, achieving 62.1\% judging accuracy compared to 49.2\% for standard debate. Our results provide encouraging empirical evidence for shifting scalable oversight from adversarial persuasion to collaborative truth-seeking.
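To make the pipeline concrete, the Python sketch below shows one possible realization of its core loop. The Agent interface, prompt wording, and the resolve/Resolution names are illustrative assumptions for exposition, not a specification of the actual system.

\begin{verbatim}
# Sketch (under the assumptions above) of the disagreement-resolution
# loop: name the disagreement, exchange evidence, revise, and either
# converge on a consensus answer or isolate the crux for the judge.
from dataclasses import dataclass, field
from typing import Protocol

class Agent(Protocol):
    # Assumed interface: any chat-model client could implement this.
    def respond(self, prompt: str) -> str: ...

@dataclass
class Resolution:
    status: str                      # "consensus" or "crux"
    answer: str | None               # agreed answer, if any
    crux: str | None                 # remaining disagreement, if any
    transcript: list[str] = field(default_factory=list)

def resolve(a: Agent, b: Agent, question: str, max_rounds: int = 4) -> Resolution:
    ans_a = a.respond(f"Answer the question: {question}")
    ans_b = b.respond(f"Answer the question: {question}")
    transcript = [ans_a, ans_b]
    crux = ""
    for _ in range(max_rounds):
        # Step 1: name the specific claim still in dispute (or AGREED).
        crux = a.respond(
            f"Given answers:\nA: {ans_a}\nB: {ans_b}\n"
            "State the specific claim you two disagree on, or reply AGREED."
        )
        transcript.append(crux)
        if "AGREED" in crux:
            return Resolution("consensus", ans_a, None, transcript)
        # Step 2: each agent presents its evidence on the disputed claim.
        ev_a = a.respond(f"Give your strongest evidence on: {crux}")
        ev_b = b.respond(f"Give your strongest evidence on: {crux}")
        transcript += [ev_a, ev_b]
        # Step 3: both agents revise in light of the pooled evidence.
        ans_a = a.respond(f"Revise your answer given evidence:\n{ev_a}\n{ev_b}")
        ans_b = b.respond(f"Revise your answer given evidence:\n{ev_a}\n{ev_b}")
        transcript += [ans_a, ans_b]
    # No consensus within budget: hand the judge the isolated crux.
    return Resolution("crux", None, crux, transcript)
\end{verbatim}

In this framing, the judge receives either a consensus answer or a single isolated crux, rather than two adversarial closing arguments.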