A Diagnostic Study of Multi-Agent LLMs for Real-World Debates
Abstract
Multi-agent LLM debates are increasingly deployed in domains such as policy analysis and city planning, where no objective ground truth exists. Despite this, debate quality is typically evaluated with outcome-based proxies such as LLM-as-judge scores, which provide little insight into whether meaningful deliberation has occurred. Moreover, consensus and majority vote are treated as ideal goals without analysis of the interaction dynamics that produce them. In this work, we introduce a diagnostic evaluation framework that assesses debate quality through both outcome and process. Grounded in deliberative theory, our framework defines four interpretable process-level metrics capturing engagement, responsiveness, influence asymmetry, and balance, and two outcome-based metrics capturing stability and agent utility. Across both objective benchmarks and real-world domains, we find that process-level diagnostics are consistently more informative than commonly used outcome-based proxies: they better reflect correctness when ground truth exists, align more closely with human judgments of deliberative quality when it does not, and reveal interaction failures that outcome-only measures miss. These results demonstrate that process-level diagnostics are necessary for reliable evaluation of multi-agent debates and provide a principled foundation for analyzing and designing deliberative LLM systems.