Calibrating Conservatism for Scalable Oversight
William Overman ⋅ Mohsen Bayati
Abstract
Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed human capabilities? While scalable oversight is widely studied, existing approaches often rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty that measures deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: when multiple oversight signals register concern, the agent defers. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target $\alpha$ with finite-time bounds and no distributional assumptions. Experiments on SWE-bench demonstrate that weaker overseers successfully constrain an adversarially misaligned stronger agent. Similarly, on MACHIAVELLI, CCO achieves substantial reductions in ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets. Our work demonstrates that combining penalty-based conservatism with online calibration yields practical oversight with statistical guarantees suited for agentic deployment.
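The online calibration the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation; it is a toy adaptive update in the spirit of Conformal Decision Theory, with a hypothetical scalar `risk` standing in for the aggregated oversight penalty and made-up constants throughout. The threshold tightens after each violation and relaxes otherwise, so the empirical violation rate tracks the target `alpha` with a finite-time bound that follows from telescoping the update.

```python
import random

def calibrated_threshold_demo(alpha=0.1, eta=0.05, steps=5000, seed=0):
    """Toy online calibration in the spirit of Conformal Decision Theory.

    lam is an action threshold: the agent acts only when the aggregated
    penalty ("risk") falls below it, and defers otherwise. Each violation
    tightens the threshold (lowers lam); each safe step relaxes it, so the
    running violation rate is driven toward the target alpha.
    All numeric choices here are illustrative, not from the paper.
    """
    rng = random.Random(seed)
    lam = 0.5                    # hypothetical initial conservatism threshold
    violations = 0
    for _ in range(steps):
        risk = rng.random()      # stand-in for the aggregated penalty score
        acted = risk < lam       # agent defers whenever the penalty is high
        # in this toy world, acting on a truly bad step counts as a violation
        err = 1 if (acted and risk > 0.8) else 0
        violations += err
        # conformal-style update: tighten after errors, relax after safe steps
        lam += eta * (alpha - err)
    return violations / steps

rate = calibrated_threshold_demo()
```

Summing the update telescopes to `avg_err = alpha + (lam_0 - lam_T) / (eta * steps)`, so as long as the threshold stays bounded, the empirical violation rate lands within a vanishing margin of `alpha`, mirroring the distribution-free, finite-time flavor of the guarantee the abstract claims.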