Evaluator Failure Modes in Agentic Uncertainty Quantification
Suresh Raghu ⋅ Satwik Pandey ⋅ Shashwat Pandey
Abstract
Standard agentic UQ evaluations can hide trace-level failure modes. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized Trajectory Brier evaluate rankings, binwise calibration, or collapsed trajectory summaries, but none strictly elicit the prefix-conditioned success-probability process $q_t=\mathbb{P}^{\pi}(Y=1\mid\mathcal{H}_t)$. The result is a practical diagnostic failure: a confidence trace can appear acceptable under standard metrics while being badly mis-scaled for deferral, reflection, human handoff, or cost-weighted decisions. We characterize this failure mode theoretically and empirically. Theoretically, we show that Trajectory ECE is resolution-blind and that scalarized Trajectory Brier under common aggregators is not strictly proper for the trace. Empirically, on Tau2-Bench, Platt recalibration changes AUROC by only $\Delta/\mathrm{SE}\approx 0.3$ while changing a strictly proper trajectory score by $\Delta/\mathrm{SE}\approx 43$; on WebShop, complete-only evaluation drops 47.08% of the assumption-valid working sample, the dropped trajectories are roughly $3\times$ longer, and censored-aware scoring changes the reported score. As a fix, we introduce the Trajectory Proper Score (TPS), a strictly proper trajectory-level evaluator built from any strictly proper binary score and positive trajectory weights, with a conditional-projection extension for administratively censored prefixes. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that evaluator choice can shift benchmark conclusions by margins far larger than bootstrap uncertainty.
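As a toy illustration of why a collapsed trajectory summary can fail to elicit the trace, the sketch below compares a weighted per-step score with a mean-aggregated Brier score. It is not the paper's definition of TPS: the normalized weighted-sum form, the choice of mean as the aggregator, and the names `weighted_trace_score` and `mean_aggregated_brier` are assumptions made for illustration only.

```python
from typing import Callable, Sequence


def brier(p: float, y: int) -> float:
    """Binary Brier score (lower is better); a strictly proper binary score."""
    return (p - y) ** 2


def weighted_trace_score(
    trace: Sequence[float],
    y: int,
    weights: Sequence[float],
    score: Callable[[float, int], float] = brier,
) -> float:
    """Assumed form: score every prefix confidence against the final outcome
    and combine the per-step scores with strictly positive weights."""
    if len(trace) != len(weights):
        raise ValueError("trace and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("all trajectory weights must be strictly positive")
    total = sum(w * score(p, y) for p, w in zip(trace, weights))
    return total / sum(weights)


def mean_aggregated_brier(trace: Sequence[float], y: int) -> float:
    """Scalarized variant: collapse the trace to its mean, then score once."""
    return brier(sum(trace) / len(trace), y)


# Two hypothetical confidence traces with the same mean but different shapes.
flat = [0.5, 0.5]
swing = [0.1, 0.9]
outcome = 1  # the trajectory eventually succeeds

# The scalarized score sees only the mean, so it cannot separate the traces.
scalar_flat = mean_aggregated_brier(flat, outcome)
scalar_swing = mean_aggregated_brier(swing, outcome)

# The per-step score penalizes the badly scaled early confidence in `swing`.
per_step_flat = weighted_trace_score(flat, outcome, [1.0, 1.0])
per_step_swing = weighted_trace_score(swing, outcome, [1.0, 1.0])
```

Under these assumptions the two traces receive the same scalarized score, while the per-step score separates them. Because distinct traces can tie under the collapsed summary, that summary cannot be strictly proper for the trace itself, which is the kind of failure the abstract describes.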