SignalBench: Comparing Dense Feedback Methods for Long-Horizon Agents
Abstract
Dense supervision is an essential ingredient when scaling long-horizon agent training, yet it remains unclear how well current methods recover the value structure they purport to estimate. A growing line of work -- intrinsic signals, self-distillation, embedding similarities -- can be unified as dense feedback methods that score intermediate states and actions. However, prior work evaluates dense supervision by measuring downstream model performance, an approach that is computationally expensive and entangles the quality of the signal with engineering choices made during training. To address this gap, we propose SignalBench: a training-free testbed designed to isolate and evaluate dense supervision methods for agentic systems. SignalBench allows us to comprehensively analyse how well the evaluated signals correlate with reference-policy value labels across 21 dense feedback methods, with over 1.2K evaluations across four diverse environments and six open-weight backbone models. We find that simple prompting baselines consistently outperform state-of-the-art dense feedback methods, and that methods from the same methodological family cluster together in performance. These core findings are surprisingly robust: they hold across model sizes, environments, observation modalities, and estimated signal types. Overall, SignalBench offers a valuable and inexpensive testbed for evaluating dense feedback methods across model and task families.
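
To make the evaluation idea concrete, the following is a minimal sketch of scoring one dense signal against reference value labels without any training. It assumes Spearman rank correlation as the agreement measure, which the abstract does not specify; the function names and the per-step numbers are hypothetical and purely illustrative.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def spearman(signal: Sequence[float], reference: Sequence[float]) -> float:
    """Spearman rank correlation between a dense signal and reference values."""
    if len(signal) != len(reference):
        raise ValueError("signal and reference must have the same length")
    return pearson(average_ranks(signal), average_ranks(reference))


# Hypothetical per-step data for a single trajectory (not from the paper).
reference_values = [0.10, 0.25, 0.40, 0.70, 0.90]  # reference-policy value labels
dense_signal = [0.05, 0.30, 0.20, 0.60, 0.95]      # scores from one feedback method

rho = spearman(dense_signal, reference_values)
print(f"rank correlation: {rho:.2f}")
```

A rank-based measure is used in this sketch because dense signals from different method families need not share a scale with the value labels; only the ordering of steps is compared.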