Formally Exploring Visual Anomaly Detection Evaluation Metrics
Abstract
Inaccurate Visual Anomaly Detection (VAD) can lead to critical failures in safety-sensitive domains such as autonomous navigation and industrial surveillance. With the rapid proliferation of VAD algorithms, their reliable evaluation has become both increasingly important and increasingly challenging. Commonly used evaluation metrics often fail to capture practically relevant aspects of model behavior, yielding inconsistent or misleading assessments because they overlook issues such as redundant detections and the spatial distribution of false positives. In this paper, we formalize the requirements for VAD evaluation by introducing a set of axiomatic, verifiable properties that an evaluation metric should satisfy. Through a systematic analysis of state-of-the-art evaluation methods, we show that none satisfies all of the proposed properties. To address this gap, we introduce SAAM-ALARM, a novel evaluation metric that satisfies them all. Our results show that SAAM-ALARM provides a more nuanced and theoretically sound assessment, establishing a stronger standard for performance benchmarking in VAD.