

Poster

Learning Event Completeness for Weakly Supervised Video Anomaly Detection

Yu Wang · Shiwei Chen

West Exhibition Hall B2-B3 #W-215
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Weakly supervised video anomaly detection (WS-VAD) is tasked with pinpointing temporal intervals containing anomalous events within untrimmed videos, using only video-level annotations. A significant challenge arises from the absence of dense frame-level annotations, which often leads to incomplete localization in existing WS-VAD methods. To address this issue, we present LEC-VAD, Learning Event Completeness for Weakly Supervised Video Anomaly Detection, which features a dual structure designed to encode both category-aware and category-agnostic semantics between vision and language. Within LEC-VAD, we devise semantic regularities that leverage an anomaly-aware Gaussian mixture to learn precise event boundaries, thereby yielding more complete event instances. In addition, we develop a novel memory-bank-based prototype learning mechanism to enrich the concise text descriptions associated with anomaly-event categories. This innovation bolsters the expressiveness of the text representations, which is crucial for advancing WS-VAD. LEC-VAD demonstrates marked improvements over current state-of-the-art methods on two benchmark datasets, XD-Violence and UCF-Crime.
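The paper's implementation is not reproduced on this page. As an illustration of how a Gaussian mixture can sharpen event boundaries from per-snippet anomaly scores, here is a minimal sketch; the two-component choice, the EM routine, and the 0.5 threshold are assumptions for exposition, not the authors' method:

```python
import numpy as np

def fit_gmm_1d(scores, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture to snippet scores via EM.
    Component 0 ~ normal snippets, component 1 ~ anomalous snippets
    (an illustrative stand-in for the paper's anomaly-aware mixture)."""
    scores = np.asarray(scores, dtype=float)
    mu = np.array([scores.min(), scores.max()])   # init: extremes
    var = np.full(2, scores.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per snippet
        lik = pi * np.exp(-(scores[:, None] - mu) ** 2 / (2 * var)) \
              / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, variances
        nk = resp.sum(axis=0)
        mu = (resp * scores[:, None]).sum(axis=0) / nk
        var = (resp * (scores[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(scores)
    return mu, var, pi, resp

def event_intervals(resp, thresh=0.5):
    """Group consecutive snippets whose anomaly posterior exceeds thresh
    into (start, end) intervals -- a simple completeness-oriented grouping."""
    mask = resp[:, 1] > thresh
    intervals, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(mask) - 1))
    return intervals

# Toy anomaly-score curve: mostly low scores with one high-scoring event
scores = [0.1, 0.12, 0.08, 0.85, 0.9, 0.88, 0.11, 0.09]
mu, var, pi, resp = fit_gmm_1d(scores)
print(event_intervals(resp))  # prints [(3, 5)] for this toy curve
```

The point of the mixture view is that a hard score threshold tends to fragment one event into several detections, whereas grouping by posterior membership recovers the full (3, 5) interval as a single complete instance.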

Lay Summary:

Imagine you’re monitoring security cameras but only know if something unusual happened in a long video, not when or what exactly. This is the challenge of weakly supervised video anomaly detection (WS-VAD). Most existing methods struggle to precisely locate anomalies because they lack detailed, frame-by-frame labels. Our solution, LEC-VAD, acts like a detective trained to notice both broad "vibes" (e.g., "something’s off") and specific clues (e.g., "a fight breaks out").

LEC-VAD uses two smart tools:

1. Anomaly "GPS": By modeling anomalies as a mix of typical patterns (like a Gaussian “blur” of possibilities), LEC-VAD sharpens the edges of suspicious events, avoiding vague or incomplete guesses.

2. Text Memory Bank: It stores concise descriptions of anomalies (e.g., "explosion," "loitering") to refine how the model "talks" about events, making detection sharper.

By combining visual clues with language, LEC-VAD outperforms current methods on real-world datasets like XD-Violence and UCF-Crime. This approach could improve surveillance systems, content moderation, or even self-driving cars by spotting dangers faster and with less human effort.
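To make the "Text Memory Bank" idea concrete, here is a hypothetical sketch of a per-class prototype memory updated by an exponential moving average and used to enrich a text embedding; the class `PrototypeMemory`, its update rule, and the mixing weight `alpha` are illustrative assumptions, not the paper's actual mechanism:

```python
import numpy as np

class PrototypeMemory:
    """Illustrative memory bank of per-class prototype vectors, updated by
    exponential moving average (EMA). A hypothetical stand-in for LEC-VAD's
    memory-bank-based prototype learning."""
    def __init__(self, n_classes, dim, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.protos = rng.normal(size=(n_classes, dim))
        self.protos /= np.linalg.norm(self.protos, axis=1, keepdims=True)
        self.m = momentum

    def update(self, cls, feat):
        """Blend a new feature into the class prototype, keeping unit norm."""
        feat = feat / np.linalg.norm(feat)
        self.protos[cls] = self.m * self.protos[cls] + (1 - self.m) * feat
        self.protos[cls] /= np.linalg.norm(self.protos[cls])

    def enrich(self, text_feat, alpha=0.5):
        """Mix a text embedding with its most similar stored prototype,
        returning the enriched embedding and the matched class index."""
        text_feat = text_feat / np.linalg.norm(text_feat)
        sims = self.protos @ text_feat          # cosine similarity (unit vectors)
        best = int(np.argmax(sims))
        out = alpha * text_feat + (1 - alpha) * self.protos[best]
        return out / np.linalg.norm(out), best

# Usage: update the bank with a visual feature, then enrich a text embedding
bank = PrototypeMemory(n_classes=3, dim=8)
visual_feat = bank.protos[1].copy()             # toy feature near class 1
bank.update(1, visual_feat)
enriched, matched = bank.enrich(visual_feat)
```

The EMA keeps each prototype a slowly moving summary of many observed features, so a short class name like "explosion" can borrow detail from the accumulated visual evidence rather than relying on the text alone.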
