Ariadne's Thread of LipSync: Unraveling Forgeries via Inconsistency between Lip Motions and Head Poses
Abstract
Recent advances in LipSync generation technology have enabled the creation of highly realistic videos, posing severe societal risks. However, existing defense strategies struggle against LipSync forgeries because state-of-the-art generative models not only optimize for lip synchronization but also largely eliminate visual artifacts, removing the cues that detectors typically rely on. Inspired by the inherent biological coupling between lip movements and head poses in natural speech, we observe that generative models fundamentally disrupt this global coordination while optimizing for local lip motion. In this paper, we propose LipDA, a novel framework for joint LipSync Detection and Attribution that exploits the inconsistency between head poses and lip motions. For detection, the framework learns to quantify this discrepancy by contrasting lip and pose features extracted from authentic and forged videos. For attribution, our method captures the distinctive temporal dynamics and audio-visual synchronization patterns that serve as generative fingerprints, enabling source tracing. To validate our approach, we conduct extensive experiments on two challenging LipSync benchmarks as well as on our newly constructed large-scale, multi-generator dataset, LipSyncBench-A. LipDA achieves over 97% AUC in detection and 97.5% accuracy in model attribution, significantly outperforming existing methods.
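The following is a minimal sketch of the detection idea only, assuming GRU encoders over per-frame lip-landmark and head-pose features and a margin-based contrastive objective; the encoder choices, feature dimensions, and margin value are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' code): score lip-pose consistency and
# train it contrastively so real clips score high and LipSync forgeries score low.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipPoseConsistency(nn.Module):
    """Embeds lip-motion and head-pose sequences and scores their agreement."""
    def __init__(self, lip_dim=40, pose_dim=6, embed_dim=128):
        super().__init__()
        # Assumed temporal encoders: GRUs over per-frame lip landmarks / pose angles.
        self.lip_enc = nn.GRU(lip_dim, embed_dim, batch_first=True)
        self.pose_enc = nn.GRU(pose_dim, embed_dim, batch_first=True)

    def forward(self, lip_seq, pose_seq):
        # lip_seq: (B, T, lip_dim), pose_seq: (B, T, pose_dim)
        _, h_lip = self.lip_enc(lip_seq)      # final hidden state: (1, B, embed_dim)
        _, h_pose = self.pose_enc(pose_seq)
        # Cosine similarity: high when lip motion and head pose stay coupled (real),
        # low when a generator drives the lips independently of the head (forged).
        return F.cosine_similarity(h_lip[-1], h_pose[-1], dim=-1)

def contrastive_loss(sim, is_real, margin=0.5):
    """Pull lip/pose embeddings together for real clips, push them apart for fakes."""
    pos = (1.0 - sim) * is_real                            # real: maximize similarity
    neg = torch.clamp(sim - margin, min=0.0) * (1.0 - is_real)  # fake: push below margin
    return (pos + neg).mean()

# Usage: a similarity below a validated threshold flags the clip as a LipSync forgery.
model = LipPoseConsistency()
lip = torch.randn(8, 64, 40)    # 8 clips, 64 frames, 40-D lip-motion features
pose = torch.randn(8, 64, 6)    # per-frame yaw/pitch/roll + translation
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(model(lip, pose), labels)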