SONAR: Spectral‑Contrastive Audio Residuals for Generalizable Deepfake Detection
Ido Nitzan Hidekel ⋅ Gal Lifshitz ⋅ Khen Cohen ⋅ Dan Raviv
Abstract
Deepfake audio detectors often fail to generalize to unseen attacks, in part due to \emph{spectral bias}: neural networks prioritize low-frequency structure while under-exploiting subtle high-frequency (HF) artifacts left by generative models. We introduce \textbf{SONAR} (Spectral-cONtrastive Audio Residuals), a frequency-guided framework that \emph{explicitly enforces representation-level consistency} between semantic content and HF residuals. Unlike prior frequency-aware or dual-stream detectors that treat HF cues as auxiliary features, SONAR encourages structured interaction between content and noise representations in latent space. The model employs a dual-path architecture in which an XLSR encoder captures low-frequency content, while a parallel branch with learnable, value-constrained SRM high-pass filters distills HF residuals. The two representations are fused via frequency cross-attention and trained with a \emph{Jensen--Shannon alignment loss} that promotes LF–HF consistency for genuine audio and amplifies inconsistency for deepfakes. Evaluated on ASVspoof~2021 and in-the-wild benchmarks, SONAR achieves state-of-the-art performance in a \textbf{single-run} setting and converges up to \textbf{4$\times$ faster} than strong baselines. By mitigating the effects of spectral bias through frequency-guided alignment, SONAR provides a fully data-driven and architecture-agnostic approach to generalizable audio deepfake detection.
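To make the dual-path design concrete, the following is a minimal, illustrative PyTorch sketch of the components the abstract names: a small convolutional stand-in for the XLSR content encoder, a value-constrained high-pass branch approximating learnable SRM-style residual filters, cross-attention fusion of the two streams, and a Jensen–Shannon alignment loss that pulls LF and HF embeddings together for genuine audio and pushes them apart (here via a hinge margin) for spoofed audio. All module names, dimensions, and the exact constraint and margin scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batches of categorical distributions."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(dim=-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)


def alignment_loss(z_lf, z_hf, is_genuine, margin=0.5):
    """Hypothetical LF-HF alignment objective: low divergence is encouraged for
    genuine audio, while spoofed audio is pushed above a divergence margin."""
    jsd = js_divergence(F.softmax(z_lf, dim=-1), F.softmax(z_hf, dim=-1))
    return torch.where(is_genuine, jsd, F.relu(margin - jsd)).mean()


class ContentBranch(nn.Module):
    """Stand-in for the XLSR encoder: a small strided conv stack over raw audio."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
        )

    def forward(self, wav):                               # wav: (B, T)
        return self.net(wav.unsqueeze(1)).transpose(1, 2)  # (B, T_c, D)


class HighPassBranch(nn.Module):
    """Learnable 1-D high-pass residual filters (value-constrained SRM analogue)."""
    def __init__(self, dim=256, n_filters=32, k=7):
        super().__init__()
        self.filters = nn.Conv1d(1, n_filters, k, padding=k // 2)
        self.down = nn.Conv1d(n_filters, dim, kernel_size=10, stride=20)

    @torch.no_grad()
    def constrain(self):
        # Assumed constraint: clamp values and remove each kernel's DC component
        # so the filters remain high-pass while being learned.
        w = self.filters.weight.clamp_(-1.0, 1.0)
        w -= w.mean(dim=-1, keepdim=True)

    def forward(self, wav):                               # wav: (B, T)
        residual = self.filters(wav.unsqueeze(1))          # (B, F, T) HF residuals
        return self.down(residual).transpose(1, 2)         # (B, T_h, D)


class SONARSketch(nn.Module):
    """Dual-path model: LF content queries attend to HF residual keys/values."""
    def __init__(self, dim=256):
        super().__init__()
        self.lf = ContentBranch(dim)
        self.hf = HighPassBranch(dim)
        self.xattn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, 2)                      # spoof / genuine logits

    def forward(self, wav):
        c = self.lf(wav)                                   # (B, T_c, D)
        h = self.hf(wav)                                   # (B, T_h, D)
        fused, _ = self.xattn(c, h, h)                     # frequency cross-attention
        return self.head(fused.mean(1)), c.mean(1), h.mean(1)


if __name__ == "__main__":
    model = SONARSketch()
    wav = torch.randn(4, 16000)                            # 1 s of 16 kHz audio
    genuine = torch.tensor([True, False, True, False])
    logits, z_lf, z_hf = model(wav)
    loss = F.cross_entropy(logits, genuine.long()) + alignment_loss(z_lf, z_hf, genuine)
    loss.backward()
    model.hf.constrain()  # re-apply the high-pass value constraint each step
```

In this sketch the alignment term acts directly on pooled branch embeddings, so the classifier and the consistency objective share no parameters beyond the two encoders; any resemblance to the paper's exact fusion or loss weighting is an assumption.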