Scaling Laws in Model Fine-tuning for Audio DeepFake Detection
Abstract
Recent advances in audio deepfake detection have been driven by increasingly large speech foundation models and growing amounts of synthetic training data. Despite steady improvements across benchmarks, it remains unclear how detection performance scales with model capacity and training data under realistic deployment conditions, where detectors face distribution shift, signal corruption, and unseen synthesis pipelines. In this work, we present the first systematic study of scaling laws for audio deepfake detection in the post-training regime, focusing on fine-tuning rather than large-scale pretraining. Using a controlled family of speech foundation models that share architecture and pretraining, we analyze how detection performance, robustness, and generalization evolve as functions of model size and training data scale. Our evaluation covers multiple dimensions, including out-of-distribution datasets, common audio corruptions, cross-language generalization, and cross-TTS (Text-to-Speech) generalization to unseen speech synthesis systems. Across settings, we observe consistent but highly non-uniform scaling behavior that reveals a fundamental asymmetry between performance scaling and robustness scaling: larger detectors are more sample-efficient and consistently improve in-distribution detection performance, yet their gains in robustness and generalization, particularly under corruption, cross-language, and cross-TTS evaluation, are substantially weaker, and persistent error gaps remain even at the largest scales.