DataGuard: A Non-intrusive Dataset Auditing Framework via Differential Information Forensics
Abstract
Concerns over dataset misuse in deep learning have highlighted the need for effective auditing. Existing methods are intrusive: they require modifying the dataset, which can degrade model performance and introduce security risks. We instead present DataGuard, a non-intrusive framework for quantitative dataset auditing. Specifically, DataGuard integrates three key components: 1) a differential comparison between the target dataset and auxiliary non-training datasets; 2) an information-forensic analysis that establishes formal inequalities separating training data from non-training data; and 3) a multivariate statistical test that translates these discrepancies into rigorous auditing scores. Extensive experiments demonstrate that DataGuard detects both full and partial dataset usage without false positives and remains robust across diverse training scenarios, offering a principled, information-theoretic solution for transparent AI development.
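To make the pipeline concrete, the sketch below illustrates one way the third component (the multivariate statistical test) could be realized. It is a minimal illustration under stated assumptions, not the paper's actual procedure: the use of per-sample model statistics as the differential signal and the choice of a two-sample Hotelling's T² test are both placeholders introduced here for exposition, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import f as f_dist


def hotelling_t2_two_sample(X, Y):
    """Two-sample Hotelling's T^2 test for equal multivariate means.

    X: (n, d) per-sample statistics from the suspected training set.
    Y: (m, d) per-sample statistics from auxiliary non-training data.
    Returns the T^2 statistic and the p-value of the equivalent F test.
    """
    n, d = X.shape
    m, _ = Y.shape
    diff = X.mean(axis=0) - Y.mean(axis=0)
    # Pooled covariance estimate across the two samples.
    S_pooled = ((n - 1) * np.cov(X, rowvar=False)
                + (m - 1) * np.cov(Y, rowvar=False)) / (n + m - 2)
    T2 = (n * m) / (n + m) * diff @ np.linalg.solve(S_pooled, diff)
    # T^2 relates to an F distribution with (d, n + m - d - 1) dof.
    F = (n + m - d - 1) / (d * (n + m - 2)) * T2
    p = f_dist.sf(F, d, n + m - d - 1)
    return T2, p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical differential signals (e.g. loss, confidence, entropy)
    # collected from the audited model on target vs. auxiliary data.
    target_stats = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
    aux_stats = rng.normal(loc=0.3, scale=1.0, size=(200, 3))
    t2, p = hotelling_t2_two_sample(target_stats, aux_stats)
    print(f"T^2 = {t2:.2f}, p-value = {p:.4g}")
```

A small p-value here would indicate that the target dataset's statistics differ systematically from the auxiliary baseline, which is the kind of discrepancy an auditing score in this framework would quantify.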