Optimal Domain-Aware Privacy Mechanisms for Synthetic Data Generation
Abstract
Differential privacy (DP) imposes fundamental trade-offs between privacy and statistical fidelity in synthetic data generation. While access to public data has been shown to improve these trade-offs empirically, existing approaches exploit public data only indirectly, through pre-processing (e.g., using pre-trained generative models) or post-processing steps (e.g., matching target statistics estimated from public datasets), while still relying on domain-agnostic DP mechanisms. In this work, we develop a theoretical framework for the principled incorporation of public data into the DP mechanism itself. We consider normalized histograms as distribution estimators and characterize the \emph{theoretically optimal} domain-aware privacy mechanism within a class of mixing-based DP mechanisms. We introduce \textsc{PubMix}, a public-data-aware DP mechanism that can be used in histogram-based data synthesis pipelines. Our experiments demonstrate that, when public data is available, \textsc{PubMix} significantly improves synthetic data generation quality across tasks without compromising privacy.
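To make the mixing-based idea concrete, the following is a minimal illustrative sketch (not the \textsc{PubMix} mechanism itself, whose exact form is derived later in the paper): a private histogram is privatized with the standard Laplace mechanism, normalized, and then convexly mixed with a public histogram. The mixing weight \texttt{alpha} and all function names here are hypothetical choices for illustration; because mixing operates only on the already-privatized estimate, it is post-processing and preserves the $\varepsilon$-DP guarantee.

```python
import numpy as np

def dp_normalized_histogram(counts, epsilon, rng):
    """Laplace mechanism on counts (L1 sensitivity 1 per neighboring
    dataset), followed by post-processing into a valid distribution."""
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)  # clipping is post-processing
    total = noisy.sum()
    if total == 0.0:  # degenerate case: fall back to uniform
        return np.full_like(noisy, 1.0 / len(noisy))
    return noisy / total

def mix_with_public(private_counts, public_hist, epsilon, alpha, rng):
    """Convex mixture of a DP private histogram and a public histogram.
    alpha is an illustrative, fixed mixing weight; choosing it optimally
    in a domain-aware way is the subject of the paper."""
    dp_hist = dp_normalized_histogram(np.asarray(private_counts, float),
                                      epsilon, rng)
    return alpha * np.asarray(public_hist, float) + (1.0 - alpha) * dp_hist

rng = np.random.default_rng(0)
priv_counts = np.array([30.0, 50.0, 20.0])   # sensitive counts
pub_hist = np.array([0.3, 0.5, 0.2])         # public distribution estimate
mixed = mix_with_public(priv_counts, pub_hist, epsilon=1.0, alpha=0.5, rng=rng)
```

The resulting \texttt{mixed} vector is a valid probability distribution and can be sampled from to generate synthetic records, as in histogram-based synthesis pipelines.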