FedPDG: Prediction Discrepancy–Guided Data Generation for Heterogeneous Federated Learning
Abstract
One emerging approach to mitigating data heterogeneity in Federated Learning (FL) is to employ diffusion models to generate synthetic data for clients, thereby aligning local data distributions with the global distribution. Prior work has primarily focused on balance-oriented augmentation, which assumes a balanced global class distribution and thus generates samples of rare classes to rebalance each client's local dataset. However, in practice, global data distributions are often inherently imbalanced. Moreover, privacy constraints in FL prevent the server from accurately estimating the global distribution, rendering balance-oriented augmentation suboptimal. This raises a key, underexplored challenge: how can synthetic data be generated and selected to align local distributions with the true, yet unknown, global distribution? Our key insight is that a model's performance implicitly reflects the distribution of the data on which it was trained. Based on this observation, we propose FedPDG, which uses the prediction discrepancy between the local and global models to identify the regions in which each client's local dataset is lacking, and generates the corresponding samples for that client. Furthermore, we adapt the diffusion model via preference optimization, enabling it to generate data that better aligns with the true global distribution. Extensive experiments on multiple benchmarks demonstrate that FedPDG outperforms state-of-the-art methods, achieving up to a 3.82\% improvement.
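For concreteness, the sketch below illustrates one plausible form of the discrepancy signal described above. It is a minimal assumption of this sketch, not the paper's stated formulation, that the discrepancy is measured as the per-class accuracy gap between the global and local models on a probe set available to the client; all function and variable names are illustrative.

```python
import torch


@torch.no_grad()
def class_discrepancy(local_model, global_model, probe_loader, num_classes, device="cpu"):
    """Per-class accuracy gap between the global and local models.

    Hypothetical proxy for the paper's prediction discrepancy: a large
    positive gap for class c suggests the client's local data
    under-represents c, so synthetic samples of c should be generated.
    """
    local_model.eval()
    global_model.eval()
    correct_local = torch.zeros(num_classes)
    correct_global = torch.zeros(num_classes)
    counts = torch.zeros(num_classes)
    for x, y in probe_loader:
        x, y = x.to(device), y.to(device)
        pred_local = local_model(x).argmax(dim=1)
        pred_global = global_model(x).argmax(dim=1)
        for c in range(num_classes):
            mask = y == c
            counts[c] += mask.sum().item()
            correct_local[c] += (pred_local[mask] == c).sum().item()
            correct_global[c] += (pred_global[mask] == c).sum().item()
    acc_local = correct_local / counts.clamp(min=1)
    acc_global = correct_global / counts.clamp(min=1)
    # Positive entries mark classes where the local model lags the global
    # model, i.e., regions where the local dataset appears to be lacking.
    return acc_global - acc_local
```

Under this reading, the client (or server) would rank classes by the returned gap and request synthetic samples from the diffusion model in proportion to it, rather than rebalancing toward a uniform class distribution.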