When Sample Selection Bias Precipitates Model Collapse
Abstract
Training on recursively generated synthetic data promises to alleviate data scarcity, but it introduces the risk of model collapse, wherein repeated training on synthetic data erodes distributional tails and homogenizes outputs. Current literature identifies data selection as a pivotal remedy, employing verifiers to prune datasets and retain synthetic samples that approximate the true data manifold. However, this approach hinges on the fragile and often unrealistic assumption that a perfect verifier possesses global distributional knowledge. In real-world settings characterized by data silos, such as fragmented healthcare consortia or proprietary financial institutions, this assumption is invalidated by the inherent fragmentation of knowledge. We theoretically prove that such siloed selection accelerates model collapse, driving diversity decay that follows a power law. To bridge this gap, we propose an automated filtering criterion that combines the sensitivity theorem with Wasserstein geometry. Specifically, multiple parties collaboratively compute geodesic interpolations and the Wasserstein barycenter as proxy measures, without exchanging raw data. These proxies serve as a collective reference, enabling the parties to score synthetic data jointly rather than relying on the single biased perspective of one data silo. Empirical results show that the baseline fails on skewed distributions, whereas our method effectively prevents collapse. Code available at Anonymous Github.
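The barycenter-as-collective-reference idea can be illustrated in one dimension, where the Wasserstein-2 barycenter of empirical distributions reduces to averaging their quantile functions. The sketch below is a minimal toy example under that simplification; all function names, the uniform party weights, and the quantile-grid approximation are illustrative assumptions, not the paper's actual protocol.

```python
# Toy sketch (assumption: 1D data, uniform weights). In 1D, the W2
# barycenter's quantile function is the mean of the parties' quantile
# functions, so each silo only shares a quantile profile, not raw data.
import numpy as np

def quantile_profile(samples, n_q=100):
    """Empirical quantile function evaluated on a fixed grid in [0, 1]."""
    qs = np.linspace(0.0, 1.0, n_q)
    return np.quantile(np.asarray(samples), qs)

def wasserstein_barycenter_1d(parties, n_q=100):
    """Average the parties' quantile profiles (uniform weights assumed)."""
    profiles = np.stack([quantile_profile(p, n_q) for p in parties])
    return profiles.mean(axis=0)

def w2_to_barycenter(candidate, barycenter_q, n_q=100):
    """Approximate W2 distance of a candidate batch to the barycenter,
    via the quantile-function representation of W2 in 1D."""
    cand_q = quantile_profile(candidate, n_q)
    return np.sqrt(np.mean((cand_q - barycenter_q) ** 2))

rng = np.random.default_rng(0)
# Three silos, each observing a biased slice of a common phenomenon.
parties = [rng.normal(-1, 1, 500), rng.normal(0, 1, 500), rng.normal(1, 1, 500)]
bary = wasserstein_barycenter_1d(parties)

# A broad synthetic batch scores better (lower distance) against the
# collective reference than one collapsed toward a single silo.
good = rng.normal(0, 1.2, 500)
narrow = rng.normal(-1, 0.3, 500)
assert w2_to_barycenter(good, bary) < w2_to_barycenter(narrow, bary)
```

Here the barycenter of the three biased silos recovers a centered reference, so a filtering rule that keeps low-distance batches favors diverse synthetic data over batches that mimic any single silo's skew.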