AugMask: Score-Based Generative Modeling of Incomplete Tabular Data via Augmentation and Masking
Abstract
Score-based diffusion models have emerged as prominent deep generative models; however, applying them to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We propose AugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data via stochastic regularization. AugMask (1) completes inputs through conditional stochastic augmentation with lightweight auxiliary models and (2) masks the loss, so that augmented missing entries are used only for conditioning while supervision is restricted to observed coordinates. We connect AugMask to a Rao-Blackwellized objective and show that marginalizing over the missing entries yields a variance-weighted sensitivity penalty, which promotes invariance of the observed-coordinate reconstruction to uncertain missing entries. Across diverse datasets and missingness regimes, AugMask enables standard diffusion-based tabular generators to match or outperform specialized missing-aware baselines in both sample fidelity and downstream utility. The code will be released.
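To make the two steps concrete, the following is a minimal NumPy sketch of one augment-then-mask loss evaluation, written under several assumptions that are not taken from the paper. The helper names `augment` and `masked_denoising_loss` are hypothetical. The auxiliary model is replaced by a per-column Gaussian fitted to the observed values (an unconditional stand-in for the conditional auxiliary models the abstract mentions). The backbone is assumed to be any callable `denoiser(x_noisy, sigma)` that predicts the clean input, and missing entries are assumed to be encoded as NaN with a boolean mask `observed`.

```python
import numpy as np


def augment(x, observed, rng):
    """Fill missing entries with random draws (hypothetical stand-in auxiliary model).

    Each column's missing entries are sampled from a Gaussian whose mean and
    variance are estimated from that column's observed values.
    """
    x_filled = np.where(observed, x, 0.0)            # drop NaNs before arithmetic
    count = np.maximum(observed.sum(axis=0), 1)      # avoid division by zero
    mean = x_filled.sum(axis=0) / count
    var = np.where(observed, (x_filled - mean) ** 2, 0.0).sum(axis=0) / count
    draws = mean + np.sqrt(var) * rng.standard_normal(x.shape)
    return np.where(observed, x_filled, draws)       # observed values are untouched


def masked_denoising_loss(denoiser, x, observed, sigma, rng):
    """Denoising loss on a stochastically completed input, supervised on observed entries only."""
    x_aug = augment(x, observed, rng)                # step 1: stochastic augmentation
    x_noisy = x_aug + sigma * rng.standard_normal(x.shape)
    pred = denoiser(x_noisy, sigma)                  # backbone sees a fully specified input
    sq_err = (pred - x_aug) ** 2
    # step 2: mask the loss so augmented entries contribute no supervision
    return (sq_err * observed).sum() / max(observed.sum(), 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0]])
    observed = ~np.isnan(x)
    identity = lambda x_noisy, sigma: x_noisy        # toy "backbone" for illustration
    print(masked_denoising_loss(identity, x, observed, sigma=0.5, rng=rng))
```

Because a fresh augmentation is drawn on every call, the backbone is trained against many plausible completions of the same row, while the mask ensures that errors on the filled-in coordinates never enter the loss. In practice the denoiser would be a trained network and the augmentation would come from the auxiliary models; the per-column Gaussian here only illustrates the data flow.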