Beyond Statistical Fidelity: Causal-Valid Synthetic EHR Generation for Low-Resource Clinical AI
Abstract
The scarcity of labeled electronic health records (EHRs) limits the development of clinical foundation models in low-resource settings. Existing synthetic data generators rely on statistical fidelity metrics that fail to capture clinical validity, and they often produce biologically implausible patient populations. We propose DataSynK, a causal-symbolic framework that integrates causal discovery, medical ontologies, and logical constraints to generate structurally valid synthetic EHRs. Experiments on Brazilian clinical data reveal a strong dissociation between statistical fidelity and clinical validity, and show that DataSynK achieves superior ontological validity and downstream classification utility. Our results suggest that structural validity should become a core evaluation criterion for trustworthy synthetic clinical data generation.