MyHealthDL: A Prior-Seeded Synthetic Malaysian Clinical Dataset for Multi-Task Deep Learning in Data-Scarce Healthcare Settings
Abstract
Malaysia’s Personal Data Protection Act 2010 restricts access to clinical records, creating a structural barrier to health AI research. We introduce MyHealthDL, a 10,000-record synthetic Malaysian clinical tabular dataset built through a two-stage pipeline: a parametric seed sampled from published NHMS 2019, MOH 2022/2023, and MDTR 2022 statistics, followed by CTGAN refinement to capture inter-feature correlations. Labels are derived from Malaysian clinical practice guidelines (CPGs), so experiments on the dataset test guideline-fidelity learning rather than clinical generalisation. We also provide a rule-based ceiling baseline, characterise the bias-variance trade-off of conditional oversampling for rare comorbidity structures, evaluate downstream utility via TSTRNP with two probe learners, random forest (RF) and XGBoost, and report ethnicity-stratified fidelity and fairness metrics. Code and data will be released upon acceptance (withheld for anonymous review).
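As an illustration of why guideline-derived labels imply a rule-based ceiling, the following minimal sketch draws a small parametric seed and labels it with a deterministic rule. It is not the released pipeline: the CTGAN refinement stage is omitted, and every name and number in it (PLACEHOLDER_MARGINALS, the HbA1c parameters, HYPOTHETICAL_THRESHOLD, and all function names) is a hypothetical placeholder rather than a published NHMS, MOH, or MDTR statistic or a CPG cut-off.

```python
import random

# Illustrative placeholder marginals; NOT published NHMS/MOH/MDTR figures.
PLACEHOLDER_MARGINALS = {
    "sex": {"female": 0.5, "male": 0.5},
    "age_band": {"18-39": 0.5, "40-59": 0.3, "60+": 0.2},
}
# Placeholder parameters for one continuous feature (HbA1c, in percent).
HBA1C_MEAN, HBA1C_SD = 6.0, 1.2
# Placeholder cut-off for the labelling rule; not taken from any CPG.
HYPOTHETICAL_THRESHOLD = 6.5


def sample_seed_record(rng):
    """Draw one record by sampling each feature independently from its marginal."""
    record = {
        name: rng.choices(list(probs), weights=list(probs.values()))[0]
        for name, probs in PLACEHOLDER_MARGINALS.items()
    }
    # Clip at a floor so the Gaussian draw cannot produce implausibly low values.
    record["hba1c"] = round(max(3.5, rng.gauss(HBA1C_MEAN, HBA1C_SD)), 1)
    return record


def guideline_label(record):
    """Deterministic rule standing in for a guideline-derived label."""
    return int(record["hba1c"] >= HYPOTHETICAL_THRESHOLD)


def build_seed(n, seed=0):
    """Build n labelled seed records with a fixed random seed for reproducibility."""
    rng = random.Random(seed)
    rows = [sample_seed_record(rng) for _ in range(n)]
    for row in rows:
        row["label"] = guideline_label(row)
    return rows


def rule_baseline_accuracy(rows):
    """Accuracy of a classifier that simply re-applies the labelling rule."""
    hits = sum(guideline_label(row) == row["label"] for row in rows)
    return hits / len(rows)


if __name__ == "__main__":
    rows = build_seed(1000)
    print(rule_baseline_accuracy(rows))
```

Because the labels in this sketch are produced by the same rule that the baseline applies, the baseline reproduces them exactly. A learned model can at best match that score, which is the sense in which such a baseline acts as a ceiling and in which accuracy measures guideline fidelity rather than clinical generalisation.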