Language Model Augmented Semi-Supervised Statistical Inference
Abstract
Semi-supervised statistical inference plays a key role in biomedical research, where labeled data often have higher quality but are limited due to costly clinical annotation. Yet existing semi-supervised statistical inference methods rely heavily on structured variables and strictly matched covariates between labeled and unlabeled datasets -- limitations ill-suited for the heterogeneity and unstructured nature of real-world biomedical data. Modern biomedical studies increasingly collect unstructured data (clinical notes, patient audio and video recordings), and inconsistent protocols across datasets cause covariate misalignment (for instance, detailed medication histories may be recorded in one study but not another). Recent advances in pre-trained multimodal large language models (LLMs), which excel at handling unstructured data, present an attractive potential solution. To transform this potential into rigorous semi-supervised statistical inference methods for biomedical research, two key challenges must be addressed: (1) How can we reliably integrate LLMs to enhance semi-supervised inference efficiency without compromising statistical validity? (2) How can these efficiency gains persist despite mismatched covariates between labeled and unlabeled datasets? In this paper, we tackle these challenges by systematically calibrating pseudo-labels provided by LLMs with a novel prediction-invariance identification strategy. Our resulting semi-supervised inference framework improves parameter estimation efficiency while maintaining full statistical validity, as demonstrated through our theoretical results and illustrated in a case study identifying key biomarkers for Alzheimer's disease detection with speech data.