Addressing Instrument-Outcome Confounding in Mendelian Randomization through Representation Learning
Abstract
Mendelian Randomization (MR) is a widely used method in observational epidemiology, designed to address unobserved confounding when estimating causal effects. It is closely related to instrumental variable (IV) methods, with genetic variants serving as instruments to infer causal relationships from observational data. However, the core assumptions required for valid IV analysis, particularly the independence between instruments and unobserved confounders, are untestable and often violated in practice. In MR, such violations commonly arise when genetic variants are correlated with environmental factors (e.g., through population stratification or assortative mating), leading to confounding between instruments and outcomes. At the same time, MR studies increasingly include data collected across multiple environments or populations, providing an opportunity to address these violations. Leveraging this setting, we propose a representation learning framework that exploits multi-environment data to recover latent exogenous components of genetic instruments suitable for causal inference. We provide theoretical insights into when and how the learned components can act as valid instruments, and we demonstrate the effectiveness of our approach through simulations and semi-synthetic experiments using genetic data from the All of Us Biobank.