The interpretation of complex high-dimensional data typically requires dimensionality reduction techniques to extract explanatory low-dimensional representations. However, these representations may not be sufficient or appropriate to aid interpretation, particularly where dimensionality reduction is achieved through highly non-linear transformations. For example, in transcriptomics, the expression of many thousands of genes can be measured simultaneously, and low-dimensional representations can be developed to visualise and understand groupings of coordinated gene behaviour. Nonetheless, the underlying biology is ultimately driven by variation at the level of individual genes, and we would like to decompose that expression variability into a number of meaningful sub-components using a non-linear alternative to traditional mixed model regression analysis.
Gaussian Process Latent Variable Models (GPLVMs) offer a principled way of performing probabilistic non-linear dimensionality reduction and can be extended to incorporate additional covariate information that is available in real-life applications. For example, in transcriptomics, covariate information might include categorical labels (e.g. denoting known disease sub-populations), continuous-valued measurements (e.g. biomarkers), or censored information (e.g. patient survival times). However, the objective of such extensions in previous work has often been to boost the predictive or classification power of the GPLVM. For example, the supervised GPLVM uses class information to effectively build a distinct GPLVM for each class of data. Our motivation is discovery-led: we wish to understand the nature of the feature-level variability by separating the covariate effects from the contribution of latent variables, e.g. to identify sets of features which are fully explained by covariates. We principally do this in a setting with high-dimensional observations, where the number of features is vastly greater than the number of known covariates.
In this paper, we propose the Covariate Gaussian Process Latent Variable Model (c-GPLVM), which achieves this through a structured, sparsity-inducing kernel decomposition for the GPLVM. The decomposition allows us to explicitly disentangle variation in the observed data vectors into three parts: variation induced by the covariate inputs, variation induced by the latent variables, and interaction effects where the covariate inputs act in concert with the latent variables. The novelty of our approach is that the structured kernel permits both a non-linear mapping into a latent space in which confounding factors are already adjusted for and a feature-level decomposition of variation.
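One way to picture a structured kernel of this kind is as a sum of a covariate-only term, a latent-only term, and a product term for their interaction. The following is a minimal sketch under that assumption, using squared-exponential kernels; the function names (`rbf`, `decomposed_kernel`) and the variance parameters are illustrative and are not taken from the paper, and the sketch omits inference and the sparsity-inducing priors entirely.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def decomposed_kernel(x, z, var_x=1.0, var_z=1.0, var_xz=1.0):
    """Assumed additive-plus-interaction kernel over covariates x and latents z.

    Each variance scales one component; setting a variance to zero switches
    that component off, which is the role sparsity would play per feature.
    """
    k_x = rbf(x, x, variance=var_x)            # covariate-only effect
    k_z = rbf(z, z, variance=var_z)            # latent-only effect
    k_xz = var_xz * rbf(x, x) * rbf(z, z)      # covariate-latent interaction
    return {
        "covariate": k_x,
        "latent": k_z,
        "interaction": k_xz,
        "total": k_x + k_z + k_xz,
    }


# Hypothetical toy inputs: 5 samples, 1 covariate, 2 latent dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 1))
z = rng.normal(size=(5, 2))

# With the interaction variance set to zero, the feature is modelled as a
# purely additive combination of covariate and latent effects.
parts = decomposed_kernel(x, z, var_xz=0.0)
```

Because the three components enter additively, the share of a feature's variance attributable to each can be read off from the fitted variance parameters, which is the kind of feature-level breakdown the abstract describes.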
We demonstrate the utility of this model on a number of simulated examples and on applications in disease progression modelling from high-dimensional gene expression data in the presence of additional phenotypes. In each setting we show that the c-GPLVM is able to effectively extract low-dimensional structures from high-dimensional data sets whilst providing a breakdown of feature-level variability that other commonly used dimensionality reduction approaches do not offer.
Kaspar Märtens (University of Oxford)
Kieran Campbell (University of British Columbia)
Christopher Yau (University of Birmingham)
Related Events (a corresponding poster, oral, or spotlight)
2019 Poster: Decomposing feature-level variation with Covariate Gaussian Process Latent Variable Models
Wed Jun 12th 01:30 -- 04:00 AM Room Pacific Ballroom