Workshop Poster in Workshop: ICML 2021 Workshop on Computational Biology
Reference-free cell type annotation and phenotype characterisation in single cell RNA sequencing by learning geneset representations
Soroor Hediyeh-zadeh
In recent years, the widespread availability of single cell RNA sequencing (scRNA-seq) technology has led to the curation and distribution of diverse and comprehensive cell atlases and reference datasets, where cell identities are annotated by expert biologists. Many probabilistic and non-probabilistic solutions have been developed that utilise existing annotated datasets as a “reference” to characterise cells in newly acquired datasets in supervised or unsupervised ways. What these methods are unable to do, however, is characterise molecular phenotypes that are best studied by bulk RNA sequencing or microarray technologies, which predate scRNA-seq. Examples include breast cancer subtypes, which are determined using a defined set of fifty genes, the PAM50 gene signature. In fact, several geneset and cell marker databases have emerged from published bulk and single cell RNA sequencing studies, where phenotypes, transcriptional programs (e.g. signaling pathways) and cell types are described only by a list of genes, with no numerical attributes. We are interested in the problem of mapping phenotype and cell type similarities in scRNA-seq data from a collection of genesets or markers. We developed scDECAF, which uses a vector space model to label cells in a dataset. Gene lists are mapped to a common, shared latent space with single cell gene expression profiles, in which the correlation between the expression profiles of the cells and the pattern defined by the genesets is maximised. The latent spaces are determined using Canonical Correlation Analysis (CCA). The association between the cells and genesets is determined by the proximity of their representations in the CCA space and the transcriptome embedding space, resulting in annotation of the cells and associations with phenotypes. We have additionally developed a framework for selecting biologically relevant genesets when large geneset collections are examined.
Our results suggest that scDECAF has comparable performance to reference-based cell type annotation methods, and that it is able to recover known transcriptional programs in scRNA-seq datasets.
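The idea of placing cells and genesets in a shared CCA space can be illustrated with a minimal sketch. This is not the scDECAF implementation: it assumes a plain ridge-regularised CCA computed with NumPy, treats genes as the shared observations, uses a binary gene-by-geneset membership matrix, and labels each cell by cosine similarity in the CCA space only (the transcriptome embedding space is not used here). The function names (`cca`, `annotate_cells`), the regularisation value, and the simulated data are all hypothetical.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T


def cca(X, Y, n_components, reg=1e-2):
    """Ridge-regularised CCA. Rows of X and Y are shared observations (genes).

    Returns weight matrices A (columns of X) and B (columns of Y), plus the
    regularised canonical correlations in descending order.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # SVD of the whitened cross-covariance yields the canonical directions.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    A = Kx @ U[:, :n_components]
    B = Ky @ Vt.T[:, :n_components]
    return A, B, s[:n_components]


def column_correlations(M, V):
    """Pearson correlation between every column of M and every column of V."""
    Mc = M - M.mean(axis=0)
    Vc = V - V.mean(axis=0)
    Mc = Mc / np.linalg.norm(Mc, axis=0)
    Vc = Vc / np.linalg.norm(Vc, axis=0)
    return Mc.T @ Vc


def annotate_cells(X, G, geneset_names, n_components):
    """Label each cell (column of X) with its closest geneset (column of G)."""
    A, B, rho = cca(X, G, n_components)
    Xc = X - X.mean(axis=0)
    Gc = G - G.mean(axis=0)
    # Represent cells and genesets by their correlations with the
    # canonical variates (an illustrative choice of embedding).
    cell_emb = column_correlations(X, Xc @ A)
    set_emb = column_correlations(G, Gc @ B)
    cell_emb = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    set_emb = set_emb / np.linalg.norm(set_emb, axis=1, keepdims=True)
    sim = cell_emb @ set_emb.T  # cosine similarity, cells x genesets
    labels = [geneset_names[i] for i in sim.argmax(axis=1)]
    return labels, sim, rho


# Simulated toy data: 40 genes, 3 disjoint genesets, 4 cells per cell type.
rng = np.random.default_rng(0)
n_genes = 40
geneset_names = ["set_A", "set_B", "set_C"]
G = np.zeros((n_genes, 3))
for j in range(3):
    G[10 * j:10 * (j + 1), j] = 1.0  # binary membership, no numeric weights
true_type = np.repeat(np.arange(3), 4)
X = rng.normal(size=(n_genes, 12)) + 3.0 * G[:, true_type]

labels, sim, rho = annotate_cells(X, G, geneset_names, n_components=3)
print(labels)
```

In this sketch the genesets enter only as lists of member genes, mirroring the setting above where no numerical attributes are available. With real data one would expect far more cells than the toy example, and the choice of regularisation and number of components would need tuning.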