

Poster

LangCell: Language-Cell Pre-training for Cell Identity Understanding

Suyuan Zhao · Jiahuan Zhang · Yushuai Wu · Yizhen Luo · Zaiqing Nie


Abstract:

Cell identity includes many crucial aspects such as cell type, pathway information, and disease information, essentially serving as a label enriched with biological insights. Understanding cell identity from transcriptomic data is an important task in bioinformatics. The single-cell pre-trained language models (PLMs) currently used for this task have only undergone unsupervised pre-training and lack an understanding of cell identity knowledge. As a result, they must be fine-tuned for downstream tasks and struggle when labeled data covering all target labels is unavailable. To address this, we propose an innovative solution: constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce LangCell, the first Language-Cell pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments on diverse benchmarks show that LangCell is the only single-cell PLM that works effectively in zero-shot cell identity understanding scenarios, and it also significantly outperforms existing models in few-shot and fine-tuning settings.
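The abstract describes the cell-text alignment only at a high level. A common way to realize such a unified cross-modal representation is a symmetric InfoNCE (CLIP-style) contrastive objective over paired embeddings; the sketch below is an illustrative assumption of that general technique, not LangCell's actual pre-training loss, and all function names here are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def cell_text_contrastive_loss(cell_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of `cell_emb` (from a single-cell encoder) and row i of
    `text_emb` (from a text encoder over a cell identity description)
    are assumed to describe the same cell; all other rows act as
    in-batch negatives. This is a generic sketch, not the paper's code.
    """
    c = l2_normalize(np.asarray(cell_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = c @ t.T / temperature          # (n, n) similarity matrix
    labels = np.arange(logits.shape[0])     # matching pairs lie on the diagonal

    def xent(lg):
        # Numerically stable cross-entropy with the diagonal as targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the cell-to-text and text-to-cell directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Under this kind of objective, zero-shot classification reduces to embedding each candidate identity description and picking the text whose embedding is most similar to the query cell's embedding, which matches the zero-shot capability claimed in the abstract.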
