

Poster

Cell2Sentence: Teaching Large Language Models the Language of Biology

Daniel Levine · Sacha Lévy · Syed Rizvi · Nazreen Pallikkavaliyaveetil MohammedSheriff · Xingyu Chen · Zhang · Ivan Vrkic · Sina Ghadermarzi · Ruiming Wu · Zihe Zheng · Antonio Henrique de Oliveira Fonseca · Josue Ortega Caro · Insu Han · Anna Zhong · Daphne Raskin · Amin Karbasi · Rahul Dhodapkar · David van Dijk


Abstract:

We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate that cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell-type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells from cell-type inputs and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications.
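The abstract does not spell out how a cell becomes a "cell sentence". One plausible reading, sketched below, is to list a cell's gene names in order of decreasing expression and join them with spaces. The function name `cell_to_sentence`, the `top_k` parameter, the tie-breaking rule, and the toy gene counts are all assumptions made for illustration; this is not the authors' released implementation.

```python
from typing import Optional, Sequence


def cell_to_sentence(
    gene_names: Sequence[str],
    expression: Sequence[float],
    top_k: Optional[int] = None,
) -> str:
    """Turn one cell's expression vector into a space-separated gene string.

    Assumption: genes are ranked by descending expression and unexpressed
    genes (value <= 0) are dropped. Ties are broken alphabetically so the
    output is deterministic.
    """
    if len(gene_names) != len(expression):
        raise ValueError("gene_names and expression must have the same length")

    # Keep only genes that are actually expressed in this cell.
    expressed = [
        (name, value)
        for name, value in zip(gene_names, expression)
        if value > 0
    ]

    # Highest expression first; alphabetical order among equal values.
    expressed.sort(key=lambda pair: (-pair[1], pair[0]))

    # Optionally truncate to the top_k most expressed genes.
    if top_k is not None:
        expressed = expressed[:top_k]

    return " ".join(name for name, _ in expressed)


# Toy example with made-up counts for four genes in a single cell.
genes = ["CD3E", "MS4A1", "LYZ", "GNLY"]
counts = [5, 0, 12, 1]
sentence = cell_to_sentence(genes, counts)
print(sentence)
```

A string of this form can be tokenized like ordinary text, which is what would let an off-the-shelf language model such as GPT-2 be fine-tuned on it with existing libraries.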
