Poster
Cell2Sentence: Teaching Large Language Models the Language of Biology
Daniel Levine · Syed Rizvi · Sacha Lévy · Nazreen Pallikkavaliyaveetil MohammedSheriff · David Zhang · Xingyu Chen · Sina Ghadermarzi · Ruiming Wu · Zihe Zheng · Ivan Vrkic · Anna Zhong · Daphne Raskin · Insu Han · Antonio Henrique de Oliveira Fonseca · Josue Ortega Caro · Amin Karbasi · Rahul Dhodapkar · David van Dijk
Hall C 4-9 #315
We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate that cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell-type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells from cell-type inputs and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework for integrating natural language processing with transcriptomics, using existing models and libraries for a wide range of biological applications.
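
To make the idea of a "cell sentence" concrete, here is a minimal sketch of one way such a transformation could look. It assumes a cell sentence is the list of gene names ordered by decreasing expression, with unexpressed genes dropped; the abstract does not spell out the exact procedure. The function name `cell_to_sentence`, the `top_k` parameter, and the toy expression values are all hypothetical and are not taken from the C2S codebase.

```python
from typing import Mapping, Optional


def cell_to_sentence(expression: Mapping[str, float],
                     top_k: Optional[int] = None) -> str:
    """Turn one cell's gene expression profile into a space-separated string.

    Assumption (not stated in the abstract): genes are ranked by decreasing
    expression and genes with zero expression are omitted.
    """
    # Keep only genes that are actually expressed in this cell.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by expression, highest first; break ties alphabetically for determinism.
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    # Optionally truncate to the top_k most expressed genes.
    if top_k is not None:
        ranked = ranked[:top_k]
    return " ".join(gene for gene, _ in ranked)


# Toy, made-up expression values for a single cell.
cell = {"CD3E": 5.0, "GAPDH": 12.0, "MS4A1": 0.0, "IL7R": 3.0}
sentence = cell_to_sentence(cell)
print(sentence)  # GAPDH CD3E IL7R
```

A string in this form can be passed to a standard text tokenizer, which is what would let an existing language model such as GPT-2 be fine-tuned on it without architectural changes.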