Poster Tue, Jul 7, 2026 • 6:30 PM – 8:15 PM PDT HALL A #2709

BioToken and BioFM – Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models

Aleksandr Medvedev ⋅ Karthik Viswanathan ⋅ Praveenkumar Kanithi ⋅ Kirill Vishniakov ⋅ Prateek Munjal ⋅ Clement Christophe ⋅ Tiago Magalhaes ⋅ Marco Pimentel ⋅ Ronnie Rajan ⋅ Shadab Khan

Abstract

Existing genomic foundation models (GFMs) typically treat DNA as raw nucleotide sequences, often overlooking the regulatory context required to interpret genetic variation accurately. We introduce BioToken, a tokenization framework that directly encodes variants and biological annotations into genomic representations, and BioFM, a parameter-efficient model built on this framework. By leveraging biological inductive biases, BioFM outperforms state-of-the-art models and specialized baselines such as Enformer on benchmarks including pathogenicity and expression prediction, while requiring 100-fold less compute than current large-scale genomic models. These findings demonstrate that explicitly modeling biological structure yields more robust and efficient genomic representations than scaling alone.