BioToken and BioFM – Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models
Abstract
Existing genomic foundation models (GFMs) typically treat DNA as raw nucleotide sequences, often overlooking the regulatory context required to interpret genetic variation accurately. We introduce BioToken, a tokenization framework that directly encodes variants and biological annotations into genomic representations, and BioFM, a parameter-efficient model built on this framework. By leveraging biological inductive biases, BioFM outperforms state-of-the-art models and specialized baselines such as Enformer on benchmarks including pathogenicity and expression prediction, while requiring 100-fold less compute than current large-scale genomic models. These findings demonstrate that explicitly modeling biological structure yields more robust and efficient genomic representations than scaling alone.