Predicting evolutionary rate as a pretraining task improves genome language model representations
Abstract
Genome language models (gLMs) have the potential to further our understanding of regulatory genomics without requiring labeled data. Most gLMs are pretrained using sequence reconstruction tasks inspired by natural language processing, but recent studies have shown that these models often fail to capture biological signal. To overcome this, we introduce pretraining tasks that predict the rate of evolution. These tasks are designed to compose with sequence reconstruction, enabling a controlled comparison of predicting sequence only, evolutionary rate only, or both. To address gaps in existing evaluations, we also developed a suite of biologically grounded benchmarks. Across this suite, and on established variant effect prediction benchmarks, models pretrained on both sequence and evolutionary rate outperform those trained on sequence alone, and training on evolutionary rate can make even the relatively small models in our work competitive with much larger existing gLMs on some tasks. These results establish evolution as a key training target for genome-scale models.