Interpreting Genomic Language Models using Sparse Autoencoders
Abstract
Genomic language models (gLMs) achieve strong performance across diverse genomic prediction tasks, but their internal biological representations remain poorly understood. Sparse autoencoders (SAEs) have emerged as an interpretability tool for vision and natural language models, yet their applicability to gLMs is unexplored. We present a systematic study of SAE-based interpretability for gLMs, introducing a diverse benchmark of human genomic annotations and a suite of genome-tailored interpretability metrics. Using Evo2 as a primary case study, we show that SAE features, particularly those from intermediate layers, are more interpretable than raw model embeddings in 42 of 55 (76\%) genomic concept evaluations, with 26 of those concepts achieving an F1 score above 0.7. We further find that interpretability depends on properties of the SAE training data, such as evolutionary proximity and context length: mixed-species and longer-context training improve recovery of human genomic features. Finally, we develop a graph-based representation method that constructs a feature atlas organizing the semantically related genomic concepts learned by an SAE, outperforming the baseline approach of using SAE model weights directly. Our results establish SAEs as a powerful framework for understanding gLMs, broadening their accessibility and utility for disease-driven genomic analysis.
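For readers unfamiliar with the technique, the sketch below shows the standard SAE setup this line of work builds on: an overcomplete ReLU autoencoder trained to reconstruct per-token embeddings under an L1 sparsity penalty. It is a minimal illustration under assumed settings, not the paper's implementation; the dimensions, sparsity coefficient, and names (`SparseAutoencoder`, `sae_loss`) are hypothetical.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE over per-token gLM embeddings.

    d_model is the gLM embedding width and d_hidden the (overcomplete)
    SAE dictionary size; both values here are illustrative.
    """

    def __init__(self, d_model: int = 4096, d_hidden: int = 32768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Sparse, non-negative feature activations.
        f = torch.relu(self.encoder(x))
        # Reconstruction of the original embedding from active features.
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()


# Toy usage: a batch of 8 embeddings standing in for activations
# extracted from an intermediate gLM layer.
sae = SparseAutoencoder()
x = torch.randn(8, 4096)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```

After training, the rows of the learned dictionary (and the inputs that maximally activate each feature) are what get matched against genomic annotations in evaluations like those summarized above.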