BIOARC: Discovering Optimal Neural Architectures for Biological Foundation Models
Abstract
Foundation models have revolutionized AI, yet biological applications often repurpose general-purpose architectures without accounting for the intrinsic structural and functional properties of distinct modalities, such as genomic and proteomic sequences. Consequently, these architectures lack the inductive biases required to capture the complex "grammars" inherent to biological data, resulting in suboptimal performance. To address this, we introduce BioArc, a framework that uses Neural Architecture Search (NAS) to shift from intuition-driven design to automated, data-driven discovery. Unlike standard NAS, which is restricted to homogeneous search spaces, BioArc navigates a heterogeneous space that permits open-ended composition of architectural blocks. By systematically analyzing the interplay between architecture, tokenization, and training across modalities, BioArc identifies novel hybrid architectures that surpass state-of-the-art models while being up to 25x smaller. We distill these findings into empirical design principles and validate their biological relevance, demonstrating how the discovered designs hierarchically capture the underlying biological grammar. Additionally, we introduce an agentic framework to predict optimal architectures for new tasks. Overall, BioArc provides a data-driven methodology for developing the next generation of efficient biological foundation models and task-specific networks.