LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling
Daria Ledneva ⋅ Denis Kuznetsov
Abstract
Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers or BPE, which impose arbitrary sequence boundaries and may obscure biologically relevant structure. Although recent work has demonstrated the feasibility of adaptive hierarchical tokenization in autoregressive settings, its extension to masked language modeling and systematic downstream evaluation remain underexplored. We present \textbf{LDARNet}, a 120M-parameter hierarchical genomic foundation model that integrates learnable sequence compression into the masked language modeling paradigm by combining BiMamba-2 state-space layers with selective attention and ratio-based regularization to induce stable token boundaries without supervision. We evaluate LDARNet by fine-tuning on 27 tasks from the Genomics Benchmarks and Nucleotide Transformer suites, covering regulatory, epigenetic, and sequence-level prediction problems, and compare it against state-of-the-art models ranging from 8M to 2.5B parameters. LDARNet achieves 11 out of 18 wins among models under 300M parameters and attains state-of-the-art performance on 5 histone modification tasks, outperforming even substantially larger models on several long-range epigenetic benchmarks. These results indicate that adaptive hierarchical tokenization under masked language modeling can capture long-range genomic dependencies relevant to regulatory biology and highlight learnable compression as a promising direction for efficient and scalable genomic foundation models.
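The abstract's mention of ratio-based regularization for inducing token boundaries can be made concrete with a minimal sketch. The following is an illustrative assumption of what such a regularizer might look like, not the paper's actual implementation: it assumes per-position boundary logits and a target compression ratio, and penalizes deviation of the soft boundary rate from that target so boundaries stabilize without supervision.

```python
import torch

def ratio_regularizer(boundary_logits: torch.Tensor, target_ratio: float) -> torch.Tensor:
    """Hypothetical ratio-based boundary regularizer (illustrative sketch).

    boundary_logits: (batch, seq_len) unnormalized boundary scores.
    target_ratio: desired fraction of positions marked as boundaries,
        e.g. 0.25 for roughly 4x sequence compression.
    """
    # Soft boundary probabilities per nucleotide position.
    p = torch.sigmoid(boundary_logits)
    # Mean boundary rate per sequence.
    rate = p.mean(dim=-1)
    # Squared deviation from the target compression ratio, averaged over the batch.
    return ((rate - target_ratio) ** 2).mean()
```

Added to the masked language modeling loss with a small weight, a term of this form would discourage degenerate solutions (all or no boundaries) while leaving boundary placement itself to the model.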