Rethinking Genomic Modeling Through Optical Character Recognition
Hongxin Xiang ⋅ Pengsen Ma ⋅ Yunkang Cao ⋅ Di Yu ⋅ Haowen Chen ⋅ Xinyu Yang ⋅ Xiangxiang Zeng
Abstract
Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics: it wastes computation on low-information background regions and prevents understanding-driven compression for long contexts. Here, we present \textsc{OpticalDNA}, a vision-based framework that reframes genomic modeling as OCR-style document understanding. \textsc{OpticalDNA} renders DNA into structured visual layouts and trains an OCR-capable vision--language model with a \emph{visual DNA encoder} and a \emph{document decoder}, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, \textsc{OpticalDNA} defines prompt-conditioned objectives over core genomic primitives—reading, region grounding, subsequence retrieval, and masked span completion—thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, \textsc{OpticalDNA} consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly $20\times$ fewer effective tokens, and it surpasses models with up to $985\times$ more activated parameters while tuning only 256k \emph{trainable} parameters.
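As a rough illustration of the rendering step described in the abstract, the sketch below lays a DNA string out as a fixed-width two-dimensional "page" of pixel intensities, turning a 1D token sequence into a 2D layout that a vision encoder could consume. This is a minimal sketch under stated assumptions: the grid width, padding convention, and base-to-intensity mapping are illustrative choices, not \textsc{OpticalDNA}'s actual layout scheme.

```python
# Hypothetical rendering sketch: one base per grid cell, row-major layout.
# The mapping and grid parameters are assumptions for illustration only.
BASE_TO_PIXEL = {"A": 0, "C": 85, "G": 170, "T": 255, "N": 128}

def render_dna_page(seq: str, width: int = 8, pad: str = "N") -> list[list[int]]:
    """Lay out `seq` row by row into a width-column grid of pixel intensities,
    padding the final row with `pad` so every row has the same width."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq), width):
        line = seq[start:start + width].ljust(width, pad)
        rows.append([BASE_TO_PIXEL.get(b, BASE_TO_PIXEL[pad]) for b in line])
    return rows

# 12 bases laid out on an 8-column page -> 2 rows, last row padded with N.
page = render_dna_page("ACGTACGTACGT", width=8)
```

A real pipeline would rasterize such a grid (or rendered glyphs) into an image tensor before passing it to the visual DNA encoder; the point here is only the reframing of a linear sequence as a spatial document.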