ModernVBERT: Towards Smaller Visual Document Retrievers
Abstract
Large-scale document retrieval (search) is key in many modern industrial AI pipelines to ground models with relevant contextual information. Increasingly, Visual Document Retrieval (VDR) models, which directly embed images of document pages, are used as an alternative to text-only retrievers. While these models have historically been repurposed generative VLMs fine-tuned for embedding tasks, in this paper we revisit this design choice and systematically develop strong VDR models from the ground up. Through controlled experiments, we isolate the impact of key training factors such as attention masking, multimodal data regimes, and contrastive objectives across all phases of training. Our findings confirm that current VDR performance is constrained by generative modeling, especially in multi-vector settings. Building on these insights, we train ModernVBERT, a 250M-parameter vision-language encoder that, when fine-tuned on document retrieval tasks, outperforms recent models up to 10 times its size. Thanks to its compact design, ModernVBERT enables efficient retrieval inference on CPU hardware while maintaining competitive performance. Models, code, and data are available in the public version of this work.