Very Efficient Listwise Multimodal Reranking for Long Documents
Abstract
Listwise reranking is a critical yet costly component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. Although recent VLM-based rerankers achieve strong accuracy, they are often impractical because long visual-token inputs and autoregressive decoding both inflate latency. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks: it shortens the input via query-image early interaction and eliminates multi-step generation by scoring all candidates in a single forward pass. ZipRerank is trained with a two-stage recipe: listwise pretraining on large-scale text reranking data rendered as images, followed by multimodal finetuning with VLM-teacher supervision and a soft-ranking objective to handle noisy rankings. Extensive experiments on the MMDocIR benchmark demonstrate that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing VLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. Source code is available at https://anonymous.4open.science/r/ZipRerank.
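The abstract does not specify the form of the soft-ranking objective; one common choice for distilling noisy teacher rankings is a ListNet-style cross-entropy between the teacher's and student's softmax score distributions, sketched below. The function name and the use of raw scalar scores are illustrative assumptions, not the paper's implementation.

```python
import math

def _softmax(xs):
    # Numerically stable softmax over a list of scalar scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_rank_loss(student_scores, teacher_scores):
    """ListNet-style soft-ranking loss (illustrative sketch, not the
    paper's exact objective): cross-entropy between the teacher's and
    student's top-one probability distributions over candidates.
    Because the target is a graded distribution rather than a hard
    permutation, small errors in a noisy teacher ranking contribute
    proportionally small gradients."""
    p = _softmax(teacher_scores)   # soft teacher targets
    q = _softmax(student_scores)   # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

For example, a student that orders candidates the same way as the teacher incurs a lower loss than one that inverts the ranking, while near-ties in the teacher scores penalize disagreement only mildly.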