OvisOCR: End-to-End Document Parsing via Aligning Specialized Perception with General Reasoning
Abstract
This paper presents OvisOCR, a lightweight, strictly end-to-end (E2E) Multimodal Large Language Model (MLLM) tailored for document parsing. Unlike current methods that rely on complex "Crop-OCR-Merge" cascades to handle high-resolution inputs, OvisOCR directly maps full-page visual signals to structured Markdown without localized slicing or dependence on layout detection. In extensive evaluations on the OmniDoc benchmark, OvisOCR achieves state-of-the-art performance, demonstrating that a compact E2E model can effectively "digest" the capabilities of intricate pipelines and surpass both specialized and general-purpose methods. Technically, OvisOCR establishes a holistic paradigm that synergizes specialized perception with general reasoning, distilling fine-grained recognition from OCR engines and semantic correction from LLMs into a unified model. To balance performance across diverse document constituents, we design category-specific reward mechanisms for distinct element types, such as dense text, complex tables, and formulas, so that the model strengthens its formatting accuracy for each element type concurrently. This approach resolves the optimization conflict among these objectives, ensuring that improvements in structural layout parsing do not come at the expense of fine-grained textual fidelity. Empirical results confirm that OvisOCR eliminates the error propagation inherent in split-and-merge architectures, offering a streamlined path for next-generation document intelligence.
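To make the category-specific reward idea concrete, the following is a minimal sketch, not the paper's implementation: it routes each parsed element to a reward function suited to its type. The simple difflib-based similarities and the function names (`text_reward`, `table_reward`, `formula_reward`, `category_reward`) are illustrative stand-ins for whatever metrics the actual training pipeline uses.

```python
from difflib import SequenceMatcher
import re


def _ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] via difflib's ratio (1.0 = exact match)."""
    return SequenceMatcher(None, a, b).ratio()


def text_reward(pred: str, ref: str) -> float:
    """Dense text: reward fine-grained character-level fidelity directly."""
    return _ratio(pred, ref)


def table_reward(pred: str, ref: str) -> float:
    """Tables: compare the sequence of cell contents so that structural
    errors dominate. (Placeholder for a structure-aware metric.)"""
    def cells(s: str) -> list[str]:
        # Split Markdown table source on pipes and newlines into cells.
        return [c.strip() for c in re.split(r"[|\n]", s) if c.strip()]
    return _ratio(" ".join(cells(pred)), " ".join(cells(ref)))


def formula_reward(pred: str, ref: str) -> float:
    """Formulas: ignore whitespace, which carries no meaning in LaTeX."""
    def squeeze(s: str) -> str:
        return re.sub(r"\s+", "", s)
    return _ratio(squeeze(pred), squeeze(ref))


REWARD_FNS = {
    "text": text_reward,
    "table": table_reward,
    "formula": formula_reward,
}


def category_reward(category: str, prediction: str, reference: str) -> float:
    """Dispatch each training sample to the reward for its element type,
    so each category is optimized against a criterion suited to it."""
    return REWARD_FNS[category](prediction, reference)
```

Because each element type is scored by its own criterion, gradient signal for, say, table structure never has to trade off directly against character-level text fidelity, which is the optimization conflict the abstract describes.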