MORE: A Multilingual Document Parsing Benchmark and Evaluation
Long Xu ⋅ Binghong Wu ⋅ Tinghao Yu ⋅ Hao Feng ⋅ Zhenyu Huang ⋅ Haoqing Jiang ⋅ Yunhao Wang ⋅ Shuo Huang ⋅ Feng Zhang
Abstract
Multilingual documents encapsulate rich regional cultures, scientific discoveries, and historical records. Parsing this content into structured, machine-readable formats is critical for unlocking global knowledge. However, existing benchmarks predominantly focus on high-resource languages like English and Chinese, creating a significant $\textit{evaluation blind spot}$ concerning model performance on the vast spectrum of other languages. While recent Vision-Language Models (VLMs) claim support for hundreds of languages, the lack of comprehensive ground truth makes it impossible to empirically verify these capabilities. To bridge this gap, we introduce $\textbf{MORE}$, a large-scale, linguistically comprehensive benchmark designed for rigorous multilingual document parsing evaluation. MORE distinguishes itself through three key dimensions: (1) $\textbf{Unprecedented Scale}$: It covers $\textbf{149 languages}$, making it the most linguistically diverse benchmark to date; (2) $\textbf{Structural Complexity}$: Unlike previous works, it extends evaluation beyond plain text to include complex structural elements such as code blocks, tables, and catalogs; and (3) $\textbf{Data Authenticity}$: All samples are curated from real-world documents via a rigorous model-assisted, human-refined annotation pipeline. We conduct an extensive evaluation of state-of-the-art models using MORE, establishing new performance baselines for long-tail languages and validating the benchmark's effectiveness in diagnosing model capabilities in realistic, diverse scenarios.