GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
Amir Hossein Kargaran ⋅ Nafiseh Nikeghbal ⋅ Jana Diesner ⋅ François Yvon ⋅ Hinrich Schuetze
Abstract
OCR has improved quickly with vision-language models, but evaluation still focuses on a small set of high- and mid-resource scripts. We introduce GlotOCR Bench, a benchmark for OCR generalization across 100+ Unicode scripts, using clean and degraded images rendered from real multilingual text with Google Fonts, HarfBuzz, and FreeType. Evaluating both open and proprietary models, we find that most work well on fewer than ten scripts, and even the best models generalize to fewer than thirty. OCR performance relies heavily on script coverage in pretraining and visual recognition, with unfamiliar scripts often yielding noise or hallucinated lookalikes. We release the benchmark and rendering pipeline for reproducibility.