Seeing Symbols, Missing Structure: A Real-World Handwritten Mathematical Expression Recognition Benchmark for Large Models
Abstract
Handwritten mathematical expression recognition (HMER) remains challenging in real-world educational scenarios, even with recent advances in large vision-language models. While these models often achieve high accuracy in local symbol transcription, their reliability in capturing two-dimensional mathematical structure under realistic handwritten conditions is still poorly understood. We introduce a real-world handwritten benchmark covering 13 categories of structurally complex expressions with authentic writing artifacts. Evaluations of large models reveal clear performance degradation as structural complexity increases, even when symbol-level accuracy is high. Most failures arise from structural mis-parsing and context-dependent symbol role confusion rather than pure visual perception errors. To mitigate this issue, we propose a training-free, schema-anchored, structure-aware inference framework that decomposes recognition into schema identification, schema-constrained transcription, and context-driven disambiguation. Our method improves ExpRate from 11.63\% to 24.52\% on Qwen-8B and generalizes well across multiple large models. Our benchmark provides a realistic evaluation of large models on handwritten mathematics, and our framework offers an effective and interpretable solution to structure-related failures in real-world HMER.
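The three-stage decomposition named in the abstract can be sketched as a minimal pipeline. Everything below is an illustrative assumption, not the paper's implementation: the function names, the toy schema registry, and the stubbed stage logic are all hypothetical, and in the actual framework each stage would be realized by prompting a vision-language model.

```python
# Hypothetical sketch of the schema-anchored, structure-aware inference
# pipeline: schema identification -> schema-constrained transcription ->
# context-driven disambiguation. All names and data here are illustrative.

# Toy registry of structural schemas as LaTeX templates (assumption).
SCHEMAS = {
    "fraction": r"\frac{{{num}}}{{{den}}}",
    "sqrt": r"\sqrt{{{radicand}}}",
}

def identify_schema(observation: dict) -> str:
    """Stage 1: pick the structural schema for the expression (stubbed)."""
    return observation["layout"]

def transcribe(observation: dict, schema: str) -> dict:
    """Stage 2: schema-constrained transcription of slot contents (stubbed)."""
    return observation["slots"]

def disambiguate(slots: dict, schema: str) -> dict:
    """Stage 3: context-driven symbol disambiguation, e.g. a handwritten
    'l' read as the digit '1' inside a numeric slot (toy rule)."""
    return {k: ("1" if v == "l" else v) for k, v in slots.items()}

def recognize(observation: dict) -> str:
    """Run the three stages and render the final LaTeX string."""
    schema = identify_schema(observation)
    slots = disambiguate(transcribe(observation, schema), schema)
    return SCHEMAS[schema].format(**slots)

# Toy input standing in for a handwritten image's intermediate readout.
obs = {"layout": "fraction", "slots": {"num": "l", "den": "2"}}
print(recognize(obs))  # -> \frac{1}{2}
```

The point of the sketch is the control flow: structure is committed to first, transcription is constrained to the schema's slots, and symbol identities are only finalized with slot context available, which is why structural mis-parses and role confusions are addressed separately from raw perception.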