DocHop: Benchmarking Out-of-domain Multi-hop Reasoning in Information-Dense Documents
Abstract
Multimodal Large Language Models (MLLMs) have achieved strong performance on structured visual understanding tasks such as chart and document question answering. However, existing benchmarks typically evaluate these domains in isolation, overlooking realistic settings where numerical evidence in charts must be interpreted through surrounding narrative context. We introduce DocHop, a benchmark for integrated chart--context reasoning in document-style images. In DocHop, the document narrative specifies multi-step compositional constraints, while charts provide the corresponding data values. Questions are grounded in a semantic reference label defined in the narrative, requiring models to resolve target entities from context before aggregating evidence across multiple charts. To enable systematic evaluation, we construct DocHop via a stochastic logic-first generation pipeline with controllable reasoning depth and visual density, covering 1,876 examples across six task categories. Experiments on a wide range of proprietary and open-source MLLMs reveal a substantial gap to human performance: annotators achieve over 90\% accuracy, while the best model reaches only 60.18\%. Reasoning-enhanced models consistently perform better, but their performance degrades as reasoning complexity increases. Overall, DocHop provides a testbed for challenging multi-hop document reasoning.