A Progressive Evidence Localization Framework Based on Wasserstein Gradient Flows for Document Visual Question Answering
Abstract
Precise localization of evidence regions in Document Visual Question Answering (DocVQA) is crucial for model interpretability and reliability. However, most existing approaches localize evidence in a single step, which struggles to separate true evidence from irrelevant content when page semantics are complex or evidence regions are extremely small, leading to ambiguous boundaries and localization errors. To address these challenges, we propose a progressive evidence localization framework based on Wasserstein gradient flows, which reformulates evidence localization as an optimal transport-based optimization problem over probability distributions. Since continuous-time gradient flows are intractable in practice, we adopt the Jordan--Kinderlehrer--Otto (JKO) scheme to discretize the flow in time and derive an end-to-end trainable loss function that turns the theoretical framework into an objective a neural network can optimize directly. This formulation enables precise evidence localization through progressive refinement from coarse-grained to fine-grained regions. Experimental results demonstrate that our method significantly outperforms existing approaches in both evidence localization and answer generation, while providing an interpretable progressive reasoning process.
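For reference, the JKO scheme named above is, in its standard form, an implicit time-stepping rule for a gradient flow in Wasserstein space. The notation here is an assumption introduced for illustration and is not taken from the paper: $\rho_k$ denotes the evidence distribution over the page domain $\Omega$ at refinement step $k$, $\mathcal{F}$ is an energy functional whose specific form the abstract does not state, $\tau > 0$ is a step size, and $W_2$ is the 2-Wasserstein distance.

\[
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\Omega)}
\left\{ \mathcal{F}(\rho) + \frac{1}{2\tau}\, W_2^2(\rho, \rho_k) \right\}
\]

Under this reading, each step lowers the energy while penalizing large transport away from the previous distribution, which is one way to interpret the progressive refinement from coarse-grained to fine-grained regions.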