Context-Aware Reasoner: Enhancing Contextual Reasoning in Multimodal Large Language Models
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable reasoning capabilities over internalized knowledge. However, current research overlooks contextual reasoning, the ability to reason over relevant information supplied in the context. To investigate this issue, we construct the Visual Contextual Reasoning Benchmark (ContextReasonV-Bench), and our analysis reveals two predominant failure modes: \textit{context neglect}, where models rely on pre-trained knowledge instead of contextual information, and \textit{superficial pattern matching}, where models exploit shallow correlations rather than genuine contextual patterns. To address these limitations, we propose a two-stage approach that progressively establishes and reinforces contextual pattern acquisition. The first stage establishes an "analyze-then-solve" reasoning paradigm through supervised fine-tuning (SFT). We then employ a context-aware reinforcement learning (RL) framework that integrates context-aware reward modeling with hierarchical advantage estimation to encourage the model to identify genuine contextual patterns. This approach yields the Context-Aware Reasoner (CAR), a model that achieves 38.14\% accuracy on ContextReasonV-Bench, an improvement of 22.09\% over the base model. Notably, CAR exhibits strong generalization to tasks not seen during training, confirming that our approach enhances genuine contextual reasoning capability.
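To make the RL stage concrete, the following is a minimal sketch of how a context-aware reward combined with hierarchical advantage estimation might look. All function names, the reward weighting, and the two-level (per-question group plus global) baseline are illustrative assumptions, not the paper's actual implementation.

```python
def context_aware_reward(answer_correct: bool, cites_context: bool) -> float:
    """Hypothetical reward: correctness plus a bonus for grounding the
    response in the provided context (weights are assumptions)."""
    reward = 1.0 if answer_correct else 0.0
    if cites_context:
        reward += 0.5  # assumed bonus encouraging use of contextual evidence
    return reward


def hierarchical_advantages(rewards_per_group, group_weight=0.5):
    """Hypothetical two-level advantage estimate: a local advantage within
    each question's rollout group, plus a weighted group-level advantage
    relative to the global mean (GRPO-style baselines at two levels)."""
    group_means = [sum(g) / len(g) for g in rewards_per_group]
    global_mean = sum(group_means) / len(group_means)
    advantages = []
    for group, g_mean in zip(rewards_per_group, group_means):
        group_adv = g_mean - global_mean  # how this question's group fares overall
        advantages.append([(r - g_mean) + group_weight * group_adv for r in group])
    return advantages
```

For example, a correct answer that cites the context receives reward 1.5, and rollouts from a question group whose mean reward exceeds the global mean all receive a small positive shift in advantage, reinforcing patterns that hold across the whole group rather than within a single rollout.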