GeoReward: Mitigating Contextual Variable Overestimation in Vision-Language Models for Cross-Market Preference Prediction
Abstract
Vision-language models (VLMs) excel at many multimodal tasks but remain prone to a subtle yet impactful failure mode: they tend to overestimate dominant visual-textual cues while underestimating sparse but decision-critical contextual variables. This issue, which we term Contextual Variable Overestimation (CVE), becomes particularly evident in real-world applications such as predicting advertisement image preferences across diverse geographic markets. For instance, when a VLM (e.g., Qwen2-VL) is asked to choose between two product images tailored for different countries (e.g., Korea vs. France), it often collapses to a constant output (e.g., always selecting “A”), ignoring ground-truth regional variations. This collapse occurs because pervasive, high-volume signals, such as product attributes and dense image patches, overwhelm the few but critical tokens that encode market-specific context (e.g., country names). To address CVE, we first collect a new multimodal dataset of real advertising creatives and their click-through performance across multiple countries. We then introduce GeoReward, a reward model designed to predict ad image preferences across diverse geographic markets. GeoReward integrates three purpose-built mechanisms: (1) Market-Aware Retrieval Augmentation, which retrieves and injects region-aligned preference signals during training to sharpen localization awareness; (2) Context-Guided Visual Modulation, a lightweight adapter that dynamically adjusts visual representations using textual country embeddings, enabling fine-grained regional adaptation; and (3) Selective Sensitivity Loss, an objective that applies heightened penalties to context-specific mispredictions, strengthening the model's focus on critical variables. Furthermore, we demonstrate how GeoReward can guide RL fine-tuning of a VLM to generate background designs for text-to-image models (e.g., SDXL), producing market-aware advertising creatives.
Experiments validate that our framework mitigates CVE and outperforms existing baselines. This work not only diagnoses a systematic bias in VLMs toward dominant perceptual features but also delivers a targeted solution for applications where sparse contextual variables govern decision-making.
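To make the two model-side mechanisms concrete, the sketch below gives a minimal NumPy rendering, under stated assumptions: Context-Guided Visual Modulation is shown as FiLM-style scale-and-shift conditioning on a country embedding, and Selective Sensitivity Loss as a per-example re-weighted cross-entropy. The shapes, the 3x sensitivity weight, and the FiLM formulation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def context_guided_modulation(visual_feats, country_emb, W_gamma, W_beta):
    """FiLM-style sketch: scale and shift visual features with
    country-conditioned parameters (projection weights are hypothetical)."""
    gamma = country_emb @ W_gamma   # (batch, d_vis) scale offsets
    beta = country_emb @ W_beta     # (batch, d_vis) shifts
    return (1.0 + gamma) * visual_feats + beta

def selective_sensitivity_loss(logits, labels, context_sensitive,
                               base_weight=1.0, sensitive_weight=3.0):
    """Cross-entropy with a heavier penalty on context-sensitive examples.
    The 3x weight is an illustrative assumption, not the paper's value."""
    z = logits - logits.max(axis=1, keepdims=True)            # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]          # per-example CE
    weights = np.where(context_sensitive, sensitive_weight, base_weight)
    return float((weights * nll).mean())
```

In the full model, the modulated visual features would feed the reward head and the weighted objective would replace standard cross-entropy; both functions here are simplified stand-ins for the mechanisms the abstract names.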