IndiaCultureVL-Bench: Cultural Recognition and Prompt-Induced Collapse in Open Vision-Language Models
Abstract
We introduce IndiaCultureVL-Bench, a targeted benchmark of 100 synthetically generated images spanning 25 culturally distinctive Indian categories (food, festivals, decorative arts, textiles, music, dance, transport, monuments), paired with a 14-prompt suite that varies persona, locale, position, and identity framing. We evaluate four open vision-language models on item naming and country attribution: InternVL3-1B, Gemma3-4B, LFM 2.5 VL, and Qwen3-VL 2B. Country attribution is generally strong (means 84–99.9%), but item naming collapses on classical instruments (veena, tabla), dance forms (kathak), and signature dishes (lassi, butter chicken, tandoori chicken), with per-model means ranging from 45% to 84%. Crucially, prompt slots induce systematic, model-specific failures. A trailing non-Indian name (Yuki) drops InternVL3-1B’s country accuracy from 93% to 9%; a controlled ablation in Section D attributes most of this drop to the trailing-signature format and a smaller residual to the non-Indian cultural marker. A non-Indian-diaspora persona drives LFM 2.5 VL to relabel Indian items as Japanese (tempura, kimono, kochi lamp, sakura mochi). Any name prefix triggers refusals from Qwen3-VL 2B, often accompanied by self-identification such as “I’m Qwen, not Yuki”. We release the benchmark, the variant ontology that generates the prompt suite (six perturbation axes; reusable for new languages and cultures), and per-prompt outputs to support culturally grounded VLM evaluation.
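To make the "name prefix" and "trailing name" slots concrete, the following is a minimal illustrative sketch, not the released ontology. The base question wording, the `Variant` class, the `build_prompt` and `country_accuracy` helpers, and the substring-based scoring rule are all assumptions introduced here for illustration; the benchmark's actual templates and scoring may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical base question; the benchmark's real wording is not shown here.
BASE_QUESTION = "What is shown in this image, and which country is it from?"


@dataclass(frozen=True)
class Variant:
    """One prompt perturbation: an optional name placed in a given slot."""
    name: Optional[str] = None   # e.g. "Yuki"; None means no identity slot
    position: str = "none"       # "prefix", "trailing", or "none"


def build_prompt(variant: Variant, question: str = BASE_QUESTION) -> str:
    """Compose the text prompt for one variant (assumed template shapes)."""
    if variant.name is None or variant.position == "none":
        return question
    if variant.position == "prefix":
        # Name introduced before the question.
        return f"Hi, I'm {variant.name}. {question}"
    if variant.position == "trailing":
        # Name appended as a signature after the question.
        return f"{question}\n- {variant.name}"
    raise ValueError(f"unknown position: {variant.position!r}")


def country_accuracy(responses: List[str], target: str = "india") -> float:
    """Fraction of responses mentioning the target country.

    Assumed scoring rule: case-insensitive substring match, so "Indian"
    also counts as a hit.
    """
    if not responses:
        return 0.0
    hits = sum(target in r.lower() for r in responses)
    return hits / len(responses)
```

Under these assumptions, `build_prompt(Variant("Yuki", "trailing"))` yields the base question followed by a signature line, and `country_accuracy` can be computed separately for each variant to compare a perturbed prompt against the unperturbed baseline.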