The Abstraction Gap in Vision-Language Causal Reasoning
Abstract
Vision-language models (VLMs) generate fluent causal explanations for visual scenes, but does this fluency reflect genuine structural understanding? We address this question through a dual-probe methodology that isolates plausibility from faithfulness. The Text-Only Probe measures the linguistic quality of a model's free-form explanations; the Chain-Text Probe requires the model to first generate an explicit causal chain before producing its text response. We define the Abstraction Gap (AG) metric as the normalized performance difference between the two probes, operationalizing the plausibility-faithfulness distinction from explainable AI research. Applying this methodology to eight VLMs using CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find that seven of the eight models exhibit an AG exceeding 0.50: they score 6--8 on the Text-Only Probe but below 2.5 on the Chain-Text Probe, often producing blank chain outputs. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, indicating that explicit chain supervision cannot instill structural abstraction capability. Current VLMs optimize for plausible language without faithful structural understanding.
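To make the metric concrete, one natural formalization consistent with the reported numbers is the probe difference normalized by the Text-Only score; the exact normalization is an assumption here, since the abstract states only that AG is a normalized difference between probes:
\[
\mathrm{AG} = \frac{S_{\text{text}} - S_{\text{chain}}}{S_{\text{text}}},
\]
where $S_{\text{text}}$ and $S_{\text{chain}}$ denote a model's scores on the Text-Only and Chain-Text Probes, respectively. Under this reading, a model scoring $S_{\text{text}} = 7$ and $S_{\text{chain}} = 2.5$ yields $\mathrm{AG} = (7 - 2.5)/7 \approx 0.64$, in line with the reported gaps exceeding 0.50.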