Position: VLM Causal Reasoning Benchmarks Should Probe Temporal Understanding, Not Presume It
Abstract
This position paper argues that vision-language model (VLM) benchmarks for causal reasoning rely on two under-examined assumptions. First, they presuppose temporal constitution, the understanding of time as the medium through which causes produce effects, without testing it as a prerequisite. Second, they do not adequately distinguish external symbolic scaffolding from internalized capability; scaffolding-invariance is the diagnostic signature of genuine internalization. Drawing on frameworks from art, philosophy, and psychoanalysis, we propose diagnostics that probe these foundations. Preliminary evidence from three VLMs shows a systematic disparity between fluent causal text and valid causal structure, as well as qualitatively different responses to identical scaffolding manipulations. None of these patterns indicates constitutive internalization. Progress requires benchmarks that test temporal understanding and scaffolding-invariance, not only output accuracy.