Do-Prompt: Causal Interventions Meet Variational Prompt Bottlenecks
Abstract
Multi-modal prompt learning is a parameter-efficient approach for adapting large vision--language models to downstream classification tasks. However, prompts can inadvertently evolve into a high-capacity pathway that encodes environment-dependent spurious correlations predictive only in the source domain, thereby undermining transferability. To address this issue, this paper introduces \textbf{Do-Prompt}, a \emph{compress-and-intervene} framework that brings together variational bottlenecks and causal interventions for robust prompt tuning. We model prompts as stochastic latent variables and impose a \emph{variational prompt bottleneck} to explicitly regulate the information transmitted through prompts, mitigating their propensity to memorize spurious nuisance cues. Building on this capacity constraint, we propose lightweight \emph{prompt-level interventions} that perturb the environment-related prompt components and enforce prediction consistency under these \textit{do}-style perturbations. Together, the two components encourage reliance on task-stable, invariant semantics rather than spurious prompt content. Notably, Do-Prompt is plug-and-play compatible with existing multi-modal prompt tuning pipelines and adds negligible computational overhead. Extensive experiments on base-to-novel generalization, cross-dataset transfer, and ImageNet distribution shifts demonstrate consistent performance gains, with particularly notable improvements on datasets exhibiting pronounced domain or texture biases.
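To make the two ingredients concrete, the following is a minimal PyTorch sketch of (i) a variational prompt bottleneck, where the prompt is sampled via the reparameterization trick and a KL term to a standard Gaussian caps its information content, and (ii) a prediction-consistency penalty under random prompt perturbations, used here as a crude stand-in for the paper's \textit{do}-style interventions on environment-related prompt components. All function names and the Gaussian-noise intervention are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def prompt_bottleneck(mu, logvar):
    """Sample a stochastic prompt and return its KL cost to N(0, I).

    The KL term is the variational bottleneck: it limits how much
    information the learned prompt can transmit. (Illustrative sketch.)
    """
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
    return z, kl

def intervention_consistency(model, x, prompt, sigma=0.1, n_interventions=2):
    """Penalize prediction drift under perturbed prompts.

    Gaussian noise on the prompt is a hypothetical stand-in for a
    do-intervention on its environment-related components; predictions
    from the clean prompt serve as the (detached) reference.
    """
    with torch.no_grad():
        ref = F.softmax(model(x, prompt), dim=-1)
    loss = 0.0
    for _ in range(n_interventions):
        perturbed = prompt + sigma * torch.randn_like(prompt)
        loss = loss + F.kl_div(
            F.log_softmax(model(x, perturbed), dim=-1),
            ref, reduction="batchmean",
        )
    return loss / n_interventions
```

In training, both terms would be added to the task loss with small weights, e.g. `task_loss + beta * kl + lam * intervention_consistency(...)`; the `model(x, prompt)` signature is assumed for illustration.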