D$^2$O: A Dual Debiasing Operator for Training-Free Test-Time Adaptation of Vision–Language Models
Yihong Luo ⋅ Wenwu He ⋅ Dong Liang ⋅ Yihang Zhou ⋅ Zhuo-Xu Cui
Abstract
Training-free test-time adaptation (TTA) for vision-language models (VLMs) can boost zero-shot classification under mild shifts but often collapses under severe environment/style shifts. We identify two shared failure modes: (i) retrieval confounding, where feature similarity is dominated by style and corrupts cache/bank evidence; and (ii) environment-biased priors, where VLM logits exhibit environment-dependent centered shifts that distort gating and prior-like terms. We propose D$^2$O, a strictly training-free debiasing operator that outputs three inference objects per test sample: a content feature for reliable retrieval, a style fingerprint for environment routing, and debiased logits for corrected priors. D$^2$O composes plug-and-play with cache-based and closed-form Gaussian adapters in both online and transductive settings. We further provide operator-to-decision guarantees: finite-difference covariance recovers a style subspace, style-routed EMA controls the error of the centered logit-bias estimate, and these errors translate into bounded perturbations of the posterior log-odds, yielding a margin-based condition for label invariance under strong shifts. Extensive experiments on diverse benchmarks show that our method consistently achieves state-of-the-art performance across a broad range of distribution shifts.
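The style-routed EMA bias correction mentioned above can be sketched as follows. This is a minimal illustrative sketch only: the routing by a discrete style identifier, the momentum value, and the function and variable names are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def debias_logits(logits, style_id, ema_bias, momentum=0.9):
    """Illustrative style-routed EMA debiasing (hypothetical sketch):
    keep a per-environment running estimate of the centered logit bias
    and subtract it from the raw logits at test time."""
    centered = logits - logits.mean()  # center to isolate the environment-dependent shift
    prev = ema_bias.get(style_id, np.zeros_like(centered))
    # exponential moving average of the centered bias, routed by style/environment
    ema_bias[style_id] = momentum * prev + (1.0 - momentum) * centered
    return logits - ema_bias[style_id]

# toy usage: two test samples routed to the same environment
ema_bias = {}
out1 = debias_logits(np.array([2.0, 1.0, 0.0]), style_id=0, ema_bias=ema_bias)
out2 = debias_logits(np.array([2.1, 1.1, 0.1]), style_id=0, ema_bias=ema_bias)
```

Because the subtracted bias estimate is centered (it sums to zero), this correction reshapes the relative logits within an environment without changing their overall sum.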