USE: A Unified Self-Ensembling Framework for Test-Time Prompt Tuning
Abstract
Test-time adaptation (TTA) has emerged as a popular paradigm for improving the performance of vision–language models (e.g., CLIP) on downstream tasks. Among existing CLIP-based TTA methods, Test-Time Prompt Tuning (TPT) is a pioneering work that optimizes textual prompts using multiple test-time augmentations, and it remains a strong baseline to date. In this work, we revisit TPT and reveal that its optimization can be interpreted as implicitly learning from self-generated pseudo labels. Building on this perspective, we propose USE, a unified self-ensembling framework that jointly refines the optimization and inference stages. During optimization, we introduce a simple yet effective self-ensembling (SE) strategy that adaptively emphasizes the test image itself over its augmented views to obtain more reliable pseudo labels. To fully exploit the potential of augmentation, we further apply the same strategy at inference time, unifying the objectives of both stages. Notably, SE can also serve as a lightweight, training-free TTA method on its own. Extensive experiments across multiple datasets demonstrate that SE and USE outperform their respective counterparts. Furthermore, SE yields consistent performance improvements when integrated with existing TTA methods.
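To make the self-ensembling idea concrete, the following is a minimal PyTorch sketch of how predictions from the original test image and its augmented views might be combined with an adaptive weight on the original view. The entropy-based weighting rule, the temperature `tau`, and the function name `self_ensemble_logits` are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_ensemble_logits(logits_orig: torch.Tensor,
                         logits_aug: torch.Tensor,
                         tau: float = 1.0) -> torch.Tensor:
    """Sketch of a self-ensembling (SE) combination.

    logits_orig: (C,)   logits of the unaugmented test image
    logits_aug:  (N, C) logits of N augmented views
    tau:         softmax temperature (hypothetical knob)
    """
    p_orig = F.softmax(logits_orig / tau, dim=-1)
    p_aug = F.softmax(logits_aug / tau, dim=-1).mean(dim=0)

    # Adaptive weight: the more confident (lower-entropy) the original
    # view's prediction, the more the ensemble leans on it. This rule is
    # an assumption for illustration; the paper's rule may differ.
    ent = -(p_orig * p_orig.clamp_min(1e-12).log()).sum()
    max_ent = torch.log(torch.tensor(float(p_orig.numel())))
    w = 1.0 - ent / max_ent  # in [0, 1]; 1 = rely fully on original view

    # The ensembled distribution can serve as the pseudo label during
    # prompt optimization, or as the final prediction at inference.
    return w * p_orig + (1.0 - w) * p_aug
```

Using the same combination rule in both stages is what unifies the optimization and inference objectives: the pseudo label that guides prompt tuning and the final prediction are produced by one ensembling operator.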