AlignedNorm: Prompting Vision–Language Models via Coupled Prompt Field
Abstract
Prompt learning for vision-language models (VLMs) typically follows end-to-end or decoupled routes to balance performance on base and new tasks, but both suffer from a fundamental bottleneck: sample-wise optimization within task-specific feature spaces traps models in local optima and hinders global optimality. To address this, our key insight is that VLMs can be prompted within a Coupled Prompt Field, a shared space in which base and new tasks mutually constrain each other. We present AlignedNorm, which enforces this field coupling: by dynamically aligning the norms of prompts to the VLM's native scale, it enables joint optimization of both tasks. Without complex designs, our method matches leading decoupled approaches on 15 datasets across 4 experimental settings, offering both a new perspective on and a practical solution to the local-optima dilemma in prompt learning.
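The core operation described above, rescaling learnable prompt vectors to the model's native embedding scale, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `align_prompt_norms` and the choice of the mean pretrained-token-embedding norm as the "native scale" target are assumptions made for the example.

```python
import torch


def align_prompt_norms(prompts: torch.Tensor,
                       token_embeddings: torch.Tensor) -> torch.Tensor:
    """Rescale learnable prompts to the VLM's native embedding scale.

    prompts:          (n_ctx, dim) learnable prompt vectors.
    token_embeddings: (vocab, dim) frozen pretrained token embeddings.

    Assumption: the "native scale" is taken to be the mean L2 norm of
    the pretrained token embeddings; the actual method may define it
    differently.
    """
    target_norm = token_embeddings.norm(dim=-1).mean()
    # Normalize each prompt to unit length, then scale to the target norm.
    unit = prompts / prompts.norm(dim=-1, keepdim=True).clamp_min(1e-12)
    return unit * target_norm


if __name__ == "__main__":
    torch.manual_seed(0)
    emb = torch.randn(1000, 512)          # stand-in pretrained embeddings
    prompts = torch.randn(4, 512) * 7.0   # prompts at an arbitrary scale
    aligned = align_prompt_norms(prompts, emb)
    print(aligned.norm(dim=-1))           # all entries equal the target norm
```

Because the rescaling is differentiable, it can be applied inside the forward pass so that gradient updates to the prompts never drift away from the shared scale that couples the base and new tasks.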