Criterion-Conditional In-Context Learning: Evaluating Criterion-Shift Adaptation in Vision-Language Models
Abstract
Vision-language models can perform new tasks without parameter updates through in-context learning (ICL), whose core mechanism is using the support set for task induction. In the standard ICL setting, once the task is induced, its decision boundary, i.e., the criterion, remains fixed. In real-world applications, however, many tasks have a stable high-level intent while their decision criteria shift with specific requirements. We therefore introduce a new test setting, denoted Criterion-Conditional In-Context Learning (CC-ICL), in which models must infer the latent criterion from context under fixed task semantics. To evaluate this capability, we propose two complementary metrics, Criterion-Sensitivity and Criterion-Invariance, which capture a model's adaptability and robustness under criterion shifts. We further construct CC-Bench, a multi-domain benchmark that supports evaluation under the CC-ICL setting through hierarchical annotation, enabling legitimate ground-truth variation under fixed tasks. Experiments on CC-Bench reveal that most models exhibit a "rigid boundary" bias, struggling to align their decisions with the latent criterion. We also find that even a simple multi-criteria training strategy can substantially reduce this bias, improving Criterion-Sensitivity and enabling 7B-scale models to surpass proprietary models without degrading general multimodal performance.