Choosing Training-Time Calibration Objectives for Frozen Foundation-Model Features: A Linear-Probing Benchmark
Abstract
Calibration objectives for deep classifiers have historically been designed under end-to-end training. Foundation models, however, are increasingly used through frozen-feature adaptation, and full fine-tuning solely to recalibrate is often infeasible. Post-hoc temperature scaling is cheap but limited to a scalar transform of the logits. We ask whether calibration-aware linear probing, which relearns only the head under a calibration objective, can occupy the middle ground. Across 15 dataset–model settings spanning CLIP, DINOv2, same-domain CNNs, and cross-domain CNN transfer, we find no universally winning loss; instead, the best objective splits cleanly by representation family. CLIP gains, when present, come from a direct confidence–accuracy penalty. DINOv2 leaves little reliable headroom beyond temperature scaling. Same-domain CNNs favor confidence- and margin-sensitive reweighting, including a diagnostic V-family introduced here. Calibration-aware probing therefore serves both as a lightweight recalibration tool and as a diagnostic that exposes how frozen representations encode confidence. Objective choice is part of evaluating uncertainty on frozen foundation-model features, not a minor implementation detail.
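To make the post-hoc baseline concrete, the following is a minimal sketch of temperature scaling and of expected calibration error (ECE) with equal-width bins. It is illustrative only and not the benchmark's implementation: the function names, the toy logits, the bin count, and the grid search over temperatures (used here in place of the gradient-based fit that is more common in practice) are all assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of one logit vector after dividing by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(l, temperature)[y])
              for l, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the grid temperature that minimizes NLL on held-out data.

    Grid search is an assumption of this sketch, chosen for simplicity.
    """
    if grid is None:
        grid = [0.25 + 0.05 * i for i in range(96)]  # roughly 0.25 to 5.0
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

def ece(logits_batch, labels, temperature=1.0, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    # Each bin stores [count, summed confidence, summed correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for l, y in zip(logits_batch, labels):
        p = softmax(l, temperature)
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += float(pred == y)
    n = len(labels)
    # Sum over bins of (bin size / n) * |accuracy - mean confidence|.
    return sum(abs(s_conf - s_acc) for cnt, s_conf, s_acc in bins if cnt) / n

# Hypothetical held-out logits from an overconfident head: 3 of 4 are correct.
logits = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [4.0, 0.0]]
labels = [0, 1, 0, 1]

T = fit_temperature(logits, labels)
print("fitted temperature:", T)
print("ECE before:", ece(logits, labels, 1.0))
print("ECE after: ", ece(logits, labels, T))
```

Because every logit is divided by the same positive scalar, the predicted class never changes; only confidence is rescaled. This is the sense in which temperature scaling is limited to a scalar transform, whereas relearning the linear head can also reshape the decision boundaries.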