TRACER: Persistent Regularization for Robust Multimodal Finetuning
Abstract
Finetuning pretrained multimodal models improves in-distribution performance but often degrades out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. We develop a theoretical framework for multimodal contrastive finetuning by introducing a contrastive target matrix that reformulates the objective as a matrix least-squares problem, yielding closed-form solutions and a geometric decomposition of how different strategies manage pretrained knowledge. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from late-stage collapse, in which the teacher--student gap vanishes precisely when OOD robustness is most vulnerable. We prove that a Weighted Moving Average (WMA) teacher, which integrates the full optimization trajectory, maintains a persistent regularizing force over finite horizons, enabling bias-free convergence in the task subspace while preserving knowledge in the orthogonal complement. These insights motivate TRACER (Trajectory-Robust Anchoring for Contrastive Encoder Regularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent gains in OOD accuracy and calibration across three backbone architectures. Comprehensive ablations across four axes (distillation components, regularization strength, update frequency, and kernel shape) confirm that TRACER is both principled and robust to hyperparameter choices.
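The EMA-versus-WMA distinction at the heart of the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: the update rules, the toy one-dimensional "parameter", and all constants (`beta`, the student's drift rate) are illustrative assumptions. It shows the EMA teacher's gap to the student collapsing late in training while a uniform-weight WMA teacher, averaging the whole trajectory, retains a persistent offset.

```python
# Hedged sketch contrasting two teacher-update rules (illustrative only):
#   EMA: geometric weights, dominated by recent student iterates
#   WMA: uniform weights over the entire optimization trajectory

def ema_update(teacher: float, student: float, beta: float = 0.99) -> float:
    """Standard EMA teacher: theta_T <- beta * theta_T + (1 - beta) * theta_S."""
    return beta * teacher + (1.0 - beta) * student

def wma_update(teacher: float, student: float, step: int) -> float:
    """Uniform weighted moving average (running mean) of all student iterates."""
    return (step * teacher + student) / (step + 1)

def simulate(num_steps: int = 2000, beta: float = 0.99):
    # Toy 1-D parameter: the student drifts geometrically from the
    # pretrained value (0.0) toward a hypothetical task optimum (1.0).
    student, ema_teacher, wma_teacher = 0.0, 0.0, 0.0
    for t in range(num_steps):
        student += 0.01 * (1.0 - student)           # student training step
        ema_teacher = ema_update(ema_teacher, student, beta)
        wma_teacher = wma_update(wma_teacher, student, t)
    # Teacher-student gaps: the EMA gap vanishes once the student settles,
    # while the WMA gap persists because early iterates keep their weight.
    return abs(ema_teacher - student), abs(wma_teacher - student)

ema_gap, wma_gap = simulate()
```

Under these assumptions the EMA gap decays toward zero as training converges, whereas the WMA gap stays bounded away from zero over the finite horizon, mirroring the "persistent regularizing force" the abstract attributes to WMA teachers.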