Robust Vision-Language Models via Manifold-Adversarial Adapters
Abstract
Vision-language models (VLMs) have progressed rapidly with large-scale, high-quality data and adaptation strategies, yet they remain brittle under real-world corruptions, where both visual recognition and language-grounded reasoning degrade. Beyond cascaded image restoration, a natural alternative is parameter-efficient adaptation that aligns corrupted features with clean references; however, Euclidean alignment alone is not semantics-preserving and can even harm downstream reasoning. We attribute this to a semantic misalignment gap: features become geometrically closer to their clean counterparts while drifting off the in-distribution support on which multimodal reasoning is calibrated. To address this, we propose Manifold-Adversarial Adapters (MAA), lightweight layer-wise modules for a frozen vision encoder that explicitly steer corrupted features back onto the clean in-distribution manifold rather than merely shrinking feature-space distance. MAA combines paired feature self-distillation with a token-level adversarial manifold constraint that prevents off-manifold semantic shortcuts. At inference, only the adapters are retained, enabling single-stage robustness with negligible overhead while avoiding the latency and semantic drift of restoration pipelines. Across benchmarks and corruption settings, MAA consistently improves performance over strong baselines.
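The abstract names two training ingredients (paired feature self-distillation and a token-level adversarial manifold constraint) without giving the objective. The following is a minimal PyTorch sketch of one plausible instantiation under stated assumptions: a residual bottleneck adapter, an MSE distillation term against detached clean features, and a non-saturating per-token GAN loss. All module names, the bottleneck design, and the 0.1 loss weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Hypothetical residual bottleneck adapter for one frozen encoder layer."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, N, dim)
        return tokens + self.up(F.gelu(self.down(tokens)))

class TokenDiscriminator(nn.Module):
    """Per-token critic scoring whether a token lies on the clean manifold."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # -> (B, N) logits
        return self.net(tokens).squeeze(-1)

def maa_losses(clean_feats, corrupt_feats, adapter, disc):
    """Losses on a paired (clean, corrupted) batch of frozen-encoder token
    features, both shaped (B, N, D) for the same underlying images."""
    adapted = adapter(corrupt_feats)

    # Paired feature self-distillation: pull adapted corrupted tokens
    # toward the detached clean teacher features.
    distill = F.mse_loss(adapted, clean_feats.detach())

    # Token-level adversarial manifold constraint: the discriminator
    # separates clean vs. adapted tokens; the adapter is trained to make
    # adapted tokens indistinguishable from clean ones.
    real = disc(clean_feats.detach())
    fake = disc(adapted.detach())  # detached: this term updates only the critic
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
              + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

    fake_for_g = disc(adapted)  # not detached: gradients reach the adapter
    g_adv = F.binary_cross_entropy_with_logits(fake_for_g, torch.ones_like(fake_for_g))

    adapter_loss = distill + 0.1 * g_adv  # 0.1 is an assumed weighting
    return adapter_loss, d_loss
```

A usage sketch with synthetic features stands in for the frozen encoder; separate optimizers ensure each loss updates only its own module (the discriminator gradients leaked by the adversarial term are cleared before the critic step):

```python
B, N, D = 8, 197, 768
adapter, disc = Adapter(D), TokenDiscriminator(D)
opt_a = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
opt_d = torch.optim.AdamW(disc.parameters(), lr=1e-4)

clean = torch.randn(B, N, D)                    # stand-in for clean features
corrupt = clean + 0.3 * torch.randn(B, N, D)    # stand-in for corrupted features

a_loss, d_loss = maa_losses(clean, corrupt, adapter, disc)
opt_a.zero_grad(); a_loss.backward(); opt_a.step()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```

At inference, only `adapter` would be kept alongside the frozen encoder, matching the abstract's single-stage claim.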