Uni-DocRobust: Universal Plug-and-Play Robustness Enhancement for Multi-modal LLMs via Feature Restoration
Abstract
Real-world degradations, such as noise, blur, and low resolution, significantly impair the performance of Multi-modal Large Language Models (MLLMs) on document understanding tasks. Despite recent advances, progress is hindered by two critical bottlenecks: the scarcity of large-scale, aligned training data needed to learn robustness, and the lack of restoration solutions that transfer across diverse MLLM architectures. To bridge the data gap, we first present DocRobust-VQA, a large-scale dataset explicitly constructed to support robustness training. Comprising 189K aligned low/high-quality document image pairs and 417K QA pairs, it provides the first substantial corpus for fine-tuning MLLMs to handle varying degradation conditions. Leveraging this data, we propose Uni-DocRobust, a universal plug-and-play framework that decouples restoration capability from any specific visual encoder. Our method employs a frozen Universal Restoration Core, pre-trained in a canonical feature space via multi-teacher distillation, which can be seamlessly integrated into target MLLMs (e.g., Qwen-VL, InternVL) through lightweight Feature Adapters. Extensive experiments demonstrate that Uni-DocRobust significantly improves the robustness of MLLMs under degradation and enables a cost-effective ``pre-train once, deploy everywhere'' deployment paradigm.
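The plug-and-play integration described above can be illustrated with a minimal PyTorch-style sketch. Everything here is an assumption for exposition, not the paper's actual API: the class names (`FeatureAdapter`), the transformer stand-in for the Universal Restoration Core, and the feature dimensions are all hypothetical. The key idea shown is that a small trainable adapter projects the target MLLM's visual features into the frozen core's canonical space and back.

```python
# Minimal sketch, assuming a PyTorch-style interface; all names and
# dimensions below are illustrative, not the paper's actual API.
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    """Lightweight trainable bridge between a target MLLM's visual
    feature space and the canonical space of the frozen core."""

    def __init__(self, mllm_dim: int, canonical_dim: int):
        super().__init__()
        self.to_canonical = nn.Linear(mllm_dim, canonical_dim)
        self.from_canonical = nn.Linear(canonical_dim, mllm_dim)

    def forward(self, feats: torch.Tensor, core: nn.Module) -> torch.Tensor:
        z = self.to_canonical(feats)       # project into canonical space
        with torch.no_grad():              # the restoration core stays frozen
            z = core(z)                    # restore degraded features
        return self.from_canonical(z)      # project back for the MLLM


# Hypothetical stand-in for the pre-trained Universal Restoration Core.
core = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=2,
)
for p in core.parameters():                # "pre-train once": keep it frozen
    p.requires_grad_(False)

adapter = FeatureAdapter(mllm_dim=1152, canonical_dim=1024)  # only this trains
visual_feats = torch.randn(1, 256, 1152)   # e.g., ViT patch tokens from a degraded page
restored = adapter(visual_feats, core)     # plug-and-play feature restoration
print(restored.shape)                      # torch.Size([1, 256, 1152])
```

Because only the per-model adapters are trained while the core remains frozen, the cost of pre-training the core is amortized across all target MLLMs, which is what enables the ``pre-train once, deploy everywhere'' paradigm.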