Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models
Abstract
Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AI-generated images become more realistic, however, semantic-level inconsistencies alone are often insufficient for reliable detection. This raises a critical question: can MLLMs achieve full-spectrum forensic signal perception, i.e., capture low-level generator artifacts without sacrificing their pre-trained semantic knowledge? To answer this, we conduct a layer-wise analysis of forensic signal perception in MLLMs and find that semantic information is encoded mainly in the early-to-middle layers, and that directly fine-tuning MLLMs for artifact learning causes rapid semantic forgetting. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM), which \textit{preserves early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer}, where they are fused with semantic token representations and propagated through the subsequent trainable layers. This design enables the later layers to jointly model semantic reasoning and signal-level forensic cues; surprisingly, the model also learns to adaptively leverage different levels of forensic signals depending on the input, yielding robust and generalizable detection. Extensive experiments show that our method achieves state-of-the-art performance across all benchmarks.
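To make the injection scheme concrete, the following is a minimal PyTorch-style sketch of the residual-path idea described above. It is an illustration under stated assumptions, not the paper's exact implementation: the names \texttt{DeepVRMSketch}, \texttt{artifact\_encoder}, \texttt{inject\_at}, and the simple additive fusion are all hypothetical, and the artifact branch is assumed to emit one feature vector per image.

\begin{verbatim}
import torch
import torch.nn as nn

class DeepVRMSketch(nn.Module):
    """Illustrative sketch of Deep-VRM-style residual injection.

    Assumptions (not from the paper): `backbone_layers` stands in for
    the MLLM's transformer blocks, `artifact_encoder` for a low-level
    forensic feature extractor producing a (B, out_dim) vector per
    image, and fusion is a simple broadcast addition.
    """
    def __init__(self, backbone_layers, artifact_encoder,
                 inject_at, hidden_dim):
        super().__init__()
        # Early-to-middle layers keep pre-trained semantic processing.
        self.early = nn.ModuleList(backbone_layers[:inject_at])
        # Later layers are trainable and fuse semantics with artifacts.
        self.late = nn.ModuleList(backbone_layers[inject_at:])
        self.artifact_encoder = artifact_encoder
        # Project artifact features into the token hidden dimension
        # (`out_dim` attribute is an assumed interface).
        self.proj = nn.Linear(artifact_encoder.out_dim, hidden_dim)

        # Freeze early layers to avoid semantic forgetting.
        for p in self.early.parameters():
            p.requires_grad_(False)

    def forward(self, tokens, image):
        # 1) Early layers encode semantics as in the pre-trained MLLM.
        h = tokens                           # (B, T, hidden_dim)
        for layer in self.early:
            h = layer(h)

        # 2) Residual path: inject artifact-specific visual signals
        #    at the intermediate layer, broadcast over all tokens.
        artifact = self.proj(self.artifact_encoder(image))  # (B, hidden_dim)
        h = h + artifact.unsqueeze(1)        # additive fusion (assumed)

        # 3) Trainable later layers jointly model semantic reasoning
        #    and signal-level forensic cues.
        for layer in self.late:
            h = layer(h)
        return h
\end{verbatim}

The single-point additive fusion is one simple design choice for the sketch; the actual fusion operator and injection depth in Deep-VRM may differ and are detailed in the method section.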