ST-Veto: Spatio-Temporal Token Veto for Diffusion MLLMs via Taylor Prediction and Visual Grounding
Abstract
Vision Language Models (VLMs) achieve strong reasoning with Chain-of-Thought (CoT) prompting, but their sequential generation is slow, accumulates errors, and offers little room for self-correction. Diffusion Multimodal Large Language Models (dMLLMs) unmask tokens in an order-agnostic process, improving efficiency and enabling self-correction, yet their reasoning behavior, and how to enhance it, remains underexplored. We propose Spatio-Temporal token Veto (ST-Veto), a training-free method that exploits the ability to observe all tokens at every diffusion step. ST-Veto vetoes temporally unstable tokens via second-order Taylor prediction of their confidence dynamics, filters weakly grounded tokens by their image attention mass, and swaps both with safer candidates. Across multiple dMLLMs and multimodal reasoning benchmarks, ST-Veto consistently outperforms standard decoding policies and prior VLM reasoning methods, improving accuracy by up to 9\% with no additional training or generation cost. Analyses show that ST-Veto steers generation toward higher-confidence, better-grounded decoding paths. We will release our code upon publication.
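To make the two veto criteria concrete, the sketch below illustrates one plausible reading of the abstract, assuming a standard finite-difference Taylor extrapolation of per-token confidence and a simple attention-mass threshold; the function names (`taylor_predict_confidence`, `st_veto_mask`) and the parameters `conf_drop_tol` and `attn_thresh` are hypothetical and not taken from the paper.

```python
import torch

def taylor_predict_confidence(conf_hist: torch.Tensor) -> torch.Tensor:
    """Second-order Taylor extrapolation of per-token confidence.

    conf_hist: (T, N) confidences for N tokens over the last T >= 3
    diffusion steps. Returns the predicted confidence at the next step.
    Finite-difference scheme is an assumption, not the paper's exact method.
    """
    c_t, c_tm1, c_tm2 = conf_hist[-1], conf_hist[-2], conf_hist[-3]
    first = c_t - c_tm1                # finite-difference 1st derivative
    second = c_t - 2 * c_tm1 + c_tm2   # finite-difference 2nd derivative
    return c_t + first + 0.5 * second  # Taylor step with unit step size

def st_veto_mask(conf_hist: torch.Tensor,
                 img_attn_mass: torch.Tensor,
                 conf_drop_tol: float = 0.0,
                 attn_thresh: float = 0.1) -> torch.Tensor:
    """Flag tokens that are temporally unstable OR weakly grounded.

    img_attn_mass: (N,) fraction of each token's attention on image tokens.
    Thresholds are illustrative; a flagged token would be swapped for a
    safer candidate (e.g. the next-best token under the model).
    """
    pred = taylor_predict_confidence(conf_hist)
    unstable = pred < conf_hist[-1] - conf_drop_tol  # confidence predicted to fall
    ungrounded = img_attn_mass < attn_thresh         # weak visual grounding
    return unstable | ungrounded
```

Under these assumptions, the mask would be computed once per diffusion step over all currently unmasked tokens, which is what the order-agnostic setting makes cheap relative to autoregressive decoding.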