STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
Xutao Mao ⋅ Liangjie Zhao ⋅ Liutao ⋅ Xiang Zheng ⋅ Hongying Zan ⋅ Cong Wang
Abstract
Red-teaming Vision-Language Models is essential for identifying vulnerabilities in which adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, providing only terminal toxicity scores while remaining temporally opaque about when and how toxic semantics emerge during multi-step synthesis. We introduce $\textbf{STARE}$, a hierarchical reinforcement learning framework that treats the denoising trajectory as an exploitable attack surface. By synergizing a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE achieves a 68\% improvement in Attack Success Rate over state-of-the-art baselines, including both black-box and white-box variants. More importantly, we reveal the Optimization-Induced Phase Alignment phenomenon: while vanilla models exhibit diffuse toxicity, adversarial optimization systematically concentrates conceptual harms into early semantic phases and detail-oriented harms into late refinement. This temporal alignment turns toxicity formation from a chaotic process into a series of predictable vulnerability windows, transforming red-teaming from trial-and-error into targeted structural analysis. Our work provides both a potent attack engine and a diagnostic foundation for developing next-generation, phase-aware safety mechanisms. Content warning: This paper contains examples of toxic content that may be offensive or disturbing.
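As background for readers unfamiliar with GRPO, the following is a minimal sketch of its core idea: advantages are computed by normalizing each sampled response's reward against the statistics of its sampling group, avoiding a learned value critic. The function name `grpo_advantages` and the toy rewards are illustrative, not part of STARE.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage estimation as used in GRPO.

    Each reward in the sampled group is standardized against the
    group mean and standard deviation, so no separate critic network
    is needed to produce a baseline.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Toy group of toxicity rewards for G = 3 sampled generations
advs = grpo_advantages([1.0, 2.0, 3.0])
```

In STARE's setting, such group-relative advantages would weight policy-gradient updates for both the prompt editor and the T2I model; the exact reward design is described in the paper body.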