Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Soumya Suvra Ghosal ⋅ Souradip Chakraborty ⋅ Vaibhav Singh ⋅ Furong Huang ⋅ Dinesh Manocha ⋅ Amrit Singh Bedi
Abstract
Reinforcement learning (RL)-based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large reasoning models (MLRMs). However, recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30–60\% (e.g., LlamaV-o1: 63.33\% $\rightarrow$ 5.74\% on JailbreakV-28K; R1-OneVision: 69.07\% $\rightarrow$ 5.65\% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20\% $\rightarrow$ 65.00\%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1–3 reasoning steps typically suffices to redirect the full generation toward safe completions.
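The abstract's monitor-and-inject loop can be sketched as follows. This is a minimal illustration, not the authors' released implementation: `generate_step`, `safety_score`, the 0.5 threshold, and the cap of three interventions are all hypothetical stand-ins, and the two stub functions below mock the MLRM and the safety reward model so the example runs end to end.

```python
from typing import List

SAFETY_THRESHOLD = 0.5        # hypothetical satisficing bar, not from the paper
CORRECTIVE_PREFIX = "Wait, think safely."
MAX_STEERING_STEPS = 3        # abstract: the first 1-3 steps typically suffice

def generate_step(prompt: str, trace: List[str]) -> str:
    # Stub for one decoding step of the MLRM.
    return f"step {len(trace) + 1}: reasoning about '{prompt}'"

def safety_score(trace: List[str]) -> float:
    # Stub for the safety reward model scoring the partial trace in [0, 1];
    # toy behavior: flag the first step as unsafe.
    return 0.0 if len(trace) == 1 else 1.0

def safethink_decode(prompt: str, num_steps: int = 5) -> List[str]:
    trace: List[str] = []
    interventions = 0
    for _ in range(num_steps):
        trace.append(generate_step(prompt, trace))
        # Satisficing constraint: steer only when the threshold is violated,
        # and only within the first few reasoning steps.
        if (safety_score(trace) < SAFETY_THRESHOLD
                and interventions < MAX_STEERING_STEPS):
            trace.append(CORRECTIVE_PREFIX)  # inject the corrective prefix
            interventions += 1
    return trace

if __name__ == "__main__":
    for line in safethink_decode("a possibly unsafe request"):
        print(line)
```

Because the check is a constraint rather than an objective, the defense leaves safe traces untouched and pays the intervention cost only on flagged generations.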