From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
Abstract
Backdoor attacks can introduce deceptive behaviors into large language models, causing them to execute prohibited actions only when specific secret triggers appear in the input. Existing safety training methods largely fail to address this vulnerability due to the inherent difficulty of uncovering hidden triggers embedded within the model. Motivated by recent findings on LLMs’ situational awareness, we propose a novel post-training framework that cultivates backdoor self-awareness, enabling a poisoned LLM to precisely articulate its own implanted triggers. At its core, our approach introduces an inversion-inspired reinforcement learning framework that encourages models to introspectively reason about their own behaviors and gradually reverse-engineer the triggers responsible for misaligned outputs. Building on this precise trigger articulation, we further present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, with comparisons against six baseline methods, demonstrate that our approach has strong potential to improve the robustness of LLMs against backdoor risks.