Attention Hijacking: Backdooring Text Dataset Distillation via Semantic Anchors
Abstract
Dataset Distillation (DD) has emerged as a promising technique for compressing large-scale datasets into compact synthetic sets while preserving model performance. However, the security implications of this paradigm, particularly for Transformer-based text classification, remain underexplored. In this paper, we identify "Distilled Attention Labels" as a pivotal yet overlooked vulnerability. We propose Attention Hijacking (AH), a stealthy backdoor attack that manipulates the bi-level optimization process of distillation so that the synthetic data explicitly hijacks the attention mechanism of models trained on it. Unlike traditional poisoning, which often compromises clean accuracy, AH achieves stealthiness without utility degradation. To explain this, we formulate the "Semantic Anchoring Hypothesis", which characterizes the interaction between trigger semantics and the attack mechanism. We show that AH acts as a semantic-adaptive mechanism: when triggers align with domain-specific semantic anchors (e.g., "film" in sentiment analysis), it achieves a synergistic effect, boosting both the Attack Success Rate (ASR > 99%) and Clean Test Accuracy (CTA). Conversely, for functional or noise triggers, AH enforces attention segregation to prevent utility collapse, maintaining exceptional robustness where baseline attacks fail. Extensive experiments across multiple datasets and model scales, ranging from BERT-Tiny to BERT-Base, validate the scalability of AH and its consistent superiority over baseline attacks. Our findings reveal that attention-based distillation is a double-edged sword, underscoring the urgent need for robust defenses in the era of data-efficient learning.
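To make the mechanism described above concrete, one plausible instantiation of a hijacked bi-level distillation objective is sketched below. This is an illustrative formulation only, not the paper's exact objective: the synthetic set $\mathcal{S}$, the attention maps $A_{\theta}(x)$, the attacker-specified attention targets $A^{\dagger}(x)$ (placing mass on trigger tokens), the alignment loss $\mathcal{L}_{\mathrm{attn}}$, and the trade-off weight $\lambda$ are all assumed symbols introduced here for exposition.

\begin{align}
\min_{\mathcal{S}} \;\;
& \mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{clean}}}
  \Big[\mathcal{L}_{\mathrm{CE}}\big(f_{\theta^{*}(\mathcal{S})}(x),\, y\big)\Big]
  \;+\;
  \lambda\,
  \mathbb{E}_{x\sim\mathcal{D}_{\mathrm{trigger}}}
  \Big[\mathcal{L}_{\mathrm{attn}}\big(A_{\theta^{*}(\mathcal{S})}(x),\, A^{\dagger}(x)\big)\Big] \\
\text{s.t.}\;\;
& \theta^{*}(\mathcal{S}) \;=\; \arg\min_{\theta}\;
  \mathbb{E}_{(\tilde{x},\tilde{y})\sim\mathcal{S}}
  \Big[\mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(\tilde{x}),\, \tilde{y}\big)\Big].
\end{align}

Under this reading, the inner problem is ordinary training on the distilled set, while the outer problem shapes $\mathcal{S}$ so that any model trained on it both preserves clean accuracy and directs its attention toward trigger tokens on triggered inputs; $\lambda$ would control the utility-versus-hijacking trade-off.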