TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models
Abstract
Large vision-language models (LVLMs) excel at vision-language tasks but remain vulnerable to backdoor attacks. Most existing backdoor attacks on LVLMs force the model to generate predefined target patterns. However, these fixed-pattern attacks are easy to detect, as the model tends to memorize frequent patterns and exhibits overconfidence on the target outputs given poisoned inputs. To address these limitations, we introduce TokenSwap, a more evasive and stealthy backdoor attack that targets the \emph{compositional understanding} capabilities of LVLMs. Instead of enforcing fixed target content, TokenSwap subtly disrupts the understanding of object relationships in text. Specifically, it causes the backdoored model to generate outputs that mention the correct objects in the image but misrepresent their relationships (i.e., bags-of-words behavior). During training, TokenSwap injects a visual trigger into selected samples while swapping the grammatical roles of key tokens in the textual answers. Since the poisoned samples differ only subtly from clean ones, TokenSwap employs an adaptive token-weighted loss that emphasizes learning on the swapped tokens, strengthening the association between the visual trigger and the bags-of-words behavior. Extensive experiments demonstrate that TokenSwap achieves high attack success rates while maintaining evasiveness and stealthiness across multiple benchmarks and LVLM architectures.
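As a rough illustration of the mechanism the abstract describes, the sketch below shows the two poisoning ingredients (trigger injection into the image and role-token swapping in the answer) plus a token-weighted cross-entropy loss that up-weights the swapped positions. The function names, sample layout, and the weighting factor `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def inject_trigger(image: torch.Tensor, trigger: torch.Tensor,
                   top: int = 0, left: int = 0) -> torch.Tensor:
    """Paste a small visual trigger patch onto an image tensor of shape (C, H, W)."""
    poisoned = image.clone()
    c, h, w = trigger.shape
    poisoned[:, top:top + h, left:left + w] = trigger
    return poisoned

def swap_role_tokens(tokens: list[str], i: int, j: int) -> list[str]:
    """Swap two tokens whose grammatical roles are exchanged, e.g. turning
    'a man rides a horse' into 'a horse rides a man' (indices i, j assumed
    to point at the subject and object tokens)."""
    swapped = tokens.copy()
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

def token_weighted_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                        swap_mask: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Cross-entropy over answer tokens with a larger weight (alpha, an assumed
    hyperparameter) on the swapped positions marked True in swap_mask.

    logits: (B, T, V), target_ids: (B, T), swap_mask: (B, T) bool.
    """
    # Per-token loss; cross_entropy over 3D input expects class dim second.
    per_token = F.cross_entropy(logits.transpose(1, 2), target_ids, reduction="none")
    weights = torch.where(swap_mask,
                          torch.full_like(per_token, alpha),
                          torch.ones_like(per_token))
    return (weights * per_token).sum() / weights.sum()
```

The up-weighting matters because the swapped answer differs from the clean one in only a few positions; a plain cross-entropy averaged over all tokens would dilute the training signal tying the trigger to the bags-of-words output.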