

Poster

Causality Based Front-door Defense Against Backdoor Attack on Language Model

Yiran Liu · Xiaoang Xu · Zhiyi Hou · Yang Yu


Abstract:

We develop a new framework based on causal inference to protect language models against backdoor attacks. Backdoor attackers can poison language models with different types of triggers, such as words, sentences, grammar, and style, enabling them to selectively modify the victim model's decision-making. However, existing defense approaches are effective only when the backdoor attack matches specific assumptions about its form, making it difficult to counter diverse backdoor attacks. We propose a new defense framework, Front-door Adjustment for Backdoor Elimination (FABE), based on causal reasoning that does not rely on assumptions about the form of triggers. The method differentiates between spurious and legitimate associations by creating a "front door" that captures the true causal relationships. Here, the "front door" is a text that is semantically equivalent to the original input, generated by an additional fine-tuned language model, which we call the defense model. In experiments against attacks at the token, sentence, and syntactic levels, FABE reduced the attack success rate from 93.63% to 15.12%, improving the defense effect by 2.91 times over the best baseline result of 66.61% and achieving state-of-the-art results. Ablation studies analyze the effect of each module in FABE and demonstrate the importance of satisfying the front-door criterion and the front-door adjustment formula, which also explains why previous methods fail. We plan to make our code publicly available upon publication.
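For reference, front-door adjustment in causal inference estimates the effect of an input X on a prediction Y through a mediator M (here, the semantically equivalent text produced by the defense model). The standard form of the adjustment formula, with the symbols X, M, Y chosen for illustration rather than taken from the paper, is:

P(Y \mid do(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid X = x', M = m)\, P(X = x')

Intuitively, averaging over the mediator retains only the association that flows through the input's semantics, discounting spurious associations that a trigger introduces outside that path.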
