

Poster

Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Xavi Suau · Pieter Delobelle · Rin Susa · Armand Joulin · Nicholas Apostoloff · Luca Zappella · Pau Rodriguez


Abstract: An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be identified by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose WhispX, an intervention that can be applied to any pre-trained LLM to mitigate toxicity. Because the intervention is proportional to each neuron's ability to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that WhispX achieves up to a $2.1\times$ reduction in toxicity with only a $0.49$ increase in perplexity. We also show that WhispX is effective across models of different scales (from 1.5B to 40B parameters) and that, at every scale, it mitigates toxic language while preserving common-sense zero-shot abilities. WhispX can be combined with pre-prompting strategies, boosting its average mitigation potential from $1.29\times$ to $2.39\times$. Moreover, WhispX can counteract adversarial pre-prompts that maliciously elicit toxic content, making it an effective method for deploying safer and less toxic models.
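To make the intervention concrete, here is a minimal sketch of the idea described in the abstract: score each neuron by how well its sentence-level activations separate toxic from non-toxic text, then dampen its activation proportionally to that score at inference time. This is an illustrative assumption, not the authors' released implementation: the use of per-neuron AUROC as the discrimination measure, the specific gain formula, and all function names are hypothetical.

```python
import torch
from sklearn.metrics import roc_auc_score

def discrimination_power(activations: torch.Tensor, labels) -> torch.Tensor:
    """Per-neuron power to discriminate toxic sentences.

    activations: (num_sentences, num_neurons) tensor of per-sentence
    neuron activations (e.g. max-pooled over tokens); labels: 1 for
    toxic sentences, 0 otherwise. Returns one score in [0, 1] per
    neuron, where 0.5 means no discrimination power.
    """
    acts = activations.cpu().numpy()
    return torch.tensor([
        roc_auc_score(labels, acts[:, j]) for j in range(acts.shape[1])
    ])

def damping_gains(power: torch.Tensor) -> torch.Tensor:
    """Map discrimination power to a multiplicative gain in [0, 1]
    (hypothetical formula): neurons at or below chance (score <= 0.5)
    are left untouched, while perfectly discriminating neurons are
    silenced. No model-dependent hyperparameters are involved.
    """
    return 1.0 - 2.0 * torch.clamp(power - 0.5, min=0.0)

def intervene(hidden: torch.Tensor, gains: torch.Tensor) -> torch.Tensor:
    """Scale each neuron's activation by its gain at inference time.

    hidden: (..., num_neurons) activations of one layer; in practice
    this would be applied inside a forward hook on the pre-trained LLM.
    """
    return hidden * gains
```

In this sketch the gains are computed once, offline, from a labeled set of toxic and non-toxic sentences, and the generation-time intervention is a single element-wise multiplication per layer, which is consistent with the abstract's claim that the method applies to any pre-trained LLM without retraining.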
