

Poster

Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation

Randall Balestriero · Romain Cosentino · Sarath Shekkizhar


Abstract: Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations, e.g., how to extract a few informative features to solve a downstream task. To provide a principled and practical solution, we study the transformer architecture in LLMs from a geometric perspective. We obtain in closed form (i) the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and (ii) the partition and per-region affine mappings of the feedforward (MLP) network. Our results are informative, do not rely on approximations, and are actionable. First, we show that, through our geometric understanding, we can bypass Llama 2's RLHF by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive 7 interpretable spline features that can be extracted from any (pre-trained) LLM layer, providing a rich abstract representation of their inputs. Moreover, we observe that these features are sufficient to help solve toxicity detection, infer the domain of the prompt, and even tackle the Jigsaw challenge (identifying various types of toxicity). Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in language models.
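For intuition on point (ii), here is a minimal sketch of the spline view of a feedforward block (illustrative notation only; Ω, A_ω, and b_ω are generic symbols, not the paper's exact closed-form quantities). A ReLU-type MLP is a continuous piecewise-affine operator: its input space is partitioned into regions, on each of which the block reduces to a single affine map,

$$\mathrm{MLP}(x) \;=\; \sum_{\omega \in \Omega} \mathbf{1}\{x \in \omega\}\,\bigl(A_{\omega}\, x + b_{\omega}\bigr),$$

where Ω denotes the partition induced by the activation patterns and (A_ω, b_ω) are the per-region slope and offset. Geometric descriptors of this sort, such as which region an embedding falls into and the local affine parameters there, are the kind of layer-level quantities that spline features can summarize.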
