Poster in Workshop: Next Generation of AI Safety

Exploring Scaling Trends in LLM Robustness

Nikolaus Howe · Michał Zając · Ian McKenzie · Oskar Hollinsworth · Pierre-Luc Bacon · Adam Gleave

Keywords: [ Language Model ] [ Scale ] [ Transfer ] [ Adversarial Training ] [ Robustness ] [ LLM ]


Abstract:

Language model capabilities predictably improve from scaling the model's size and training data. Motivated by this, increasingly large language models have been trained, yielding an array of impressive capabilities. Yet these models suffer from adversarial prompts such as "jailbreaks" that hijack models to perform undesired behavior, posing a significant risk of misuse. Prior work has found that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically, finding that larger models respond substantially more effectively to adversarial training, but there is little to no benefit from model scale in the absence of defenses.
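The intervention the abstract centers on is adversarial training: fine-tuning on inputs that an attack has optimized to elicit failures. The abstract does not specify the attack, model family, or data used, so the PyTorch sketch below is purely illustrative: it pairs a toy stand-in classifier with a random-token suffix attack, and every name in it (ToyLM, random_token_attack, the vocabulary and suffix sizes) is a hypothetical placeholder rather than the authors' setup.

```python
# Minimal sketch of adversarial training on a toy sequence classifier.
# Illustrative assumptions only: the model, attack, and data are stand-ins,
# not the paper's actual method.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, N_SUFFIX = 100, 16, 4

class ToyLM(nn.Module):
    """Tiny embedding + mean-pool classifier standing in for an LLM."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, 2)

    def forward(self, ids):  # ids: (batch, seq)
        return self.head(self.emb(ids).mean(dim=1))

def random_token_attack(model, ids, labels, n_trials=32):
    """Random search over suffix tokens, keeping the worst case per example."""
    best = ids.clone()
    worst_loss = torch.full((ids.size(0),), -float("inf"))
    for _ in range(n_trials):
        cand = ids.clone()
        cand[:, -N_SUFFIX:] = torch.randint(VOCAB, (ids.size(0), N_SUFFIX))
        with torch.no_grad():
            loss = nn.functional.cross_entropy(
                model(cand), labels, reduction="none")
        improved = loss > worst_loss
        best[improved] = cand[improved]
        worst_loss = torch.maximum(worst_loss, loss)
    return best

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    ids = torch.randint(VOCAB, (8, SEQ_LEN))  # placeholder "prompts"
    labels = torch.randint(2, (8,))           # placeholder safe/unsafe labels
    adv_ids = random_token_attack(model, ids, labels)
    # Adversarial training step: fit clean and attacked examples together.
    loss = loss_fn(model(torch.cat([ids, adv_ids])), labels.repeat(2))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A scaling study of the kind the abstract describes would repeat a loop like this across model sizes and compare attack success rates with and without the adversarial-training step; the sketch shows only the inner training loop.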
