Poster in Workshop: Next Generation of AI Safety
Exploring Scaling Trends in LLM Robustness
Nikolaus Howe · Michał Zając · Ian McKenzie · Oskar Hollinsworth · Pierre-Luc Bacon · Adam Gleave
Keywords: [ Language Model ] [ Scale ] [ Transfer ] [ Adversarial Training ] [ Robustness ] [ LLM ]
Language model capabilities predictably improve with scaling of model size and training data. Motivated by this, increasingly large language models have been trained, yielding an array of impressive capabilities. Yet these models are vulnerable to adversarial prompts, such as "jailbreaks", that hijack them into performing undesired behaviors, posing a significant risk of misuse. Prior work has found that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically and find that larger models respond substantially more effectively to adversarial training, but that there is little to no benefit from model scale in the absence of explicit defenses.