ICML BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Poster
in
Workshop: Next Generation of AI Safety

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Diego Dorn · Alexandre Variengien · Charbel-Raphaël Segerie · Vincent Corruble

Keywords: [ safeguard ] [ benchmark ] [ AI Safety ] [ LLM ] [ input-output safeguard ]

[ Abstract ] [ Project Page ]

[ Slides] [ Poster] [ OpenReview]

Abstract:

Currently, there is no widely recognised methodology for the evaluation of input-output safeguards for Large Language Models (LLMs), such as offline evaluation of traces, automated assessments, content moderation, and periodic or real-time monitoring.In this document, we introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS), a structured collection of tests, organised in three categories, for three main goals: (1) established failure tests, based on well-known benchmarks for well-defined failure modes, aiming to compare the performance of current input-output safeguards;(2) emerging failure tests, organised to measure generalisation to never-seen-before failure modes and encourage the development of more general safeguards;(3) next-gen architecture tests, for more complex scaffolding (such as LLM-agents and multi-agent systems), aiming to foster the development of safeguards for future applications for which no safeguard currently exists.Furthermore, we implement and share the first next-gen architecture test, using the MACHIAVELLI environment, along with an interactive visualisation of the dataset.

Chat is not available.

Poster in Workshop: Next Generation of AI Safety

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Diego Dorn · Alexandre Variengien · Charbel-Raphaël Segerie · Vincent Corruble

Poster
in
Workshop: Next Generation of AI Safety