Adaptive Generation of Bias-Eliciting Questions for LLMs
Abstract
Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions of users worldwide. Alongside this adoption, growing reliance on their outputs raises significant concerns, particularly as users may be exposed to model-inherent biases that disadvantage or stereotype certain groups. However, existing bias benchmarks commonly rely on simple templated prompts or restrictive multiple-choice questions that fail to capture the complexity of real-world user interactions. In this work, we address this gap by introducing a counterfactual framework that automatically generates realistic, open-ended questions for LLM bias evaluation. Through iterative question mutation, our approach further systematically explores the areas in which models are most prone to biased behavior. Beyond detecting harmful biases, we also capture increasingly relevant response dimensions, such as asymmetric refusals and explicit bias acknowledgment. Building on this, we construct CAB, a diverse, human-verified benchmark for realistic and nuanced bias evaluation of current frontier LLMs. Our evaluation using CAB underscores the continued need for fairness research: every examined model exhibits persistent biases in certain scenarios.
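To make the counterfactual generate-and-mutate loop described above concrete, the following is a minimal sketch, not the paper's actual algorithm. All names (ask_model, bias_score, mutate_question) are hypothetical placeholders supplied by the caller, and the greedy hill-climbing search is an illustrative assumption about how "iterative question mutation" could steer toward bias-eliciting questions.

```python
# Illustrative sketch of a counterfactual generate-and-mutate loop.
# ask_model, bias_score, and mutate_question are hypothetical callables
# supplied by the caller; the greedy search strategy is an assumption,
# not the method described in the paper.
from typing import Callable

def counterfactual_gap(
    question_template: str,
    groups: tuple[str, str],
    ask_model: Callable[[str], str],
    bias_score: Callable[[str, str], float],
) -> float:
    """Pose the same open-ended question for two demographic groups and
    score how differently the model responds (e.g., asymmetric refusals,
    stereotyping)."""
    q_a = question_template.format(group=groups[0])
    q_b = question_template.format(group=groups[1])
    return bias_score(ask_model(q_a), ask_model(q_b))

def search_bias_eliciting_question(
    seed_template: str,
    groups: tuple[str, str],
    ask_model: Callable[[str], str],
    bias_score: Callable[[str, str], float],
    mutate_question: Callable[[str], str],
    n_iterations: int = 20,
) -> tuple[str, float]:
    """Greedy hill climbing: keep a mutation only if it widens the
    counterfactual gap, steering the search toward questions the model
    handles least symmetrically."""
    best = seed_template
    best_gap = counterfactual_gap(best, groups, ask_model, bias_score)
    for _ in range(n_iterations):
        candidate = mutate_question(best)
        gap = counterfactual_gap(candidate, groups, ask_model, bias_score)
        if gap > best_gap:
            best, best_gap = candidate, gap
    return best, best_gap
```

Under this reading, the seed template would contain a {group} slot (e.g., a realistic advice-seeking question), and the questions accumulated by the search, after human verification, would form benchmark entries such as those in CAB.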