Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Abstract
Large Language Models (LLMs) are increasingly employed as automated judges for evaluating generative models. However, their known stylistic biases, such as a preference for verbosity or for specific sentence structures, constitute an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits that mislead the judge and artificially inflate its scores. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score, without access to model parameters or gradients. Theoretically, we prove a formal regret guarantee for BITE, showing that it efficiently learns to manipulate a judge even in the realistic setting of model misspecification. Empirically, we evaluate BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves attack success rates above 65% and raises scores by 1–2 points on a 9-point scale, while maintaining semantic equivalence. We further uncover model-specific "vulnerability fingerprints": judges differ in their sensitivity to sentiment, register, and structural cues (e.g., headers), which limits cross-model transferability of attacks. Finally, we evaluate stealthiness and show that BITE evades standard style-control and simple detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation protocols, e.g., style normalization, randomized prompting, and adversarial training of judges.
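To make the contextual-bandit framing concrete, below is a minimal sketch of LinUCB-based edit selection in the style the abstract describes: each arm is one stylistic edit, the context is a feature vector for the current response, and the reward is the observed change in the judge's score. The class name, feature representation, and reward definition are illustrative assumptions, not BITE's actual implementation.

```python
import numpy as np

class LinUCBEditSelector:
    """Illustrative disjoint-model LinUCB over a fixed catalog of stylistic edits.

    Assumptions (hypothetical, not from the paper): n_edits candidate
    semantics-preserving edits, a d-dimensional context vector per response,
    and a scalar reward equal to the judge-score change after the edit.
    """

    def __init__(self, n_edits: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha  # exploration strength
        # Per-arm ridge-regression statistics: A = I + sum(x x^T), b = sum(r x)
        self.A = [np.eye(dim) for _ in range(n_edits)]
        self.b = [np.zeros(dim) for _ in range(n_edits)]

    def select(self, x: np.ndarray) -> int:
        """Choose the edit with the highest upper confidence bound."""
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # estimated reward weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            ucbs.append(theta @ x + bonus)
        return int(np.argmax(ucbs))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Record the judge-score change observed for the chosen edit."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In this black-box loop, only the judge's returned score is observed per round, which matches the abstract's no-gradient, no-parameter threat model; the policy learns which edits a given judge rewards directly from those scores.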