DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs
Art Kanke
Abstract
Whether large language models can be prompted to generate rhetorical fallacies on demand, and whether current safety post-training constrains this behavior, has received less attention than the related question of detecting fallacies in existing text. We close this gap with DeflectBench, evaluating $23{,}990$ generations from four frontier models across three deflection strategies (whataboutism, ad hominem, red herring), seven prompt framings, and $80$ claims spanning four controversy levels. Refusal is governed primarily by request structure rather than claim content: per-claim refusal varies by only $11$ percentage points across the $80$ claims, while a single prompt-frame change can swing within-model refusal by nearly $100$ percentage points. An educational debate-coach framing collapses refusal to near zero across all four model families, but the resulting behavior is not clean compliance. Instead, models typically produce labeled compliance, naming the requested manipulation in the same response that contains it. The four models distribute differently across refusal, labeled compliance, soft refusal, and clean compliance, suggesting that alignment policies on rhetorical manipulation diverge across laboratories rather than converging on a single safety norm.