Reasoning Models Are Test Exploiters: Rethinking Multiple Choice
Narun Raman ⋅ Taylor Lundy ⋅ Kevin Leyton-Brown
Abstract
When evaluating Large Language Models (LLMs) in question-answering domains, multiple-choice question answering (MCQA) is widely used because it enables automatic grading. However, MCQA also exposes models to answer options that can be exploited in ways that inflate apparent reasoning ability. We study this phenomenon across $15$ question-answering benchmarks and $27$ LLMs by systematically varying how and when models are exposed to answer options. For non-reasoning LLMs, MCQA can remain a good proxy for free-text performance when any chain-of-thought is produced only before the options are revealed. However, this "decoupled" format is not realizable for most reasoning models: they are designed to emit reasoning tokens whenever they are prompted, so if options are present they inevitably "reason over" them. In practice, this makes reasoning models particularly effective at extracting signal from options and can create large, misleading gains over free-text baselines. To characterize how models exploit MCQA, we introduce diagnostic probes that isolate option-only and question-plus-option exploitation pathways, and we quantify how design choices such as distractor strength and "none-of-the-above" answers affect exploitability. We also examine the practice of using multiple choice as an error diagnostic: inferring a model's mistake from the wrong option it selects. On benchmarks where reasoning can be expressed as code, we instead ask models to output code, execute that code on varied inputs, and compare the resulting input–output behavior, revealing failure modes that MCQA diagnostics obscure. Finally, we offer practical guidelines for analyzing MCQA results so that they better reflect LLMs' genuine reasoning capabilities.
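To make the "decoupled" format concrete, the sketch below shows one plausible two-stage prompting protocol (a minimal illustration, not the paper's released code): the model's chain-of-thought is elicited before any options are shown, and only then is it asked to choose among them. The `query_llm` helper is a hypothetical wrapper around whatever chat-completion API is in use.

```python
from typing import List


def query_llm(messages: List[dict]) -> str:
    """Hypothetical call to an LLM chat API; replace with a real client."""
    raise NotImplementedError


def decoupled_mcqa(question: str, options: List[str]) -> str:
    # Stage 1: elicit free-text reasoning with no options visible,
    # so the chain-of-thought cannot exploit the answer choices.
    cot_prompt = f"{question}\n\nThink step by step and state your answer."
    reasoning = query_llm([{"role": "user", "content": cot_prompt}])

    # Stage 2: reveal the options and ask for a final letter choice,
    # conditioning only on the already-produced reasoning.
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    choice = query_llm([
        {"role": "user", "content": cot_prompt},
        {"role": "assistant", "content": reasoning},
        {"role": "user",
         "content": f"Options:\n{labeled}\n\nReply with a single letter."},
    ])
    return choice.strip()
```

As the abstract notes, this protocol is only workable for models whose reasoning can be confined to the first stage; reasoning models that emit reasoning tokens on every call will also reason over the options in the second stage.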