Eliminating Discriminative Shortcuts in Multiple Choice Evaluations with Answer Matching
Abstract
Multiple-choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show that popular multiple-choice benchmarks admit superficial shortcuts that yield high accuracy without even looking at the questions, reflecting a fundamental limitation of discriminative evaluation that is not shared by evaluations of the model’s free-form, generative answers. To circumvent this issue, we consider a scalable method for generative evaluation, which we call answer matching: give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine whether the response matches the reference. Comparing multiple choice, “LLM-as-judge” without references, and answer-matching evaluations against human grading, we find that multiple choice aligns poorly with humans, while answer matching using recent models, even small ones, achieves near-perfect alignment within inter-grader agreement. In light of this, we discuss how to move the evaluation format from multiple choice to answer matching.
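As a rough illustration of the answer-matching protocol described above, the following Python sketch grades one free-form response against a reference answer. The names `candidate`, `grader`, and `answer_matching`, as well as the grading prompt, are illustrative placeholders rather than the paper’s actual implementation; the two callables stand in for whatever model inference API is used.

```python
def answer_matching(question: str, reference: str, candidate, grader) -> bool:
    """Grade one free-form answer against a reference using a grader LM.

    `candidate` and `grader` are callables mapping a prompt string to the
    model's text completion (placeholders for a real inference API).
    """
    # 1. The candidate model sees only the question -- no answer options.
    response = candidate(question)

    # 2. The grader model compares the free-form response to the reference
    #    answer and returns a yes/no verdict.
    verdict = grader(
        "Question: {q}\n"
        "Reference answer: {ref}\n"
        "Candidate answer: {ans}\n"
        "Does the candidate answer match the reference? Reply Yes or No."
        .format(q=question, ref=reference, ans=response)
    )
    return verdict.strip().lower().startswith("yes")


def accuracy(items, candidate, grader) -> float:
    """Fraction of (question, reference) pairs judged as matching."""
    matches = [answer_matching(q, ref, candidate, grader) for q, ref in items]
    return sum(matches) / len(matches) if matches else 0.0
```

In this sketch the candidate never sees the answer options, which is what removes the discriminative shortcut; only the grader has access to the reference answer.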