Position: Benchmarks Do Not Measure Deployment Readiness in Clinical AI
Abstract
Despite large language models (LLMs) achieving impressive performance on benchmark tasks such as medical question answering, their real-world utility remains limited. We argue that while benchmarks play a valuable role in method development and in filtering promising models, they often tell us very little about deployment readiness. Many health AI systems with strong retrospective accuracy have failed in practice, while others with modest benchmark performance have delivered meaningful clinical benefit. We detail the limitations of benchmark-centric evaluation of deployment readiness. We argue that benchmarks should be used only to identify candidate methods or models, not to justify deployment. We call for greater use of prospective studies and for policy changes that align incentives with clinically grounded evaluation.