Position: The Open Benchmark Paradox Must Be Resolved through Sovereign Medical Evaluation
Abstract
As medical large language models become increasingly involved in clinical decision-making, public benchmarks are often treated as proxies for deployment readiness. This reliance creates a false sense of security, however, because public scores are frequently computed on data the models have already seen. We call this the Open Benchmark Paradox: publishing evaluation data to accelerate research progress also makes data contamination inevitable, undermining the data's value as a reliable safety signal. The paradox induces three structural failures: (1) hidden contamination, where evaluation independence cannot be proven; (2) outdated standards, where static datasets fail to track evolving medical guidelines; and (3) jurisdictional divergence, where global averaging ignores local legal and ethical requirements. To validate these risks, we audited frontier models on recent medical exam data and found strong evidence of data contamination. To restore integrity to medical evaluation, we propose Sovereign Medical Evaluation (SME). Instead of public leaderboards, SME establishes national infrastructure in which health authorities manage private, isolated evaluation pipelines. Within this secure system, evaluations are updated automatically from live medical data and legal changes, ensuring they remain current and strictly separated from model training. SME thus provides the essential transition to a controlled, auditable, and legally grounded safety gate for medical AI.