ClinAuth: A Multi-Agent Benchmark and Dataset for Quantifying Omission Bias and Authority Deference in Hierarchical Medical Large Language Models
Abstract
Large Language Models (LLMs) are increasingly used in clinical decision support and electronic health record (EHR) systems, where they achieve strong performance on medical benchmarks and instruction-following tasks. However, such performance does not ensure safe behavior in hierarchical, high-stakes settings. Alignment methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) can encourage sycophantic behavior, leading to a failure mode we term clinical omission bias, in which a model stays silent and thereby permits a harmful decision in order to avoid conflict or penalties. Conversely, Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) lead to unnecessary and unfounded interventions that block appropriate treatment, a failure mode we term clinical commission bias. We introduce the Clinical Authority Family, a multi-agent framework for evaluating LLM behavior under authority pressure that includes the ClinAuth Benchmark and the ClinAuth-800 dataset. We simulate a hospital setting in which a model evaluates treatment decisions proposed by an attending physician and controls EHR access. By varying physician tone, system-level incentives, and weak-to-strong oversight, we study the effects of alignment methods across frontier models. \textbf{In 20,000 simulations, frontier models falsely refuse correct treatments in 17.08\% of control cases and allow fatal clinical errors in 37.33\% of harmful cases, resulting in an overall harmful EHR interaction rate of 27.17\% across all cases.}