Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Thu, Jul 9, 2026 • 7:00 PM – 8:00 PM PDT

MultiVulnBench: A Large-Scale Benchmark for Count Bias in LLM-Based Multi-Vulnerability Detection

Manan Gupta ⋅ Chinmay Pushkar ⋅ Sanchit Kabra ⋅ Dhruv Kumar ⋅ Jagat S Challa

Abstract

Large Language Models (LLMs) achieve near-perfect performance on single-vulnerability detection yet suffer a systematic, underexplored failure when files contain multiple co-located vulnerabilities: recall collapses as vulnerability density grows, a phenomenon we term $\textbf{count bias}$. Existing benchmarks frame detection as binary classification of individual functions and cannot expose this failure mode. We present $\textbf{MultiVulnBench}$, the first large-scale benchmark designed to measure count bias in LLM-based vulnerability detection. MultiVulnBench comprises $\textbf{20,000 files}$ across four languages (Python, C, C++, JavaScript) at five controlled density levels ($N \in \{0,1,3,5,9\}$ vulnerabilities/file), on which we evaluate five state-of-the-art LLMs under zero-shot prompting. We introduce the $\textbf{ExactFile}$ metric, the fraction of files where the model identifies all vulnerabilities correctly, which captures complete audit accuracy better than $F_1$ alone. Our central finding is that count bias is both universal and catastrophic: at $N=9$, ExactFile accuracy falls to single digits for every model and language, regardless of model size or family. Mistral-3.2-24B achieves $F_1=0.974$ with $95.8\%$ ExactFile on JavaScript at $N=1$; by $N=9$ this collapses to $F_1=0.577$ (a $41\%$ relative drop) with ExactFile of $5.2\%$, meaning the model produces a complete, correct audit only about 1 time in 20. All five models share the same failure signature: precision stays near $1.0$ while recall collapses, confirming systematic under-prediction rather than misclassification. Count error, measured as the Mean Absolute Error (MAE) of predicted vulnerability counts, grows monotonically with $N$ for all models. We additionally expose a dataset composition pathology, CWE homogeneity at specific density levels, that inflates apparent performance and must be controlled in future benchmark design.
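
To make the metrics named in the abstract concrete, here is a minimal sketch of how ExactFile, precision, recall, $F_1$, and count MAE could be computed from per-file predictions. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `evaluate`, the representation of each vulnerability as a hashable label such as `"CWE-79@12"`, exact set equality as the matching rule, and micro-averaging over files are all hypothetical choices, since the abstract does not specify them.

```python
from typing import Dict, Hashable, Sequence, Set


def evaluate(gold: Sequence[Set[Hashable]],
             pred: Sequence[Set[Hashable]]) -> Dict[str, float]:
    """Score per-file vulnerability predictions against ground truth.

    Assumptions (not taken from the paper): each file is a set of hashable
    vulnerability labels, a prediction counts only on an exact label match,
    and precision/recall are micro-averaged over all files.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    tp = fp = fn = exact = abs_count_err = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)                        # vulnerabilities correctly reported
        fp += len(p - g)                        # reported but not present
        fn += len(g - p)                        # present but missed
        exact += int(g == p)                    # ExactFile: the whole file is right
        abs_count_err += abs(len(p) - len(g))   # error in the predicted count

    # Convention assumed here: with nothing predicted (or nothing to find),
    # the corresponding ratio is defined as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    n = len(gold)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_file": exact / n,
        "count_mae": abs_count_err / n,
    }


# Toy data: the model finds one of three vulnerabilities in the dense file
# and the single vulnerability in the sparse file.
gold_files = [{"CWE-79@12", "CWE-89@30", "CWE-22@41"}, {"CWE-78@5"}]
pred_files = [{"CWE-79@12"}, {"CWE-78@5"}]
scores = evaluate(gold_files, pred_files)
print(scores)
```

On this toy input every reported vulnerability is real, so precision is 1.0, but half of the true vulnerabilities are missed, so recall is 0.5 and only one of the two files counts toward ExactFile. This is the under-prediction signature the abstract describes: $F_1$ degrades gradually while ExactFile drops sharply, because a single missed vulnerability fails the entire file.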