OSCS: Online Selection with Provable FAR Control for LLM Safety
Abstract
Large language models (LLMs) are vulnerable to malicious inputs, posing serious risks in high-stakes applications. While existing detection-based defenses have shown strong empirical performance, they do not explicitly control the false acceptance rate (FAR), i.e., the fraction of malicious inputs that are mistakenly accepted, which is a critical safety metric in sensitive scenarios. This limitation is compounded by two practical challenges: the lack of access to known malicious samples for calibration, and the dynamic, online nature of real-world data streams. To address these challenges, we propose \textit{OSCS}, a framework that enables online FAR control without relying on malicious calibration samples. OSCS takes the detection scores produced by existing defenses and applies recursive density estimation to infer benign likelihoods directly from the test stream, allowing it to make real-time accept/reject decisions that respect a user-specified FAR level. We prove that, under mild conditions, OSCS controls the FAR at the target level up to a vanishing excess term. Extensive experiments on backdoor and jailbreak attack scenarios further validate its effectiveness, showing that OSCS consistently achieves robust FAR control across a variety of tasks and attack settings. These results underscore the practicality of OSCS for safety-critical LLM deployments.
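To make the abstract's decision loop concrete, the sketch below illustrates the general shape of score-based online selection: a benign score density is estimated recursively from the stream itself, and each incoming sample is accepted only if it falls in a high-benign-likelihood region. All names (`kde`, `online_select`) and parameters (`alpha`, `warmup`, the bandwidth `h`, and the quantile-based threshold) are illustrative assumptions, not the paper's calibrated procedure; in particular, how the target level `alpha` is mapped to a threshold with a provable FAR guarantee is precisely what OSCS's theory supplies.

```python
import numpy as np

def kde(history, s, h=0.1):
    """Gaussian kernel density estimate of the benign score density at s."""
    z = (s - np.asarray(history)) / h
    return float(np.mean(np.exp(-0.5 * z ** 2)) / (h * np.sqrt(2.0 * np.pi)))

def online_select(score_stream, alpha=0.05, warmup=50):
    """Accept/reject each detection score in an online stream.

    `alpha` is the user-specified FAR level. The quantile rule below is a
    placeholder: deriving a threshold that provably controls the FAR is the
    contribution of the actual method, not reproduced here.
    """
    history, decisions = [], []
    for s in score_stream:
        if len(history) < warmup:
            accept = True  # bootstrap phase: treat the early stream as benign
        else:
            # Density-level threshold: the alpha-quantile of estimated benign
            # densities over the history defines the high-likelihood region.
            tau = np.quantile([kde(history, x) for x in history], alpha)
            accept = kde(history, s) >= tau
        if accept:
            history.append(s)  # recursive update of the benign estimate
        decisions.append(accept)
    return decisions
```

Recomputing the threshold at every step keeps the sketch simple but is quadratic in the stream length; a practical implementation would update the density estimate and threshold incrementally.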