Probabilistic Robustness Certificates against Adversarial Attacks
Sara Taheri ⋅ Majid Zamani
Abstract
The growing use of machine learning in safety-critical settings heightens vulnerability to *adversarial attacks*. Existing defense mechanisms typically either lack formal guarantees or depend on restrictive assumptions about the model family, the threat model, or the poisoning budget, and many offer only point-wise certification. Importantly, they often overlook the inherent stochasticity of modern training pipelines, which undermines their practical reliability. We introduce a probabilistic framework that views gradient-based training as a *discrete-time stochastic dynamical system* and formulates robustness to data poisoning as a safety verification task. Leveraging *barrier certificates* (BCs), we derive sufficient conditions that probabilistically certify a robustness radius against worst-case $\ell_p$-bounded poisoning, guaranteeing that the final model parameters remain within a safe set. For tractable computation, we represent BCs with neural networks and obtain *probably approximately correct* (PAC) guarantees through a *scenario convex problem*. Our method identifies the largest certified radius for which the trained model remains accurate with a specified confidence level. Experiments on MNIST, SVHN, and CIFAR-10 show that our framework provides formal robustness guarantees under stochastic training while being model-agnostic and requiring no prior knowledge of the attack strategy.
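For context, barrier certificates for discrete-time stochastic systems are commonly stated along the following lines; this is the classical supermartingale-based formulation from the stochastic safety-verification literature, given here only as an illustrative sketch and not necessarily the exact conditions derived in the paper. Writing the training dynamics as $\theta_{k+1} = f(\theta_k, w_k)$ with initial set $\Theta_0$, unsafe set $\Theta_u$, and horizon $T$, a nonnegative function $B$ satisfying

$$
B(\theta) \le \gamma \quad \forall \theta \in \Theta_0, \qquad
B(\theta) \ge \lambda \quad \forall \theta \in \Theta_u, \qquad
\mathbb{E}\big[B(f(\theta, w)) \,\big|\, \theta\big] \le B(\theta) + c \quad \forall \theta,
$$

with $\lambda > \gamma \ge 0$ and $c \ge 0$, yields the probabilistic safety bound

$$
\mathbb{P}\Big\{\exists\, k \le T:\ \theta_k \in \Theta_u \;\Big|\; \theta_0 \in \Theta_0\Big\} \;\le\; \frac{\gamma + cT}{\lambda}.
$$

In this reading, certifying a poisoning radius amounts to showing that, for all perturbations within the radius, the trained parameters avoid the unsafe set with at least the stated probability.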