Poster in Workshop: Foundations of Reinforcement Learning and Control: Connections and Perspectives
Certifying robustness to adaptive data poisoning
Avinandan Bose · Madeleine Udell · Laurent Lessard · Maryam Fazel · Krishnamurthy Dvijotham
The rise of foundation models fine-tuned with human feedback from potentially untrusted users has increased the risk of adversarial data poisoning, necessitating the study of the robustness of learning algorithms to such attacks. While existing research focuses on certifying robustness against static adversaries acting on offline datasets, dynamic attack algorithms have been shown to be more effective. Motivated by settings such as RLHF, where models are updated periodically and an adversary can adapt its attack based on the algorithm's behavior, we present a novel framework for computing certified bounds on the impact of dynamic poisoning, and we use these certificates to design robust learning algorithms. We illustrate the framework on the mean-estimation problem.
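To make the mean-estimation illustration concrete, here is a minimal sketch, not the authors' framework: for data bounded in [lo, hi], the empirical mean can shift by at most k(hi - lo)/n when an adversary controls k of n points, and this certificate holds even against an adaptive attacker that observes the running estimate before choosing each poisoned value. All function names (`certified_mean_shift`, `adaptive_attack_mean`) and the greedy attack strategy are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: a simple certificate for mean estimation under
# adaptive poisoning of bounded data. Not the paper's actual algorithm.
import numpy as np

def certified_mean_shift(n, k, lo, hi):
    """Worst-case shift of the empirical mean when k of n points in
    [lo, hi] are adversarial. Replacing one bounded point changes the sum
    by at most (hi - lo), so the bound holds regardless of attack order,
    i.e., even for adaptive adversaries."""
    return k * (hi - lo) / n

def adaptive_attack_mean(clean, k, lo, hi, target):
    """Adaptive adversary: observes the current estimate, then replaces
    each of its k points with the endpoint that pushes the mean toward
    `target` (a hypothetical greedy strategy)."""
    data = clean.copy()
    idx = np.random.default_rng(1).choice(len(data), size=k, replace=False)
    for i in idx:
        est = data.mean()                      # attacker observes current state
        data[i] = hi if target > est else lo   # greedy adaptive choice
    return data

rng = np.random.default_rng(0)
n, k, lo, hi = 1000, 50, 0.0, 1.0
clean = rng.uniform(lo, hi, size=n)
poisoned = adaptive_attack_mean(clean, k, lo, hi, target=hi)

shift = abs(poisoned.mean() - clean.mean())
bound = certified_mean_shift(n, k, lo, hi)
print(f"observed shift {shift:.4f} <= certified bound {bound:.4f}")
assert shift <= bound + 1e-12
```

In this toy setting the certificate is attack-agnostic; the paper's contribution, by contrast, concerns certificates that account for adversaries adapting across periodic model updates, where such simple bounds generally do not suffice.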