Workshop
2nd ICML Workshop on New Frontiers in Adversarial Machine Learning
Sijia Liu · Pin-Yu Chen · Dongxiao Zhu · Eric Wong · Kathrin Grosse · Baharan Mirzasoleiman · Sanmi Koyejo
Ballroom A
Given the success of AdvML-inspired research, we propose a new edition of our ICML'22 workshop (AdvML-Frontiers'22): 'The 2nd Workshop on New Frontiers in AdvML' (AdvML-Frontiers'23). We aim for a high-quality international workshop, coupled with new scientific activities, networking opportunities, and enjoyable social events. Scientifically, we aim to identify the challenges and limitations of current AdvML methods and to explore new perspectives and constructive views for next-generation AdvML across the full theory/algorithm/application stack. As the sequel to AdvML-Frontiers'22, we will continue exploring the frontiers of AdvML in theoretical understanding, scalable algorithm and system designs, and scientific development that transcends traditional disciplinary boundaries. We will also add new features and programs in 2023. First, we will expand the existing research themes, particularly in light of the popularity of large foundation models (e.g., DALL-E 2, Stable Diffusion, and ChatGPT). Example topics include AdvML for prompt learning, counteracting AI-synthesized fake images and text, debugging ML from unified data-model perspectives, and 'green' AdvML for environmental sustainability. Second, we will organize a new session, AI Trust in Industry, inviting industry experts to introduce practical trends in AdvML, technological innovations, products, and societal impacts (e.g., AI's responsibility). Third, we will host Show-and-Tell Demos in the poster session to allow demonstrations of innovations by research and engineering groups from industry, academia, and government. Fourth, we will collaborate with 'Black in AI' (where co-organizer Dr. Sanmi Koyejo serves as president) to increase the presence and inclusion of Black people in the field of AdvML by creating spaces for sharing ideas and networking.
Schedule
Fri 11:50 a.m. - 12:00 p.m. | Opening
The opening remarks of the workshop.
Fri 12:00 p.m. - 12:30 p.m. | Una-May O'Reilly (Keynote)
Bio: Una-May O'Reilly is the leader of the ALFA Group at MIT-CSAIL. An AI and machine learning researcher for 20+ years, she is broadly interested in artificial adversarial intelligence -- the notion that competition has complex dynamics due to learning and adaptation signaled by experiential feedback. This interest directs her to the topic of security, where she has developed machine learning algorithms that variously consider the arms races of malware, network, and model attacks and the uses of adversarial inputs on deep learning models. Her passions are evolutionary computation and programming, which frequently lead her to investigate Genetic Programming, as well as the coevolutionary dynamics between populations of cooperative agents or adversaries, in settings as general as cybersecurity and machine learning.
Talk: Adversarial Intelligence Supported by Machine Learning
Abstract: My interest is in computationally replicating the behavior of adversaries who aim algorithms/code/scripts at vulnerable targets, and of the defenders who try to stop the threats. I typically consider networks as targets, but let's consider the most recent ML models -- foundation models. How do goals blur in the current context, where the community is trying to simultaneously address their safety and security?
Fri 12:30 p.m. - 1:00 p.m. | Lea Schönherr (Keynote)
Bio: Lea Schönherr has been a tenure-track faculty member at the CISPA Helmholtz Center for Information Security since 2022. She obtained her PhD from Ruhr-Universität Bochum, Germany, in 2021 and is a recipient of two fellowships from UbiCrypt (DFG Graduate School) and CASA (DFG Cluster of Excellence). Her research interests are in information security with a focus on adversarial machine learning and generative models to defend against real-world threats. She is particularly interested in language as an interface to machine learning models and in combining different domains such as audio, text, and images. She has published several papers on threat detection and defense for speech recognition systems and generative models.
Title: Brave New World: Challenges and Threats in Multimodal AI Agent Integrations
Abstract: AI agents are on the rise, becoming more integrated into our daily lives, and will soon be indispensable for countless downstream tasks, be it translation, text enhancement, summarization, or other assisting applications like code generation. As of today, the human-agent interface is no longer limited to plain text, and large language models (LLMs) can handle documents, videos, images, audio, and more. In addition, the generation of various multimodal outputs is becoming more advanced and realistic in appearance, allowing for more sophisticated communication with AI agents. In the future in particular, interactions with AI agents will rely on a more natural-feeling voice interface. In this presentation, we will take a closer look at the resulting challenges and security threats associated with integrated multimodal AI agents, which fall into two categories: malicious inputs used to jailbreak LLMs, and computer-generated output that is indistinguishable from human-generated content. In the first case, specially designed inputs are used to exploit an LLM or its embedding system, also referred to as prompt hacking. Existing attacks show that content filters of LLMs can easily be bypassed with specific inputs and that private information can be leaked. Additional input modalities, such as speech, allow for a much broader potential attack surface that needs to be investigated and protected. In the second case, generative models are used to produce fake content that is nearly impossible to distinguish from human-generated content. Such fake content is often used for fraud, manipulation, and impersonation, and realistic fake news is already possible using a variety of techniques. As these models continue to evolve, detecting these fraudulent activities will become increasingly difficult, while the attacks themselves will become easier to automate and require less expertise. This creates significant challenges for preventing fraud and the uncontrolled spread of fake news.
Fri 1:00 p.m. - 1:10 p.m. | Adversarial Training Should Be Cast as a Non-Zero-Sum Game (Oral)
One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the surrogate-based relaxation commonly employed in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation naturally yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting.
Alex Robey · Fabian Latorre · George J. Pappas · Hamed Hassani · Volkan Cevher
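For readers unfamiliar with the zero-sum paradigm the abstract critiques, below is a minimal PyTorch-style sketch of standard adversarial training with a PGD inner maximization of the surrogate loss. It illustrates the min-max setup being criticized, not the authors' bilevel formulation, and all hyperparameters and names are illustrative.
```python
# Sketch of the standard two-player zero-sum adversarial training step
# (PGD inner maximization of the surrogate loss). Illustrative only.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: find an l_inf-bounded perturbation maximizing the surrogate loss."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: take a gradient step on the adversarially perturbed batch."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```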
Fri 1:10 p.m. - 1:20 p.m. | Evading Black-box Classifiers Without Breaking Eggs (Oral)
Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc.). Queries to such systems carry a fundamentally *asymmetric cost*: queries detected as "bad" come at a higher cost because they trigger additional security filters, e.g., usage throttling or account suspension. Yet, we find that existing decision-based attacks issue a large number of "bad" queries, which likely renders them ineffective against security-critical systems. We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries. We thus pose it as an open problem to build black-box attacks that are more effective under realistic cost metrics.
Edoardo Debenedetti · Nicholas Carlini · Florian Tramer
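The asymmetric cost metric argued for in the abstract can be made concrete with a small accounting wrapper around the black-box oracle. This is a hypothetical illustration of the bookkeeping only, not the authors' attack code; `classifier` and `bad_label` are placeholder names.
```python
# Illustrative accounting of the asymmetric query cost: "bad" queries are tallied
# separately because they carry a higher real-world cost than ordinary queries.
class CostAwareOracle:
    """Counts total queries and the costlier 'bad'-labeled queries separately."""
    def __init__(self, classifier, bad_label=1):
        self.classifier = classifier   # black-box decision function: x -> label
        self.bad_label = bad_label     # label that triggers extra security filters
        self.total_queries = 0
        self.bad_queries = 0

    def query(self, x):
        self.total_queries += 1
        label = self.classifier(x)
        if label == self.bad_label:
            self.bad_queries += 1      # expensive: may cause throttling or account suspension
        return label
```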
Fri 1:20 p.m. - 1:30 p.m. | Tunable Dual-Objective GANs for Stable Training (Oral)
In an effort to address the training instabilities of GANs, we introduce a class of dual-objective GANs with different value functions (objectives) for the generator (G) and discriminator (D). In particular, we model each objective using $\alpha$-loss, a tunable classification loss, to obtain $(\alpha_D,\alpha_G)$-GANs, parameterized by $(\alpha_D,\alpha_G)\in (0,\infty]^2$. For a sufficiently large number of samples and capacities for G and D, we show that the resulting non-zero-sum game simplifies to minimizing an $f$-divergence under appropriate conditions on $(\alpha_D,\alpha_G)$. We highlight the value of tuning $(\alpha_D,\alpha_G)$ in alleviating training instabilities for the synthetic 2D Gaussian mixture ring, the Celeb-A, and the LSUN Classroom datasets.
Monica Welfert · Kyle Otstot · Gowtham Kurri · Lalitha Sankar
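As background, one common parameterization of the tunable $\alpha$-loss on the probability assigned to the true class is sketched below; it recovers cross-entropy as $\alpha \to 1$ and a soft 0-1 loss as $\alpha \to \infty$. The exact form used in the $(\alpha_D,\alpha_G)$-GAN objectives may differ, so treat this as an assumption-laden illustration.
```python
# One common form of the tunable alpha-loss: alpha/(alpha-1) * (1 - p^((alpha-1)/alpha)).
# Illustrative; the paper's exact parameterization may differ.
import math

def alpha_loss(p, alpha):
    """Tunable loss on the probability p assigned to the true class."""
    if math.isclose(alpha, 1.0):
        return -math.log(p)                 # cross-entropy limit (alpha -> 1)
    if math.isinf(alpha):
        return 1.0 - p                      # soft 0-1 loss limit (alpha -> infinity)
    return (alpha / (alpha - 1.0)) * (1.0 - p ** ((alpha - 1.0) / alpha))

# Smaller alpha penalizes confident mistakes more heavily; alpha = infinity is the gentlest.
print(alpha_loss(0.9, 1.0), alpha_loss(0.9, float("inf")))
```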
Fri 1:30 p.m. - 2:00 p.m. | Jihun Hamm (Keynote)
Bio: Dr. Jihun Hamm has been an Associate Professor of Computer Science at Tulane University since 2019. He received his PhD from the University of Pennsylvania in 2008, supervised by Dr. Daniel Lee. Dr. Hamm's research interest is in machine learning, from theory to applications. He has worked on the theory and practice of robust learning, adversarial learning, privacy and security, optimization, and deep learning. Dr. Hamm also has a background in biomedical engineering and has worked on machine learning applications in medical data analysis. His work in machine learning has been published in top venues such as ICML, NeurIPS, CVPR, JMLR, and IEEE-TPAMI, as well as in medical research venues such as MICCAI, MedIA, and IEEE-TMI. Among other awards, he has earned the Best Paper Award from MedIA, was a finalist for the MICCAI Young Scientist Publication Impact Award, and received a Google Faculty Research Award.
Title: Analyzing Transfer Learning Bounds through Distributional Robustness
Abstract: The success of transfer learning at improving performance, especially with the use of large pre-trained models, has made transfer learning an essential tool in the machine learning toolbox. However, the conditions under which performance transferability to downstream tasks is possible are not very well understood. In this talk, I will present several approaches to bounding the target-domain classification loss through the distribution shift between the source and the target domains. For domain adaptation/generalization problems, where the source and the target task are the same, distribution shift as measured by Wasserstein distance is sufficient to predict the loss bound. Furthermore, distributional robustness improves predictability (i.e., a low bound), which may come at the price of a performance decrease. For transfer learning, where the source and the target task are different, distributions cannot be compared directly. We therefore propose a simple approach that transforms the source distribution (and classifier) by changing the class prior, label, and feature spaces. This allows us to relate the loss of the downstream task (i.e., transferability) to that of the source task. Wasserstein distance again plays an important role in the bound. I will show empirical results using state-of-the-art pre-trained models and demonstrate how factors such as task relatedness, pretraining method, and model architecture affect transferability.
Fri 2:00 p.m. - 2:30 p.m. | Kamalika Chaudhuri (Keynote)
Bio: Kamalika Chaudhuri is a Professor in the Department of Computer Science and Engineering at the University of California San Diego, and a Research Scientist in the FAIR team at Meta AI. Her research interests are in the foundations of trustworthy machine learning, which includes problems such as learning from sensitive data while preserving privacy, learning under sampling bias, and learning in the presence of an adversary. She is particularly interested in privacy-preserving machine learning, which addresses how to learn good models and predictors from sensitive data while preserving the privacy of individuals.
Title: Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning
Abstract: Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintentionally memorize specific parts of individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as déjà vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that déjà vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of déjà vu memorization reveals previously unknown privacy risks in SSL models and suggests potential practical mitigation strategies.
Fri 2:30 p.m. - 4:00 p.m. | Posters
Fri 4:00 p.m. - 4:30 p.m. | Zhangyang "Atlas" Wang (Keynote)
Bio: Atlas Wang (https://vita-group.github.io/) teaches and researches at UT Austin ECE (primary), CS, and Oden CSEM. He usually declares his research interest as machine learning, but is never too sure what that means concretely. He has won some awards, but is mainly proud of just three things: (1) he has done some (hopefully) thought-provoking and practically meaningful work on sparsity, from inverse problems to deep learning; his recent favorites include "essential sparsity", "junk DNA hypothesis", and "heavy-hitter oracle"; (2) he co-founded the Conference on Parsimony and Learning (CPAL), known to its community as the new "conference for sparsity", and serves as its inaugural program chair; (3) he is fortunate enough to work with a sizable group of world-class students, who are all smarter than he is. He has graduated 10 Ph.D. students who are well placed, including two new assistant professors, and his students have altogether won seven PhD fellowships besides many other honors.
Title: On the Complicated Romance between Sparsity and Robustness
Abstract: Prior work has observed that appropriate sparsity (or pruning) can improve the empirical robustness of deep neural networks (NNs). In this talk, I will introduce our recent findings extending this line of research. We first demonstrate that sparsity can be injected into adversarial training, either statically or dynamically, to reduce the robust generalization gap, besides significantly saving training and inference FLOPs. We then show that pruning can also improve certified robustness for ReLU-based NNs at scale, under the complete verification setting. Lastly, we theoretically characterize the complicated relationship between neural network sparsity and generalization. It is revealed that, as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization. Meanwhile, there also exists a large pruning fraction such that, while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing.
Fri 4:30 p.m. - 5:00 p.m. | Stacy Fay Hobson (Keynote)
Bio: Dr. Stacy Hobson is a Research Scientist at IBM Research and is the Director of the Responsible and Inclusive Technologies research group. Her group's research focuses on anticipating and understanding the impacts of technology on society and promoting tech practices that minimize harms, biases and other negative outcomes. Stacy's research has spanned multiple areas including topics such as addressing social inequities through technology, AI transparency, and data sharing platforms for governmental crisis management. Stacy has authored more than 20 peer-reviewed publications and holds 15 US patents. Stacy earned a Bachelor of Science degree in Computer Science from South Carolina State University, a Master of Science degree in Computer Science from Duke University and a PhD in Neuroscience and Cognitive Science from the University of Maryland at College Park.
Title: Addressing technology-mediated social harms
Abstract: Many technology efforts focus almost exclusively on the expected benefits that the resulting innovations may provide. Although there has been increased attention in past years on topics such as ethics, privacy, fairness and trust in AI, there still exists a wide gap between the aims of responsible innovation and what is occurring most often in practice. In this talk, I highlight the critical importance of proactively considering technology use in society, with focused attention on societal stakeholders, social impacts and socio-historical context, as the necessary foundation to anticipate and mitigate tech harms.
Fri 5:00 p.m. - 5:10 p.m. | Visual Adversarial Examples Jailbreak Aligned Large Language Models (Oral)
The growing interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) like Flamingo and GPT-4, is steering a convergence of vision and language foundation models. Yet, risks associated with this integration are largely unexamined. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the additional visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. To our surprise, we discover that a single visual adversarial example can universally jailbreak an aligned model, inducing it to heed a wide range of harmful instructions and generate harmful content far beyond merely imitating the derogatory corpus used to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. More broadly, our findings connect the long-studied fundamental adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend towards multimodality in frontier foundation models.
Xiangyu Qi · Kaixuan Huang · Ashwinee Panda · Mengdi Wang · Prateek Mittal
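At a high level, the attack recipe described in the abstract can be sketched as bounded optimization of an image perturbation against a corpus of attacker-chosen target outputs. The helper `vlm_nll` (negative log-likelihood of the target text given an image and prompt) is hypothetical, and this sketch is not the authors' implementation.
```python
# PGD-style optimization of a visual adversarial example against a vision-language model.
# `vlm_nll` is a hypothetical, model-specific helper; hyperparameters are illustrative.
import torch

def visual_jailbreak(vlm_nll, image, prompts, targets, eps=16/255, alpha=1/255, steps=500):
    """Optimize an l_inf-bounded image perturbation to make target outputs likely."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = sum(vlm_nll(image + delta, p, t) for p, t in zip(prompts, targets))
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()                       # lower the NLL of the target corpus
            delta.clamp_(-eps, eps)                            # keep the perturbation small
            delta.copy_((image + delta).clamp(0, 1) - image)   # keep pixel values valid
    return (image + delta).detach()
```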
Fri 5:10 p.m. - 5:20 p.m. | Learning Shared Safety Constraints from Multi-task Demonstrations (Oral)
Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks.
Konwoo Kim · Gokul Swamy · Zuxin Liu · Ding Zhao · Sanjiban Choudhury · Steven Wu
Fri 5:20 p.m. - 5:25 p.m. | MLSMM: Machine Learning Security Maturity Model (Bluesky Oral)
Assessing the maturity of security practices during the development of Machine Learning (ML) based software components has not gotten as much attention as it has in traditional software development. In this Blue Sky idea paper, we propose an initial Machine Learning Security Maturity Model (MLSMM), which organizes security practices along the ML development lifecycle and, for each, establishes three levels of maturity. We envision MLSMM as a step towards closer collaboration between industry and academia.
Felix Jedrzejewski · Davide Fucci · Oleksandr Adamov
Fri 5:25 p.m. - 5:30 p.m. | Deceptive Alignment Monitoring (Bluesky Oral)
As the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves. The threat that a model might behave in a seemingly reasonable manner, while secretly and subtly modifying its behavior for ulterior reasons, is often referred to as deceptive alignment in the AI Safety & Alignment communities. Consequently, we call this new direction Deceptive Alignment Monitoring. In this work, we identify emerging directions in diverse machine learning subfields that we believe will become increasingly important and intertwined in the near future for deceptive alignment monitoring, and we argue that advances in these fields present both long-term challenges and new research opportunities. We conclude by advocating for greater involvement by the adversarial machine learning community in these emerging directions.
Andres Carranza · Dhruv Pai · Rylan Schaeffer · Arnuv Tandon · Sanmi Koyejo
Fri 5:30 p.m. - 6:00 p.m. | Aditi Raghunathan (Keynote)
Bio: Aditi Raghunathan is an Assistant Professor at Carnegie Mellon University. She is interested in building robust ML systems with guarantees for trustworthy real-world deployment. Previously, she was a postdoctoral researcher at Berkeley AI Research, and received her PhD from Stanford University in 2021. Her research has been recognized by the Schmidt AI2050 Early Career Fellowship, the Arthur Samuel Best Thesis Award at Stanford, a Google PhD fellowship in machine learning, and an Open Philanthropy AI fellowship.
Title: Beyond Adversaries: Robustness to Distribution Shifts in the Wild
Abstract: Machine learning systems often fail catastrophically under the presence of distribution shift -- when the test distribution differs in some systematic way from the training distribution. Such shifts can sometimes be captured via an adversarial threat model, but in many cases, there is no convenient threat model that appropriately captures the "real-world" distribution shift. In this talk, we will first discuss how to measure the robustness to such distribution shifts despite the apparent lack of structure. Next, we discuss how to improve robustness to such shifts. The past few years have seen the rise of large models trained on broad data at scale that can be adapted to several downstream tasks (e.g. BERT, GPT, DALL-E). Via theory and experiments, we will see how such models open up new avenues but also require new techniques for improving robustness.
Fri 6:00 p.m. - 6:30 p.m. | Zico Kolter (Keynote)
Bio: Zico Kolter is an Associate Professor in the Computer Science Department at Carnegie Mellon University, and also serves as chief scientist of AI research for the Bosch Center for Artificial Intelligence. His work spans the intersection of machine learning and optimization, with a large focus on developing more robust and rigorous methods in deep learning. In addition, he has worked in a number of application areas, highlighted by work on sustainability and smart energy systems. He is a recipient of the DARPA Young Faculty Award, a Sloan Fellowship, and best paper awards at NeurIPS, ICML (honorable mention), AISTATS (test of time), IJCAI, KDD, and PESGM.
Title: Adversarial Attacks on Aligned LLMs
Abstract: In this talk, I'll discuss our recent work on generating adversarial attacks against public LLM tools, such as ChatGPT and Bard. Using combined gradient-based and greedy search on open-source LLMs, we find adversarial suffix strings that cause these models to ignore their "safety alignment" and answer potentially harmful user queries. And most surprisingly, we find that these adversarial prompts transfer amazingly well to closed-source, publicly-available models. I'll discuss the methodology and results of this attack, as well as what this may mean for the future of LLM robustness.
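A highly simplified sketch of the combined gradient-based and greedy search described in the talk is given below: token-embedding gradients propose candidate substitutions for positions in the adversarial suffix, and a greedy step keeps the substitution that most lowers the target loss. The callables `loss_fn` and `grad_fn` are placeholders for model-specific code, not a real library API.
```python
# One step of a gradient-guided greedy suffix search (illustrative; not the authors' code).
import torch

def greedy_suffix_step(suffix_ids, loss_fn, grad_fn, top_k=256, n_trials=64):
    """One search step: grad_fn ranks token substitutions, loss_fn evaluates candidates."""
    grad = grad_fn(suffix_ids)                        # [suffix_len, vocab_size] substitution scores
    candidates = (-grad).topk(top_k, dim=-1).indices  # most promising replacements per position
    best, best_loss = suffix_ids, loss_fn(suffix_ids)
    for _ in range(n_trials):
        pos = int(torch.randint(suffix_ids.numel(), (1,)))
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, int(torch.randint(top_k, (1,)))]
        loss = loss_fn(trial)
        if loss < best_loss:                          # keep the swap that best lowers the target loss
            best, best_loss = trial, loss
    return best, best_loss
```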
Fri 6:30 p.m. - 6:35 p.m. | How Can Neuroscience Help Us Build More Robust Deep Neural Networks? (Bluesky Oral)
Although Deep Neural Networks (DNNs) are often compared to biological visual systems, they are far less robust to natural and adversarial examples. In contrast, biological visual systems can reliably recognize different objects under a variety of settings. While recent innovations have closed the performance gap between biological and artificial vision systems to some extent, there are still many practical differences between the two. In this Blue Sky Ideas presentation, we will identify some key differences between standard DNNs and biological perceptual systems that may contribute to this lack of robustness. We will then present recent work on biologically-plausible, robust DNNs that are derived from and can be easily implemented on physical systems/neuromorphic hardware.
Sayanton Dibbo · Siddharth Mansingh · Jocelyn Rego · Garrett T Kenyon · Juston Moore · Michael Teti
Fri 6:35 p.m. - 6:40 p.m. | The Future of Cyber Systems: Human-AI Reinforcement Learning with Adversarial Robustness (Bluesky Oral)
Integrating adversarial machine learning (AML) with cyber data representations that support reinforcement learning would unlock human-AI systems with the capacity to dynamically defend against novel attacks, robustly, at machine speed, and with human intelligence. All machine learning (ML) has an underpinning need for robustness to natural errors and malicious tampering. However, unlike many consumer/commercial models, all ML systems built for cyber will be operating in an inherently adversarial environment, with skilled adversaries taking advantage of any flaw. This paper outlines the research challenges, integration points, and programmatic importance of such a system, while highlighting the social and scientific benefits of pursuing this ambitious program.
Nicole Nichols
Fri 6:40 p.m. - 6:45 p.m. | Announcement of AdvML Rising Star Award
Fri 6:45 p.m. - 7:00 p.m. | Tianlong Chen (Award Presentation)
Talk: How Does an Appropriate Sparsity Benefit Robustness?
Fri 7:00 p.m. - 7:15 p.m. | Vikash Sehwag (Award Presentation)
Talk: Uncovering and Mitigating Privacy Leakage in Large-scale Generative Models
Fri 7:15 p.m. - 8:00 p.m. | Posters
Fri 8:00 p.m. | Closing
Closing remarks.
The Challenge of Differentially Private Screening Rules (Poster)
Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data science. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models. However, despite the increasing need for privacy-preserving models for data analysis, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we discover difficulties in the task of making a useful private screening rule due to the amount of noise added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself and not the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method is an open problem in the differential privacy literature.
Amol Khanna · Fred Lu · Edward Raff
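For context, the sketch below shows a standard (non-private) sequential strong screening rule for the Lasso, plus the naive Laplace-noise perturbation one might add to its correlation statistic. The paper's point is precisely that such naive privatization is hard to make useful, so this is illustrative only and the rule's exact form is an assumption on my part.
```python
# Non-private strong screening rule for the Lasso, with an optional (naive) noisy variant.
import numpy as np

def strong_rule_screen(X, y, beta_prev, lam, lam_prev, noise_scale=0.0, rng=None):
    """Boolean mask of features to KEEP for the Lasso at regularization level lam."""
    rng = rng or np.random.default_rng()
    corr = np.abs(X.T @ (y - X @ beta_prev))       # correlation of each feature with the residual
    if noise_scale > 0:                            # naive Laplace perturbation (illustrative only)
        corr = corr + rng.laplace(scale=noise_scale, size=corr.shape)
    return corr >= 2 * lam - lam_prev              # strong rule: discard features strictly below
```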
Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance (Poster)
The reliability of post-training quantization (PTQ) methods in the face of extreme cases such as distribution shift and data noise remains largely unexplored, despite the popularity of PTQ as a method for compressing deep neural networks (DNNs) without altering their original architecture or training procedures. This paper conducts an investigation on commonly-used PTQ methods, addressing research questions pertaining to the impact of calibration set distribution variations, calibration paradigm selection, and data augmentation or sampling strategies on the reliability of PTQ. Through a systematic evaluation process encompassing various tasks and commonly-used PTQ paradigms, it is evident that the majority of existing PTQ methods lack the necessary reliability for worst-case group performance, underscoring the imperative for more robust approaches.
Zhihang Yuan · Jiawei Liu · Jiaxiang Wu · Dawei Yang · Qiang Wu · Guangyu Sun · Wenyu Liu · Xinggang Wang · Bingzhe Wu
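To make concrete what the calibration step under study does, here is a generic min-max post-training quantization sketch. It is not tied to any specific PTQ method evaluated in the paper; the calibration set is the input whose distributional sensitivity the benchmark probes.
```python
# Generic symmetric min-max PTQ calibration (illustrative only).
import numpy as np

def calibrate_and_quantize(calibration_activations, num_bits=8):
    """Estimate a dynamic range from calibration data, then simulate uniform quantization."""
    max_abs = float(np.max(np.abs(calibration_activations)))   # range estimate from calibration set
    scale = max_abs / (2 ** (num_bits - 1) - 1)

    def quantize(x):
        q = np.clip(np.round(x / scale), -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1)
        return q * scale                                        # dequantized (simulated) values

    return quantize
```
A shifted or noisy calibration set yields a different scale, which is one source of the worst-case reliability issues the benchmark examines.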
Benchmarking Adversarial Robustness of Compressed Deep Learning Models (Poster)
The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource-constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bridges this gap, seeking to understand the effect of adversarial inputs crafted for base models on their pruned versions. To examine this relationship, we have developed a comprehensive benchmark across diverse adversarial attacks and popular DNN models. We uniquely focus on models not previously exposed to adversarial training and apply pruning schemes optimized for accuracy and performance. Our findings reveal that while the benefits of pruning -- enhanced generalizability, compression, and faster inference times -- are preserved, adversarial robustness remains comparable to the base model. This suggests that model compression, while offering its unique advantages, does not undermine adversarial robustness.
Brijesh Vora · Kartik Patwari · Syed Mahbub Hafiz · Zubair Shafiq · Chen-Nee Chuah
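The central evaluation question of the abstract (whether adversarial examples crafted against a base model still fool its pruned version) can be phrased as a short loop; `attack` is any attack callable supplied by the caller, and all names are illustrative rather than the authors' benchmark code.
```python
# Transfer evaluation: craft adversarial examples on the base model, score the pruned model.
import torch

@torch.no_grad()
def transfer_robust_accuracy(base_model, pruned_model, attack, loader):
    """Accuracy of the pruned model on adversarial examples crafted against the base model."""
    correct, total = 0, 0
    for x, y in loader:
        with torch.enable_grad():
            x_adv = attack(base_model, x, y)        # attack sees only the base model
        pred = pruned_model(x_adv).argmax(dim=-1)   # evaluate transfer to the pruned model
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```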
Robustness through Data Augmentation Loss Consistency (Poster)
While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation.
Tianjian Huang · Shaunak Halbe · Chinnadhurai Sankar · Pooyan Amini · Satwik Kottur · Alborz Geramifard · Meisam Razaviyayn · Ahmad Beirami
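A hedged sketch of what loss-level consistency regularization in the spirit of DAIR might look like: the penalty compares per-example losses of the original and augmented samples rather than intermediate features, so it remains applicable when augmentation changes the label. The specific square-root penalty shown is an assumption for illustration, not necessarily the authors' exact objective.
```python
# Loss-level consistency regularization sketch (illustrative penalty form).
import torch
import torch.nn.functional as F

def dair_style_loss(model, x, y, x_aug, y_aug, lam=1.0):
    """ERM on both views plus a loss-level consistency penalty."""
    loss_clean = F.cross_entropy(model(x), y, reduction="none")
    loss_aug = F.cross_entropy(model(x_aug), y_aug, reduction="none")
    consistency = (loss_clean.sqrt() - loss_aug.sqrt()).pow(2).mean()   # compares losses, not features
    return 0.5 * (loss_clean.mean() + loss_aug.mean()) + lam * consistency
```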
Expressivity of Graph Neural Networks Through the Lens of Adversarial Robustness (Poster)
We perform the first adversarial robustness study into Graph Neural Networks (GNNs) that are provably more powerful than traditional Message Passing Neural Networks (MPNNs). In particular, we use adversarial robustness as a tool to uncover a significant gap between their theoretically possible and empirically achieved expressive power. To do so, we focus on the ability of GNNs to count specific subgraph patterns, which is an established measure of expressivity, and extend the concept of adversarial robustness to this task. Based on this, we develop efficient adversarial attacks for subgraph counting and show that more powerful GNNs fail to generalize even to small perturbations to the graph's structure. Expanding on this, we show that such architectures also fail to count substructures on out-of-distribution graphs.
Francesco Campi · Lukas Gosch · Tom Wollschläger · Yan Scholten · Stephan Günnemann
Provably Robust Cost-Sensitive Learning via Randomized Smoothing (Poster)
We focus on learning adversarially robust classifiers under cost-sensitive scenarios, where the potential harm of different classwise adversarial transformations is encoded in a cost matrix. Existing methods are either empirical, and thus cannot certify robustness, or suffer from inherent scalability issues. In this work, we study whether randomized smoothing, a scalable robustness certification framework, can be leveraged to certify cost-sensitive robustness. We first show how to extend the vanilla certification pipeline to provide rigorous guarantees for cost-sensitive robustness. However, when adapting the standard randomized smoothing method to train for cost-sensitive robustness, we observe that the naive reweighting scheme does not achieve a desirable performance due to the indirect optimization of the base classifier. Inspired by this observation, we propose a more direct training method with fine-grained certified radius optimization schemes designed for different data subgroups. Experiments on image benchmarks demonstrate that our method significantly improves certified cost-sensitive robustness without sacrificing overall accuracy.
Yuan Xin · Michael Backes · Xiao Zhang
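For background, the sketch below follows the vanilla randomized smoothing certification pipeline that the paper extends: estimate the smoothed classifier's top-class probability under Gaussian noise and convert a lower confidence bound into a certified l2 radius. A crude Hoeffding bound stands in for the tighter confidence bounds used in practice; the cost-sensitive extension changes which class confusions the certificate must rule out.
```python
# Monte-Carlo certification of a Gaussian-smoothed classifier (generic, illustrative).
import numpy as np
from scipy.stats import norm

def certify_radius(predict, num_classes, x, sigma=0.25, n_samples=1000, alpha=0.001, rng=None):
    """Certify the smoothed version of `predict` (a hard classifier x -> label) at input x."""
    rng = rng or np.random.default_rng()
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(n_samples):
        votes[predict(x + sigma * rng.standard_normal(x.shape))] += 1
    top = int(votes.argmax())
    # Crude Hoeffding lower confidence bound on the top-class probability.
    p_lower = votes[top] / n_samples - np.sqrt(np.log(1 / alpha) / (2 * n_samples))
    if p_lower <= 0.5:
        return top, 0.0                      # abstain: no certificate at this confidence
    return top, sigma * norm.ppf(p_lower)    # certified l2 radius around x
```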
Like Oil and Water: Group Robustness and Poisoning Defenses Don't Mix (Poster)
Group robustness has become a major concern in machine learning (ML) as conventional training paradigms were found to produce high error on minority groups. Without explicit group annotations, proposed solutions rely on heuristics that aim to identify and then amplify the minority samples during training. In our work, we first uncover a critical shortcoming of these heuristics: an inability to distinguish legitimate minority samples from poison samples in the training set. By amplifying poison samples as well, group robustness methods inadvertently boost the success rate of an adversary -- e.g., from 0% without amplification to over 97% with it. Moreover, scrutinizing recent poisoning defenses both in centralized and federated learning, we observe that they rely on similar heuristics to identify which samples should be eliminated as poisons. In consequence, minority samples are eliminated along with poisons, which damages group robustness -- e.g., from 55% without the removal of the minority samples to 41% with it. Finally, as they pursue opposing goals using similar heuristics, our attempts to conciliate group robustness and poisoning defenses come up short. We hope our work highlights how benchmark-driven ML scholarship can obscure the tensions between different metrics, potentially leading to harmful consequences.
Michael-Andrei Panaitescu-Liess · Yigitcan Kaya · Tudor Dumitras
Provable Instance Specific Robustness via Linear Constraints (Poster)
Deep Neural Networks (DNNs) trained for classification tasks are vulnerable to adversarial attacks, but not all classes are equally vulnerable, and adversarial training does not make all classes or groups equally robust either. For example, in classification tasks with long-tailed distributions, classes are asymmetrically affected during adversarial training, with lower robust accuracy for less frequent classes. In this regard, we propose a provable robustness method by leveraging the continuous piecewise-affine (CPA) nature of DNNs. Our method can impose linearity constraints on the decision boundary, as well as on the DNN CPA partition, without requiring any adversarial training. Using such constraints, we show that the margin between the decision boundary and minority classes can be increased in a provable manner. We also present qualitative and quantitative validation of our method for class-specific robustness.
Ahmed Imtiaz Humayun · Josue Casco-Rodriguez · Randall Balestriero · Richard Baraniuk
Adversarial Training in Continuous-Time Models and Irregularly Sampled Time-Series (Poster)
This study presents the first steps of exploring the effects of adversarial training on continuous-time models and irregularly sampled time series data. Historically, these models and sampling techniques have been largely neglected in adversarial learning research, leading to a significant gap in our understanding of their performance under adversarial conditions. To address this, we conducted an empirical study of adversarial training techniques applied to time-continuous model architectures and sampling methods. Our findings suggest that while standard continuous-time models tend to outperform their discrete counterparts (especially on irregularly sampled datasets), this performance advantage diminishes almost entirely when adversarial training is employed. This indicates that adversarial training may interfere with the time-continuous representation, effectively neutralizing the benefits typically associated with these models. We believe these insights will be critical in guiding further advancements in adversarial learning research for continuous-time models.
Alvin Li · Mathias Lechner · Alexander Amini · Daniela Rus
Few-shot Anomaly Detection via Personalization (Poster)
Even with plenty of normal samples, anomaly detection has been considered a challenging machine learning task due to its one-class nature, i.e., the lack of anomalous samples at training time. It is only recently that a few-shot regime of anomaly detection has become feasible in this regard, e.g., with help from large vision-language pre-trained models such as CLIP, despite its wide applicability. In this paper, we explore the potential of large text-to-image generative models in performing few-shot anomaly detection. Specifically, recent text-to-image models have shown an unprecedented ability to generalize from a few images to extract their common and unique concepts, and even to encode them into a textual token to "personalize" the model: so-called textual inversion. Here, we question whether this personalization is specific enough to discriminate the given images from their potential anomalies, which are often, e.g., open-ended, local, and hard to detect. We observe that standard textual inversion is not enough for detecting anomalies accurately, and thus we propose a simple yet effective regularization scheme to enhance its specificity, derived from the zero-shot transferability of CLIP. We also propose a self-tuning scheme to further optimize the performance of our detection pipeline, leveraging synthetic data generated from the personalized generative model. Our experiments show that the proposed inversion scheme achieves state-of-the-art results on a wide range of few-shot anomaly detection benchmarks.
Sangkyung Kwak · Jongheon Jeong · Hankook Lee · Woohyuck Kim · Jinwoo Shin
Rethinking Label Poisoning for GNNs: Pitfalls and Attacks (Poster)
Node labels for graphs are usually generated using an automated process or crowd-sourced from human users. This opens up avenues for malicious users to compromise the training labels, making it unwise to blindly rely on them. While robustness against noisy labels is an active area of research, there are only a handful of papers in the literature that address this for graph-based data, and the effects of adversarial label perturbations are even more sparsely studied. A recent work revealed that the entire literature on label poisoning for GNNs is plagued by serious evaluation pitfalls and showed that existing attacks become ineffective once these shortcomings are fixed. In this work, we introduce two new, simple yet effective attacks that are significantly stronger (up to $\sim8\%$) than the previous strongest attack. Our work demonstrates the need for more robust defense mechanisms, especially considering the \emph{transferability} of our attacks, where a strategy devised for one model can effectively contaminate numerous other models.
Vijay Lingam · Mohammad Sadegh Akhondzadeh · Aleksandar Bojchevski
Shrink & Cert: Bi-level Optimization for Certified Robustness (Poster)
In this paper, we advance the concept of shrinking weights to train certifiably robust models from the fresh perspective of gradient-based bi-level optimization. Lack of robustness against adversarial attacks remains a challenge in safety-critical applications. Many attempts in the literature only provide empirical verification of defenses against certain attacks and can be easily broken, while methods in other lines of work can only develop certified guarantees of model robustness in limited scenarios and are computationally expensive. We present a weight shrinkage formulation that is computationally inexpensive and can be solved as a simple first-order optimization problem. We show that a model trained with our method has lower Lipschitz bounds in each layer, which directly provides formal guarantees on certified robustness. We demonstrate that our approach, Shrink \& Cert (SaC), achieves provably robust networks which simultaneously give excellent standard and robust accuracy. We demonstrate the success of our approach on the CIFAR-10 and ImageNet datasets and compare it with existing robust training techniques. Code: \url{https://github.com/sagarverma/Shrink-and-Cert}
Kavya Gupta · Sagar Verma
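The layerwise Lipschitz quantity referenced in the abstract can be illustrated with the standard product-of-spectral-norms upper bound for feed-forward networks with 1-Lipschitz activations. This is the generic bound, not the authors' bi-level training procedure, and flattening convolution kernels into matrices is a crude approximation made here for brevity.
```python
# Generic (loose) Lipschitz upper bound: product of per-layer spectral norms.
import torch

def lipschitz_upper_bound(model):
    """Multiply the spectral norms of all weight tensors with at least 2 dimensions."""
    bound = 1.0
    for p in model.parameters():
        if p.dim() >= 2:                      # weight matrices and conv kernels
            w = p.flatten(start_dim=1)        # crude: treat a conv kernel as a matrix
            bound *= torch.linalg.matrix_norm(w, ord=2).item()
    return bound
```
Shrinking weights reduces each factor in this product, which is the intuition behind how lower per-layer Lipschitz bounds translate into certified robustness guarantees.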
Preventing Reward Hacking with Occupancy Measure Regularization (Poster)
Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent's state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment.
Cassidy Laidlaw · Shivam Singhal · Anca Dragan
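Schematically, the two regularizers contrasted in the abstract can be written as follows, with $J(\pi)$ the expected return, $D$ any divergence, $\lambda$ a regularization weight, and $\rho_\pi$ the discounted state-action occupancy measure. The notation is a paraphrase of the abstract, not the authors' exact objective.
```latex
% Action-distribution regularization compares policies state by state:
\max_\pi \; J(\pi) \;-\; \lambda \, \mathbb{E}_{s \sim \rho_\pi}\!\left[ D\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{safe}}(\cdot \mid s)\big) \right]
% Occupancy-measure regularization compares long-run state-action visitation:
\max_\pi \; J(\pi) \;-\; \lambda \, D\big(\rho_\pi \,\|\, \rho_{\pi_{\mathrm{safe}}}\big)
% where \rho_\pi(s,a) \propto \sum_t \gamma^t \Pr(s_t = s, a_t = a) under policy \pi.
```
The second form is sensitive to actions that rarely differ in distribution but drive the agent into very different states, which is the failure mode of the first form that the abstract highlights.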
Baselines for Identifying Watermarked Large Language Models (Poster)
We consider the emerging problem of identifying the presence of watermarking schemes in publicly hosted, closed-source large language models (LLMs). Rather than determine whether a given text is generated by a watermarked language model, we seek to answer the question of whether the model itself is watermarked. We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce token distributions that diverge qualitatively and identifiably from standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to the watermarking scenario.
Leonard Tang · Gavin Uberti · Tom Shlomi
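One of the simplest baselines consistent with the abstract is to compare the empirical output-token distribution of a suspect model against that of a reference unmarked model. The sketch below is a hypothetical illustration: the sampling interface, smoothing, and any threshold are all assumptions, not the authors' suite.
```python
# Compare empirical next-token distributions of a suspect and a reference model.
import numpy as np

def token_distribution(sample_next_token, prompts, vocab_size, samples_per_prompt=100):
    """Empirical next-token distribution of a model over a fixed prompt set."""
    counts = np.zeros(vocab_size)
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            counts[sample_next_token(prompt)] += 1     # sample_next_token returns a token id
    return (counts + 1) / (counts.sum() + vocab_size)  # Laplace smoothing

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

# A suspect model whose distribution diverges strongly and consistently from the
# unmarked reference (kl_divergence above some threshold) is flagged as likely watermarked.
```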
Why do universal adversarial attacks work on large language models?: Geometry might be the answer (Poster)
Transformer based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations. We believe this new geometric perspective on the underlying mechanism driving universal attacks could help us gain deeper insight into the internal workings and failure modes of LLMs, thus enabling their mitigation.
Varshini Subhash · Anna Bialas · Siddharth Swaroop · Weiwei Pan · Finale Doshi-Velez
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation (Poster)
We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks. Its primary goal is advancing the understanding and mitigation of adversarial attacks. FACADE aims to generate probabilistic distributions over circuits, which provide critical insights into their contribution to changes in the manifold properties of pseudo-classes, or high-dimensional modes in activation space, yielding a powerful tool for uncovering and combating adversarial attacks. Our approach seeks to improve model robustness and enhance scalable model oversight, and it demonstrates promising applications in real-world deployment settings.
Dhruv Pai · Andres Carranza · Rylan Schaeffer · Arnuv Tandon · Sanmi Koyejo
Robust Deep Learning via Layerwise Tilted Exponentials (Poster)
State-of-the-art techniques for enhancing robustness of deep networks mostly rely on empirical risk minimization. In this paper, we propose a complementary approach aimed at enhancing the signal-to-noise ratio at intermediate network layers, loosely motivated by the classical communication-theoretic model of signaling in a noisy channel. We seek to learn neuronal weights which are matched to the layer inputs by supplementing end-to-end costs with a tilted exponential (TEXP) objective function which depends on the activations at the layer outputs. We show that TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. TEXP inference is accomplished by replacing batch norm by a tilted softmax enforcing competition across neurons, which can be interpreted as computation of posterior probabilities for the signaling hypotheses represented by each neuron. We show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise, other common corruptions and mild adversarial perturbations, without requiring data augmentation. Further gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with adversarial training.
Bhagyashree Puranik · Ahmad Beirami · Yao Qin · Upamanyu Madhow
Teach GPT To Phish (Poster)
Quantifying privacy risks in large language models (LLMs) is an important research question. We take a step towards answering this question by defining a real-world threat model wherein an entity seeks to augment an LLM with private data they possess via fine-tuning. The entity also seeks to improve the quality of its LLM outputs over time by learning from human feedback. We propose a novel …
Ashwinee Panda · Zhengming Zhang · Yaoqing Yang · Prateek Mittal
Physics-oriented adversarial attacks on SAR image target recognition (Poster)
SAR target recognition algorithms based on deep neural networks are widely used in key tasks such as wartime reconnaissance and environmental monitoring, but the security of SAR systems is also vulnerable to adversarial examples. The imaging process for SAR images in the physical world is dissimilar to that of optical images, because SAR imaging is governed solely by imaging equations rather than the what-you-see-is-what-you-get principle. As a result, generating SAR adversarial examples in the physical world requires considering the changes in the SAR imaging equations that occur after deploying physical devices. Thus, this study proposes a physics-oriented adversarial attack on SAR image target recognition. The proposed algorithm distinguishes itself through two key features: (1) SAR-BagNet is utilized to identify the salient regions of SAR targets recognized by classifiers, allowing exact determination of the position and size of the adversarial scatterers and enhancing interpretability; (2) dynamic step-size optimization, based on the difference equation, continuously refines the electromagnetic, structural, and texture parameters of the adversarial scatterers, leading to higher search efficiency. In simulation experiments, the generated adversarial examples reduce the accuracy of the classifier on the simulated images from 100% to 14.4%, verifying the method proposed in this paper.
Jiahao Cui · wang Guo · Run Shao · tiandong Shi · Haifeng Li
Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage (Poster)
Machine learning models are increasingly utilized across impactful domains to predict individual outcomes. As such, many models provide algorithmic recourse to individuals who receive negative outcomes. However, recourse can be leveraged by adversaries to disclose private information. This work presents the first attempt at mitigating such attacks. We present two novel methods to generate differentially private recourse: Differentially Private Model ($\texttt{DPM}$) and Laplace Recourse ($\texttt{LR}$). Using logistic regression classifiers and real world and synthetic datasets, we find that $\texttt{DPM}$ and $\texttt{LR}$ perform well in reducing what an adversary can infer, especially at low $\texttt{FPR}$. When training dataset size is large enough, we find particular success in preventing privacy leakage while maintaining model and recourse accuracy with our novel $\texttt{LR}$ method.
Catherine Huang · Chelse Swoopes · Christina Xiao · Jiaqi Ma · Himabindu Lakkaraju
Theoretically Principled Trade-off for Stateful Defenses against Query-Based Black-Box Attacks (Poster)
Adversarial examples threaten the integrity of machine learning systems with alarming success rates even under constrained black-box conditions. Stateful defenses have emerged as an effective countermeasure, detecting potential attacks by maintaining a buffer of recent queries and detecting new queries that are too similar. However, these defenses fundamentally pose a trade-off between attack detection and false positive rates, and this trade-off is typically optimized by hand-picking feature extractors and similarity thresholds that empirically work well. There is little current understanding as to the formal limits of this trade-off and the exact properties of the feature extractors/underlying problem domain that influence it. This work aims to address this gap by offering a theoretical characterization of the trade-off between detection and false positive rates for stateful defenses. We provide upper bounds for detection rates of a general class of feature extractors and analyze the impact of this trade-off on the convergence of black-box attacks. We then support our theoretical findings with empirical evaluations across multiple datasets and stateful defenses.
Ashish Hooda · Neal Mangaokar · Ryan Feng · Kassem Fawaz · Somesh Jha · Atul Prakash
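The stateful defense being analyzed can be summarized in a few lines: keep a buffer of recent query embeddings and flag any new query whose nearest buffered neighbor falls within a similarity threshold. The feature extractor and threshold below are the hand-picked components whose detection/false-positive trade-off the paper characterizes; this is a generic sketch, not a specific published defense.
```python
# Generic stateful detector: similarity check against a buffer of recent queries.
import numpy as np
from collections import deque

class StatefulDetector:
    """Flags a query if it is too similar to any recently buffered query."""
    def __init__(self, feature_extractor, threshold=0.1, buffer_size=1000):
        self.extract = feature_extractor     # hand-picked feature extractor
        self.threshold = threshold           # hand-picked similarity threshold
        self.buffer = deque(maxlen=buffer_size)

    def is_attack_query(self, x):
        z = self.extract(x)
        flagged = any(np.linalg.norm(z - b) < self.threshold for b in self.buffer)
        self.buffer.append(z)
        return flagged
```
Raising the threshold catches more attack queries but also flags more benign near-duplicates, which is exactly the detection versus false-positive trade-off the paper bounds.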
-
|
DiffScene: Diffusion-Based Safety-Critical Scenario Generation for Autonomous Vehicles
(
Poster
)
>
link
The field of Autonomous Driving (AD) has witnessed significant progress in recent years. Among the various challenges faced, the safety evaluation of autonomous vehicles (AVs) stands out as a critical concern. Traditional evaluation methods are both costly and inefficient, often requiring extensive driving mileage in order to encounter rare safety-critical scenarios, which are distributed on the long tail of the complex real-world driving landscape. In this paper, we propose a unified approach, Diffusion-Based Safety-Critical Scenario Generation (DiffScene), to generate high-quality safety-critical scenarios which are both realistic and safety-critical for efficient AV evaluation. In particular, we propose a diffusion-based generation framework, leveraging the power of approximating the distribution of low-density spaces for diffusion models. We design several adversarial optimization objectives to guide the diffusion generation under predefined adversarial budgets. These objectives, such as safety-based objective, functionality-based objective, and constraint-based objective, ensure the generation of safety-critical scenarios while adhering to specific constraints. Extensive experimentation has been conducted to validate the efficacy of our approach. Compared with 6 SOTA baselines, DiffScene generates scenarios that are (1) more safety-critical under 3 metrics, (2) more realistic under 5 distance functions, and (3) more transferable to different AV algorithms. In addition, we demonstrate that training AV algorithms with scenarios generated by DiffScene leads to significantly higher performance in terms of the safety-critical metrics compared to baselines. These findings highlight the potential of DiffScene in addressing the challenges of AV safety evaluation, paving the way for more efficient and effective AV development. |
Chejian Xu · Ding Zhao · Alberto Sangiovanni Vincentelli · Bo Li 🔗 |
-
|
Improving Adversarial Training for Multiple Perturbations through the Lens of Uniform Stability
(
Poster
)
>
link
In adversarial training (AT), most existing works focus on AT with a single type of perturbation, such as the $\ell_\infty$ attacks. However, deep neural networks (DNNs) are vulnerable to different types of adversarial examples, necessitating the development of adversarial training for multiple perturbations (ATMP). Despite the benefits of ATMP, there exists a trade-off between different types of attacks. Furthermore, there is a lack of theoretical analyses of ATMP, which hinders its further development. To address these issues, we conduct a smoothness analysis of ATMP. Our analysis reveals that $\ell_1$, $\ell_2$, and $\ell_\infty$ adversaries contribute differently to the smoothness of the loss function in ATMP. Leveraging these smoothness properties, we investigate the improvement of ATMP through the lens of uniform stability. Through our research, we demonstrate that employing an adaptive smoothness-weighted learning rate leads to enhanced uniform stability bounds, thus improving adversarial training for multiple perturbations. We validate our findings through experiments on CIFAR-10 and CIFAR-100 datasets, where our approach achieves competitive performance against various mixtures of multiple perturbation attacks. This work contributes to a deeper understanding of ATMP and provides practical insights for improving the robustness of DNNs against diverse adversarial examples.
|
Jiancong Xiao · Zeyu Qin · Yanbo Fan · Baoyuan Wu · Jue Wang · Zhi-Quan Luo 🔗 |
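A hedged sketch of what a "smoothness-weighted learning rate" could look like in practice: scale the step size inversely with a per-attack smoothness constant. The constants below are illustrative placeholders; the paper derives how $\ell_1$, $\ell_2$, and $\ell_\infty$ adversaries affect smoothness, which these numbers do not reproduce.

# Illustrative placeholders only; not the paper's derived constants.
def smoothness_weighted_lr(base_lr, smoothness_by_attack, attack_type):
    beta = smoothness_by_attack[attack_type]    # larger beta => less smooth loss => smaller step
    return base_lr / beta

smoothness_by_attack = {"linf": 4.0, "l2": 2.0, "l1": 1.5}
for atk in ("l1", "l2", "linf"):
    print(atk, smoothness_weighted_lr(0.1, smoothness_by_attack, atk))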
-
|
A Theoretical Perspective on the Robustness of Feature Extractors
(
Poster
)
>
link
Recent theoretical work on robustness to adversarial examples has derived lower bounds on how robust any model can be when the distribution and adversarial constraints are specified. However, these bounds do not account for the specific models used in practice, such as neural networks. In this paper, we develop a methodology to analyze the fundamental limits on the robustness of fixed feature extractors, which in turn provides bounds on the robustness of classifiers trained on top of them. The tightness of these bounds relies on the effectiveness of the method used to find collisions between pairs of perturbed examples at deeper layers. For linear feature extractors, we provide closed-form expressions for collision finding while for piece-wise linear feature extractors, we propose a bespoke algorithm based on the iterative solution of a convex program that provably finds collisions. We utilize our bounds to identify structural features of classifiers that lead to a lack of robustness and provide insights into the effectiveness of different training methods at obtaining robust feature extractors. |
Arjun Nitin Bhagoji · Daniel Cullina · Ben Zhao 🔗 |
-
|
Characterizing the Optimal $0-1$ Loss for Multi-class Classification with a Test-time Attacker
(
Poster
)
>
link
Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a fixed data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on robust loss in the presence of a test-time attacker for *multi-class classifiers on any discrete dataset*. We provide a general framework for finding the optimal $0-1$ loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. The prohibitive cost of this formulation in practice leads us to formulate other variants of the attacker-classifier game that more efficiently determine the range of the optimal loss. Our evaluation provides, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.
|
Sophie Dai · Wenxin Ding · Arjun Nitin Bhagoji · Daniel Cullina · Ben Zhao · Heather Zheng · Prateek Mittal 🔗 |
-
|
RODEO: Robust Out-of-distribution Detection via Exposing Adaptive Outliers
(
Poster
)
>
link
Detecting out-of-distribution (OOD) input samples at the inference time is a key element in the trustworthy deployment of intelligent models. While there has been a tremendous improvement in various flavors of OOD detection in recent years, the detection performance under adversarial settings lags far behind the performance in the standard setting. In order to bridge this gap, we introduce RODEO in this paper, a data-centric approach that generates effective outliers for robust OOD detection. More specifically, we first show that targeting the classification of adversarially perturbed in- and out-of-distribution samples through outlier exposure (OE) could be an effective strategy for the mentioned purpose as long as the training outliers meet certain quality standards. We hypothesize that the outliers in the OE should possess several characteristics simultaneously to be effective in the adversarial training: diversity, and both conceptual differentiability and analogy to the inliers. These aspects seem to play a more critical role in the adversarial setup compared to the standard training. Next, we propose to take advantage of existing text-to-image generative models, conditioned on the inlier or normal samples, and text prompts that minimally edit the normal samples, and turn them into near-distribution outliers. This process helps to satisfy the three mentioned criteria for the generated outliers, and significantly boosts the performance of OE, especially in the adversarial setting. We demonstrate the general effectiveness of this approach in various related problems including novelty/anomaly detection, Open-Set Recognition (OSR), and OOD detection. We also make a comprehensive comparison of our method against other adaptive OE techniques under the adversarial setting to showcase its effectiveness. |
Hossein Mirzaei · Mohammad Jafari · Hamid Reza Dehbashi · Ali Ansari · Sepehr Ghobadi · Masoud Hadi · Arshia Soltani Moakhar · Mohammad Azizmalayeri · Mahdieh Soleymani Baghshah · Mohammad H Rohban 🔗 |
-
|
Rethinking Robust Contrastive Learning from the Adversarial Perspective
(
Poster
)
>
link
To advance the understanding of robust deep learning, we delve into the effects of adversarial training on self-supervised and supervised contrastive learning, alongside supervised learning. Our analysis uncovers significant disparities between adversarial and clean representations in standard-trained networks, across various learning algorithms. Remarkably, adversarial training mitigates these disparities and fosters the convergence of representations toward a universal set, regardless of the learning scheme used. Additionally, we observe that increasing the similarity between adversarial and clean representations, particularly near the end of the network, enhances network robustness. These findings offer valuable insights for designing and training effective and robust deep learning networks. |
Fatemeh Ghofrani · Mehdi Yaghouti · Pooyan Jamshidi 🔗 |
-
|
TMI! Finetuned Models Spill Secrets from Pretraining
(
Poster
)
>
link
Transfer learning has become an increasingly popular technique in machine learning as a way to leverage a pretrained model trained for related tasks. This paradigm has been especially popular for \emph{privacy preserving machine learning}, where the pretrained model is considered public, and only the data for finetuning is considered sensitive. However, there are reasons to believe that the data used for pretraining is still sensitive. In this work we study privacy leakage via membership-inference attacks, and we propose a new threat model where the adversary only has access to the finetuned model and would like to infer the membership of the pretraining data. To realize this threat model, we implement a novel metaclassifier-based attack, TMI. We evaluate TMI on both vision and natural language tasks across multiple transfer learning settings, including finetuning with differential privacy. Through our evaluation, we find that TMI can successfully infer membership of pretraining examples using query access to the finetuned model. |
John Abascal · Stanley Wu · Alina Oprea · Jonathan Ullman 🔗 |
-
|
A First Order Meta Stackelberg Method for Robust Federated Learning
(
Poster
)
>
link
Previous research has shown that federated learning (FL) systems are exposed to an array of security risks. Despite the proposal of several defensive strategies, they tend to be non-adaptive and specific to certain types of attacks, rendering them ineffective against unpredictable or adaptive threats. This work models adversarial federated learning as a Bayesian Stackelberg Markov game (BSMG) to capture the defender's incomplete information of various attack types. We propose meta-Stackelberg learning (meta-SL), a provably efficient meta-learning algorithm, to solve the equilibrium strategy in BSMG, leading to an adaptable FL defense. We demonstrate that meta-SL converges to the first-order $\varepsilon$-equilibrium point in $O(\varepsilon^{-2})$ gradient iterations, with $O(\varepsilon^{-4})$ samples needed per iteration, matching the state of the art. Empirical evidence indicates that our meta-Stackelberg framework performs exceptionally well against potent model poisoning and backdoor attacks of an uncertain nature.
|
Yunian Pan · Tao Li · Henger Li · Tianyi Xu · Quanyan Zhu · Zizhan Zheng 🔗 |
-
|
Backdoor Attacks for In-Context Learning with Language Models
(
Poster
)
>
link
Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally, we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone. |
Nikhil Kandpal · Matthew Jagielski · Florian Tramer · Nicholas Carlini 🔗 |
-
|
R-LPIPS: An Adversarially Robust Perceptual Similarity Metric
(
Poster
)
>
link
Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. |
Sara Ghazanfari · Siddharth Garg · Prashanth Krishnamurthy · Farshad Khorrami · Alexandre Araujo 🔗 |
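A sketch of the general recipe behind a robust perceptual metric as described above: compute an LPIPS-style feature distance, but extract the deep features with an adversarially trained backbone. The backbone choice, the layers used, and the absence of learned per-channel weights are simplifications for illustration; in practice one would load adversarially trained weights.

import torch
import torchvision.models as models

robust_backbone = models.resnet18(weights=None)   # load adversarially trained weights in practice
robust_backbone.eval()

def features(x, layers=("layer1", "layer2", "layer3")):
    feats, out = [], x
    for name, module in robust_backbone.named_children():
        out = module(out)
        if name in layers:
            feats.append(out)
        if name == "layer3":
            break
    return feats

def perceptual_distance(x1, x2):
    with torch.no_grad():
        d = 0.0
        for f1, f2 in zip(features(x1), features(x2)):
            f1 = f1 / (f1.norm(dim=1, keepdim=True) + 1e-8)   # channel-wise unit-normalization
            f2 = f2 / (f2.norm(dim=1, keepdim=True) + 1e-8)
            d = d + ((f1 - f2) ** 2).mean()
        return d

x1, x2 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(perceptual_distance(x1, x2))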
-
|
Risk-Averse Predictions on Unseen Domains via Neural Style Smoothing
(
Poster
)
>
link
Achieving high accuracy on data from domains unseen during training is a fundamental challenge in machine learning. While state-of-the-art neural networks have achieved impressive performance on various tasks, their predictions are biased towards domain-dependent information (e.g., image styles) rather than domain-invariant information (e.g., image content). This makes them unreliable for deployment in risk-sensitive settings such as autonomous driving. In this work, we propose a novel inference procedure, Test-Time Neural Style Smoothing (TT-NSS), that produces risk-averse predictions using a ``style smoothed'' version of a classifier. Specifically, the style smoothed classifier classifies a test image as the most probable class predicted by the original classifier on random re-stylizations of the test image. TT-NSS uses a neural style transfer module to stylize the test image on the fly, requires black-box access to the classifier, and crucially, abstains when predictions of the original classifier on the stylized images lack consensus. We further propose a neural style smoothing-based training procedure that improves the prediction consistency and the performance of the style-smoothed classifier on non-abstained samples. Our experiments on the PACS dataset and its variations, in both single- and multiple-domain settings, highlight the effectiveness of our methods at producing risk-averse predictions on unseen domains. |
Akshay Mehra · Yunbei Zhang · Bhavya Kailkhura · Jihun Hamm 🔗 |
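A minimal sketch of the test-time style-smoothing procedure described above: classify random re-stylizations of the test image, take the majority vote, and abstain when the vote is not decisive. The `stylize` function is a stand-in for a neural style transfer module, and the consensus threshold is an illustrative choice.

import torch

def style_smoothed_predict(classifier, stylize, x, n_samples=32, consensus=0.6):
    votes = []
    with torch.no_grad():
        for _ in range(n_samples):
            x_styled = stylize(x)                          # random re-stylization of the input
            votes.append(classifier(x_styled).argmax(dim=1))
    votes = torch.cat(votes)
    top_class = votes.mode().values.item()
    agreement = (votes == top_class).float().mean().item()
    return top_class if agreement >= consensus else None   # None => abstain

# toy usage with stand-ins for the classifier and the style module
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
stylize = lambda x: x + 0.1 * torch.randn_like(x)          # placeholder "style" perturbation
x = torch.rand(1, 3, 32, 32)
print(style_smoothed_predict(classifier, stylize, x))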
-
|
A Simple and Yet Fairly Effective Defense for Graph Neural Networks
(
Poster
)
>
link
Graph neural networks (GNNs) have become the standard approach for performing machine learning on graphs. However, concerns have been raised regarding their vulnerability to small adversarial perturbations. Existing defense methods suffer from high time complexity and can negatively impact the model's performance on clean graphs. In this paper, we propose NoisyGCN, a defense method that injects noise into the GCN architecture. We derive a mathematical upper bound linking GCN's robustness to noise injection, establishing our method's effectiveness. Through empirical evaluations on the node classification task, we demonstrate superior or comparable performance to existing methods while minimizing the added time complexity. |
Sofiane ENNADIR · Yassine Abbahaddou · Michalis Vazirgiannis · Henrik Boström 🔗 |
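A hedged sketch of the noise-injection mechanism the abstract describes: add Gaussian noise to the hidden representations of a GCN layer. The layer placement, noise scale, and the choice to inject noise only during training are assumptions for illustration, not necessarily NoisyGCN's exact design.

import torch
import torch.nn as nn

class NoisyGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, noise_std=0.1):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.noise_std = noise_std

    def forward(self, x, adj_norm):
        h = adj_norm @ self.lin(x)                              # standard GCN propagation: \hat{A} X W
        if self.training and self.noise_std > 0:
            h = h + self.noise_std * torch.randn_like(h)        # noise injection for robustness
        return torch.relu(h)

# toy usage: 4 nodes, 8 input features, row-normalized adjacency with self-loops
x = torch.rand(4, 8)
adj = torch.eye(4) + torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
adj_norm = adj / adj.sum(dim=1, keepdim=True)
layer = NoisyGCNLayer(8, 16)
print(layer(x, adj_norm).shape)    # torch.Size([4, 16])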
-
|
Incentivizing Honesty among Competitors in Collaborative Learning
(
Poster
)
>
link
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, thus preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning. |
Florian Dorner · Nikola Konstantinov · Georgi Pashaliev · Martin Vechev 🔗 |
-
|
Towards Effective Data Poisoning for Imbalanced Classification
(
Poster
)
>
link
Targeted Clean-label Data Poisoning Attacks (TCDPA) aim to manipulate training samples in a label-consistent manner to gain malicious control over targeted samples' output during deployment. A prominent class of TCDPA methods, gradient-matching based data-poisoning methods, utilizes a small subset of training class samples to match the poisoned gradient of a target sample. However, their effectiveness is limited when attacking imbalanced datasets because of gradient mismatch caused by training-time data-balancing techniques such as Re-weighting and Re-sampling. In this paper, we propose two modifications that eliminate this gradient mismatch and thereby enhance the efficacy of gradient-matching-based TCDPA on imbalanced datasets. Our methods achieve notable improvements of up to 32% (Re-sampling) and 51% (Re-weighting) in terms of Attack Effect Success Rate on MNIST and CIFAR10. |
Snigdha Sushil Mishra · Hao He · Hao Wang 🔗 |
-
|
Black Box Adversarial Prompting for Foundation Models
(
Poster
)
>
link
Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or generating high perplexity text. |
Natalie Maus · Patrick Chao · Eric Wong · Jacob Gardner 🔗 |
-
|
Exposing the Fake: Effective Diffusion-Generated Images Detection
(
Poster
)
>
link
Image synthesis has seen significant advancements with the advent of diffusion-based generative models like Denoising Diffusion Probabilistic Models (DDPM) and text-to-image diffusion models. Despite their efficacy, there is a dearth of research dedicated to detecting diffusion-generated images, which could pose potential security and privacy risks. This paper addresses this gap by proposing a novel detection method called Stepwise Error for Diffusion-generated Image Detection (SeDID). Comprising statistical-based SeDID and neural network-based SeDID, SeDID exploits the unique attributes of diffusion models, namely deterministic reverse and deterministic denoising computation errors. Our evaluations demonstrate SeDID's superior performance over existing methods when applied to diffusion models. Thus, our work makes a pivotal contribution to distinguishing diffusion model-generated images, marking a significant step in the domain of artificial intelligence security. |
RuiPeng Ma · Jinhao Duan · Fei Kong · Xiaoshuang Shi · Kaidi Xu 🔗 |
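A hedged sketch of the statistical flavor of the detection idea described above: diffuse an image to an intermediate timestep, denoise it back with the diffusion model, and use the reconstruction error as a detection score, on the intuition that generated images are reconstructed more faithfully than real ones. `diffuse` and `denoise` are stand-ins for a real DDPM's forward and reverse computations; the timestep and threshold are illustrative.

import torch

def detection_score(x, diffuse, denoise, t=250):
    with torch.no_grad():
        x_t = diffuse(x, t)               # forward (noising) process up to step t
        x_hat = denoise(x_t, t)           # deterministic reverse/denoising back to step 0
        return ((x - x_hat) ** 2).mean().item()   # small error => more likely diffusion-generated

def is_diffusion_generated(x, diffuse, denoise, threshold=0.01):
    return detection_score(x, diffuse, denoise) < threshold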
-
|
AdversNLP: A Practical Guide to Assessing NLP Robustness Against Text Adversarial Attacks
(
Poster
)
>
link
The emergence of powerful language models in natural language processing (NLP) has sparked a wave of excitement for their potential to revolutionize decision-making. However, this excitement should be tempered by their vulnerability to adversarial attacks, which are carefully perturbed inputs able to fool the model into inaccurate decisions. In this paper, we present AdversNLP, a practical framework to assess the robustness of NLP applications against text-based adversaries. Our framework combines and extends upon the technical capabilities of established NLP adversarial attacking tools (i.e., TextAttack) and tailors an audit guide to navigate the landscape of threats to NLP applications. AdversNLP illustrates best practices and vulnerabilities through customized attack recipes, and presents evaluation metrics in the form of key performance indicators (KPIs). Our study demonstrates the severity of the threat posed by adversarial attacks and the need for more initiatives bridging the gap between research contributions and industrial applications. |
Othmane BELMOUKADAM 🔗 |
-
|
Proximal Compositional Optimization for Distributionally Robust Learning
(
Poster
)
>
link
Recently, compositional optimization (CO) has gained popularity because of its applications in distributionally robust optimization (DRO) and many other machine learning problems. Often (non-smooth) regularization terms are added to an objective to impose some structure and/or improve the generalization performance of the learned model. However, when it comes to CO, there is a lack of efficient algorithms that can solve regularized CO problems. Moreover, current state-of-the-art methods to solve such problems rely on the computation of large batch gradients (depending on the solution accuracy), which is not feasible for most practical settings. To address these challenges, in this work, we consider a certain regularized version of the CO problem that often arises in DRO formulations and develop a proximal algorithm for solving the problem. We perform a Moreau envelope-based analysis and establish that, without the need to compute large batch gradients, our proposed algorithm achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity, which matches the vanilla SGD guarantees for solving non-CO problems. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.
|
Prashant Khanduri · Chengyin Li · RAFI IBN SULTAN · Yao Qiang · Joerg Kliewer · Dongxiao Zhu 🔗 |
-
|
PIAT: Parameter Interpolation based Adversarial Training for Image Classification
(
Poster
)
>
link
Adversarial training has been demonstrated to be the most effective approach to defend against adversarial attacks. However, existing adversarial training methods show apparent oscillations and overfitting issues in the training process, degrading the defense efficacy. In this work, we propose a novel framework, termed Parameter Interpolation based Adversarial Training (PIAT), that makes full use of the historical information during training. Specifically, at the end of each epoch, PIAT tunes the model parameters as the interpolation of the parameters of the previous and current epochs. Besides, we suggest using the Normalized Mean Square Error (NMSE) to further improve the robustness by aligning the relative magnitude of logits between clean and adversarial examples, rather than the absolute magnitude. Extensive experiments on several benchmark datasets and various networks show that our framework can prominently improve the model robustness and reduce the generalization error. |
Kun He · Xin Liu · Yichen Yang · Zhou Qin · Weigao Wen · Hui Xue' · John Hopcroft 🔗 |
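A minimal sketch of the parameter-interpolation step described above: at the end of each epoch, set the weights to a convex combination of the previous epoch's weights and the current ones. The interpolation coefficient is a placeholder.

import copy
import torch

def interpolate_parameters(model, prev_state_dict, alpha=0.5):
    """theta <- alpha * theta_prev + (1 - alpha) * theta_current, in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.copy_(alpha * prev_state_dict[name] + (1 - alpha) * param)

# usage sketch inside an adversarial training loop:
# prev_state = copy.deepcopy(model.state_dict())
# ... run one epoch of adversarial training ...
# interpolate_parameters(model, prev_state, alpha=0.5)
# prev_state = copy.deepcopy(model.state_dict())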
-
|
Mathematical Theory of Adversarial Deep Learning
(
Poster
)
>
link
In this Show-and-Tell Demos paper, progress on mathematical theories for adversarial deep learning is reported. Firstly, achieving robust memorization for certain neural networks is shown to be an NP-hard problem. Furthermore, neural networks with $O(Nn)$ parameters are constructed for optimal robust memorization of any dataset with dimension $n$ and size $N$ in polynomial time. Secondly, adversarial training is formulated as a Stackelberg game and is shown to result in a network with optimal adversarial accuracy when the Carlini-Wagner margin loss is used. Finally, the bias classifier is introduced and is shown to be information-theoretically secure against the original-model gradient-based attack.
|
Xiao-Shan Gao · Lijia Yu · Shuang Liu 🔗 |
-
|
Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
(
Poster
)
>
link
Robust reinforcement learning (RL) seeks to train policies that can perform well under environment perturbations or adversarial attacks. Existing approaches typically assume that the space of possible perturbations remains the same across timesteps. However, in many settings, the space of possible perturbations at a given timestep depends on past perturbations. We formally introduce temporally-coupled perturbations, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially-observable two-player zero-sum game. By finding an approximate equilibrium in this game, GRAD ensures the agent's robustness against temporally-coupled perturbations. Empirical experiments on a variety of continuous control tasks demonstrate that our proposed approach exhibits significant robustness advantages compared to baselines against both standard and temporally-coupled attacks, in both state and action spaces. |
Yongyuan Liang · Yanchao Sun · Ruijie Zheng · Xiangyu Liu · Tuomas Sandholm · Furong Huang · Stephen Mcaleer 🔗 |
-
|
Navigating Graph Robust Learning against All-Intensity Attacks
(
Poster
)
>
link
Graph Neural Networks have demonstrated exceptional performance in a variety of graph learning tasks, but their vulnerability to adversarial attacks remains a major concern. Accordingly, many defense methods have been developed to learn robust graph representations and mitigate the impact of adversarial attacks. However, most of the existing methods suffer from two major drawbacks: (i) their robustness degrades under higher-intensity attacks, and (ii) they cannot scale to large graphs. In light of this, we develop a novel graph defense method to address these limitations. Our method first applies a denoising module to recover a cleaner graph by removing edges associated with attacked nodes; then, it utilizes Mixture-of-Experts to select differentially private noises of different magnitudes to counteract the node features attacked at different intensities. In addition, the overall design of our method avoids relying on heavy adjacency matrix computations like SVD, thus enabling the framework's applicability on large graphs. |
Xiangchi Yuan · Chunhui Zhang · Yijun Tian · Chuxu Zhang 🔗 |
-
|
Towards Out-of-Distribution Adversarial Robustness
(
Poster
)
>
link
Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain, and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4\% with the best existing baseline to 25.9\% on MNIST, and from 16.9\% to 23.5\% on CIFAR10.
|
Adam Ibrahim · Charles Guille-Escuret · Ioannis Mitliagkas · Irina Rish · David Krueger · Pouya Bashivan 🔗 |
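A hedged sketch of treating each attack type as a domain and applying a REx-style penalty, as described above: minimize the mean per-attack adversarial risk plus a penalty on the variance of those risks. The attack implementations and the penalty weight are placeholders for illustration.

import torch

def rex_adversarial_loss(model, loss_fn, x, y, attacks, beta=10.0):
    risks = []
    for attack in attacks:                        # each attack: (model, x, y) -> x_adv
        x_adv = attack(model, x, y)
        risks.append(loss_fn(model(x_adv), y))
    risks = torch.stack(risks)
    return risks.mean() + beta * risks.var()      # V-REx-style objective over attack "domains"

# usage sketch: attacks = [pgd_linf, pgd_l2, pgd_l1]; loss = rex_adversarial_loss(model, F.cross_entropy, x, y, attacks)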
-
|
Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
(
Poster
)
>
link
Recent neural architecture search (NAS) frameworks have been successful in finding optimal architectures for given conditions (e.g., performance or latency). However, they search for optimal architectures in terms of their performance on clean images only, while robustness against various types of perturbations or corruptions is crucial in practice. Although several robust NAS frameworks tackle this issue by integrating adversarial training into one-shot NAS, they are limited in that they only consider robustness against adversarial attacks and require significant computational resources to discover optimal architectures for a single task, which makes them impractical in real-world scenarios. To address these challenges, we propose a novel lightweight robust zero-cost proxy that considers the consistency across features, parameters, and gradients of both clean and perturbed images at the initialization state. Our approach facilitates an efficient and rapid search for neural architectures capable of learning generalizable features that exhibit robustness across diverse perturbations. The experimental results demonstrate that our proxy can rapidly and efficiently search for neural architectures that are consistently robust against various perturbations on multiple benchmark datasets and diverse search spaces, largely outperforming existing clean zero-shot NAS and robust NAS with reduced search cost. |
Hyeonjeong Ha · Minseon Kim · Sung Ju Hwang 🔗 |
-
|
Adversarial Robustness for Tabular Data through Cost and Utility Awareness
(
Poster
)
>
link
Many machine learning applications (credit scoring, fraud detection, etc.) use data in the tabular domains. Adversarial examples can be especially damaging for these applications. Yet, existing works on adversarial robustness mainly focus on machine-learning models in the image and text domains. We argue that due to the differences between tabular data and images or text, existing threat models are inappropriate for tabular domains. These models do not capture that cost can be more important than imperceptibility, nor that the adversary could ascribe different value to the utility obtained from deploying different adversarial examples. We show that due to these differences the attack and defense methods used for images and text cannot be directly applied to the tabular setup. We address these issues by proposing new cost and utility-aware threat models tailored to capabilities and constraints of attackers targeting tabular domains. We show that our approach is effective on two tabular datasets corresponding to applications for which attacks can have economic and social implications. |
Klim Kireev · Bogdan Kulynych · Carmela Troncoso 🔗 |
-
|
Scoring Black-Box Models for Adversarial Robustness
(
Poster
)
>
link
Deep neural networks are susceptible to adversarial inputs and various methods have been proposed to defend these models against adversarial attacks under different perturbation models. The robustness of models to adversarial attacks has been analyzed by first constructing adversarial inputs for the model, and then testing the model performance on the constructed adversarial inputs. Most of these attacks require white-box access to the model, need access to data labels, and can be computationally expensive to run. We propose a simple scoring method for black-box models which indicates their robustness to adversarial input. We show that adversarially more robust models have a smaller $l_1$-norm of LIME weights and sharper explanations.
|
Jian Vora · Pranay Reddy Samala 🔗 |
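A sketch of the proposed score under stated assumptions: explain a black-box model's predictions with LIME and use the $l_1$-norm of the explanation weights as a robustness indicator, a smaller norm suggesting a more robust model. The dataset, model, and averaging over a handful of points are illustrative; the `lime` package is required.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from lime.lime_tabular import LimeTabularExplainer

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

explainer = LimeTabularExplainer(X, discretize_continuous=False, mode="classification")

def lime_l1_score(predict_proba, X_ref, n_points=20):
    norms = []
    for x in X_ref[:n_points]:
        exp = explainer.explain_instance(x, predict_proba, num_features=X_ref.shape[1])
        weights = np.array([w for _, w in exp.as_list()])
        norms.append(np.abs(weights).sum())        # l1-norm of the LIME weights
    return float(np.mean(norms))                   # smaller score => (empirically) more robust

print(lime_l1_score(model.predict_proba, X))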
-
|
When Can Linear Learners be Robust to Indiscriminate Poisoning Attacks?
(
Poster
)
>
link
We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error. Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners. For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum risk of the induced model at a given poisoning budget. Our results prove that linear learners can indeed be robust to indiscriminate poisoning if the class-wise data distributions are well-separated with low variance and the size of the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in empirical attack performance of the state-of-the-art poisoning attacks across benchmark datasets, making an important initial step towards understanding the underlying reasons some learning tasks are vulnerable to data poisoning attacks. |
Fnu Suya · Xiao Zhang · Yuan Tian · David Evans 🔗 |
-
|
Context-Aware Self-Adaptation for Domain Generalization
(
Poster
)
>
link
Domain generalization aims at developing suitable learning algorithms in source training domains such that the model learned can generalize well on a different unseen testing domain. We present a novel two-stage approach called Context-Aware Self-Adaptation (CASA) for domain generalization. CASA simulates an approximate meta-generalization scenario and incorporates a self-adaptation module to adjust pre-trained meta-source models to the meta-target domains while maintaining their predictive capability on the meta-source domains. The core concept of self-adaptation involves leveraging contextual information, such as the mean of mini-batch features, as domain knowledge to automatically adapt a model trained in the first stage to new contexts in the second stage. Lastly, we utilize an ensemble of multiple meta-source models to perform inference on the testing domain. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on standard benchmarks. |
Hao Yan · Yuhong Guo 🔗 |
-
|
Label Noise: Correcting a Correction Loss
(
Poster
)
>
link
Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels. To address this issue, researchers have explored alternative loss functions that aim to be more robust. However, many of these alternatives are heuristic in nature and still vulnerable to overfitting or underfitting. In this work, we propose a more direct approach to tackling overfitting caused by label noise. We observe that the presence of label noise implies a lower bound on the noisy generalised risk. Building upon this observation, we propose imposing a lower bound on the empirical risk during training to mitigate overfitting. Our main contribution is providing theoretical results that yield explicit, easily computable bounds on the minimum achievable noisy risk for different loss functions. We empirically demonstrate that using these bounds significantly enhances robustness in various settings, with virtually no additional computational cost. |
William Toner · Amos Storkey 🔗 |
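A minimal sketch of the idea described above: impose a lower bound $b$ on the empirical risk so the optimizer stops descending once the (noisy) training loss reaches the minimum achievable noisy risk. This is a flooding-style correction; the bound value below is a placeholder, whereas the paper's contribution is computing it explicitly for different losses and noise levels.

import torch

def bounded_risk(loss, b):
    """Flooding-style correction: gradients push the loss toward b, not below it."""
    return (loss - b).abs() + b

# usage sketch inside a training step:
# loss = criterion(model(x), noisy_y)
# bounded_risk(loss, b=0.35).backward()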
-
|
Robust Semantic Segmentation: Strong Adversarial Attacks and Fast Training of Robust Models
(
Poster
)
>
link
While a large amount of work has focused on designing adversarial attacks against image classifiers, only a few methods exist to attack semantic segmentation models. We show that attacking segmentation models presents task-specific challenges, for which we propose novel solutions. Our final evaluation protocol outperforms existing methods, and shows that those can overestimate the robustness of the models. Additionally, so far adversarial training, the most successful way for obtaining robust image classifiers, could not be successfully applied to semantic segmentation. We argue that this is because the task to be learned is more challenging, and requires significantly higher computational effort than for image classification. As a remedy, we show that by taking advantage of recent advances in robust ImageNet classifiers, one can train adversarially robust segmentation models at limited computational cost by fine-tuning robust backbones. |
Francesco Croce · Naman Singh · Matthias Hein 🔗 |
-
|
Model-tuning Via Prompts Makes NLP Models Adversarially Robust
(
Poster
)
>
link
In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations, such as word-level synonym substitutions. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than modifying the model (by appending an MLP head), MVP instead modifies the input (by appending a prompt template). Across three classification datasets, MVP improves performance against adversarial word-level synonym substitutions by an average of 8% over standard methods and even outperforms adversarial training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in robust accuracy while maintaining clean accuracy. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of MLP-FT can be attributed to the misalignment between pre-training and fine-tuning tasks, and the randomly initialized MLP parameters. |
Mrigank Raman · Pratyush Maini · Zico Kolter · Zachary Lipton · Danish Pruthi 🔗 |
-
|
Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
(
Poster
)
>
link
One of the remarkable properties of robust computer vision models is that their input-gradients are often aligned with human perception, referred to in the literature as perceptually-aligned gradients (PAGs). However, the underlying mechanisms behind these phenomena remain unknown. In this work, we provide a first explanation of PAGs via \emph{off-manifold robustness}, which states that models must be more robust off the data manifold than they are on-manifold. We first demonstrate theoretically that off-manifold robustness leads input gradients to lie approximately on the data manifold, explaining their perceptual alignment, and then confirm the same empirically for models trained with robustness regularizers. Quantifying the perceptual alignment of model gradients via their similarity with the gradients of generative models, we show that off-manifold robustness correlates well with perceptual alignment. Finally, based on the levels of on- and off-manifold robustness, we identify three different regimes of robustness that affect both perceptual alignment and model accuracy: weak robustness, Bayes-aligned robustness, and excessive robustness. |
Suraj Srinivas · Sebastian Bordt · Himabindu Lakkaraju 🔗 |
-
|
Refined and Enriched Physics-based Captions for Unseen Dynamic Changes
(
Poster
)
>
link
Vision-Language Models (VLMs), i.e., models such as CLIP trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be dealt with by transferring semantic knowledge from seen classes with the help of language models pre-trained only on texts. Two-dimensional spatial relationships and a higher semantic level have been achieved. Moreover, Visual-Question-Answer (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP, which refines and enriches qualitative and quantitative captions to bring them closer to what humans recognize by combining multiple DLs and VLMs. In particular, captions with physical scales and objects' surface properties are integrated through counting, visibility distance, and road conditions. Fine-tuned VLM models are also used, along with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental results on images with adversarial weather conditions, i.e., rain, snow, fog, landslide, and flooding, and traffic events, i.e., accidents, outperform state-of-the-art DLs and VLMs, showing a higher semantic level in captions for real-world scene descriptions. |
Hidetomo Sakaino 🔗 |
-
|
Adaptive Certified Training: Towards Better Accuracy-Robustness Tradeoffs
(
Poster
)
>
link
As deep learning models continue to advance and are increasingly utilized in real-world systems, the issue of robustness remains a major challenge. Existing certified training methods produce models that achieve high provable robustness guarantees at certain perturbation levels. However, the main problem of such models is a dramatically low standard accuracy, i.e. accuracy on clean unperturbed data, that makes them impractical. In this work, we consider a more realistic perspective of maximizing the robustness of a model at certain levels of (high) standard accuracy. To this end, we propose a novel certified training method based on a key insight that training with adaptive certified radii helps to improve both the accuracy and robustness of the model, advancing state-of-the-art accuracy-robustness tradeoffs. We demonstrate the effectiveness of the proposed method on MNIST, CIFAR-10, and TinyImageNet datasets. Particularly, on CIFAR-10 and TinyImageNet, our method yields models with up to two times higher robustness, measured as an average certified radius of a test set, at the same levels of standard accuracy compared to baseline approaches. |
Zhakshylyk Nurlanov · Frank R Schmidt · Florian Bernard 🔗 |
-
|
Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers
(
Poster
)
>
link
Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of temporal consistency makes them \textit{detectable} using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce \textit{perfect illusory attacks}, a novel form of adversarial attack on sequential decision-makers that is both effective and provably \textit{statistically undetectable}. We then propose the more versatile $\epsilon$-illusory attacks, which result in observation transitions that are consistent with the state-transition function of the adversary-free environment and can be learned end-to-end. Compared to existing attacks, we empirically find $\epsilon$-illusory attacks to be significantly harder to detect with automated methods, and a small study with human subjects (IRB approval under reference xxxxxx/xxxxx) suggests they are similarly harder to detect for humans. We propose that undetectability should be a central concern in the study of adversarial attacks on mixed-autonomy settings. |
Tim Franzmeyer · Stephen Mcaleer · Joao Henriques · Jakob Foerster · Phil Torr · Adel Bibi · Christian Schroeder 🔗 |
-
|
Certified Calibration: Bounding Worst-Case Calibration under Adversarial Attacks
(
Poster
)
>
link
Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, certification methods have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. However, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier Score or the Expected Calibration Error. We show that attacks can significantly harm calibration, and thus propose certified calibration, providing worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds via the solution of a mixed-integer program on the Expected Calibration Error. |
Cornelius Emde · Francesco Pinto · Thomas Lukasiewicz · Phil Torr · Adel Bibi 🔗 |
-
|
Don't trust your eyes: on the (un)reliability of feature visualizations
(
Poster
)
>
link
How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding by theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include black-box neural networks. |
Robert Geirhos · Roland S. Zimmermann · Blair Bilodeau · Wieland Brendel · Been Kim 🔗 |
-
|
Classifier Robustness Enhancement Via Test-Time Transformation
(
Poster
)
>
link
It has been recently discovered that adversarially trained classifiers exhibit an intriguing property, referred to as perceptually aligned gradients (PAG). PAG implies that the gradients of such classifiers possess a meaningful structure, aligned with human perception. Adversarial training is currently the best-known way to achieve classification robustness under adversarial attacks. The PAG property, however, has yet to be leveraged for further improving classifier robustness. In this work, we introduce Classifier Robustness Enhancement Via Test-Time Transformation (TETRA) -- a novel defense method that utilizes PAG, enhancing the performance of trained robust classifiers. Our method operates in two phases. First, it modifies the input image via a designated targeted adversarial attack into each of the dataset's classes. Then, it classifies the input image based on the distance to each of the modified instances, with the assumption that the shortest distance relates to the true class. We show that the proposed method achieves state-of-the-art results and validate our claim through extensive experiments on a variety of defense methods, classifier architectures, and datasets. We also empirically demonstrate that TETRA can boost the accuracy of any differentiable adversarial training classifier across a variety of attacks, including ones unseen at training. Specifically, applying TETRA leads to substantial improvement of up to $+23\%$, $+20\%$, and $+26\%$ on CIFAR10, CIFAR100, and ImageNet, respectively.
|
Tsachi Blau · Roy Ganz · Chaim Baskin · Michael Elad · Alex Bronstein 🔗 |
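A hedged sketch of the two-phase procedure described above: push the input toward each class with a targeted attack on a robust classifier, then predict the class whose targeted perturbation required the smallest change. The targeted-PGD step count, step size, and distance measure are illustrative choices, not TETRA's exact settings.

import torch
import torch.nn.functional as F

def targeted_perturb(model, x, target, steps=20, step_size=0.01):
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv - step_size * grad.sign()).detach().requires_grad_(True)   # move toward the target class
    return x_adv.detach()

def tetra_style_predict(model, x, num_classes):
    distances = []
    for c in range(num_classes):
        target = torch.full((x.shape[0],), c, dtype=torch.long)
        x_c = targeted_perturb(model, x, target)
        distances.append((x_c - x).flatten(1).norm(dim=1))      # how far the input had to move
    return torch.stack(distances, dim=1).argmin(dim=1)           # closest class wins

# toy usage with a stand-in classifier
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(2, 3, 32, 32)
print(tetra_style_predict(model, x, num_classes=10))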
-
|
CertViT: Certified Robustness of Pre-Trained Vision Transformers
(
Poster
)
>
link
Lipschitz bounded neural networks are certifiably robust and have a good trade-off between clean and certified accuracy. Existing Lipschitz bounding methods train from scratch and are limited to moderately sized networks (< 6M parameters). They require a fair amount of hyper-parameter tuning and are computationally prohibitive for large networks like Vision Transformers (5M to 660M parameters). Obtaining certified robustness of transformers is not feasible due to the non-scalability and inflexibility of the current methods. This work presents CertViT, a two-step proximal-projection method to achieve certified robustness from pre-trained weights. The proximal step tries to lower the Lipschitz bound and the projection step tries to maintain the clean accuracy of pre-trained weights. We show that CertViT networks have better certified accuracy than state-of-the-art Lipschitz trained networks. We apply CertViT on several variants of pre-trained vision transformers and show adversarial robustness using standard attacks. Code : \url{https://github.com/sagarverma/transformer-lipschitz} |
Kavya Gupta · Sagar Verma 🔗 |
-
|
Transferable Adversarial Perturbations between Self-Supervised Speech Recognition Models
(
Poster
)
>
link
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition (ASR) system to output attacker-chosen text. To exploit ASR models in real-world, black-box settings, an adversary can leverage the \textit{transferability} property, i.e. that an adversarial sample produced for a proxy ASR can also fool a different remote ASR. Recent work has shown that transferability against large ASR models is extremely difficult. In this work, we show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are uniquely affected by transferability. We successfully demonstrate this phenomenon by evaluating state-of-the-art self-supervised ASR models like Wav2Vec2, HuBERT, Data2Vec and WavLM. We show that with relatively low-level additive noise achieving a 30 dB signal-to-noise ratio, we can achieve target transferability with up to 80\% accuracy. We then use an ablation study to show that Self-Supervised learning is a major cause of that phenomenon. Our results present a dual interest: they show that modern ASR architectures are uniquely vulnerable to adversarial security threats, and they help in understanding the specificities of SSL training paradigms. |
Raphaël Olivier · Hadi Abdullah · Bhiksha Raj 🔗 |
-
|
Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change
(
Poster
)
>
link
Adversarial robustness poses a critical challenge in the deployment of deep learning models for real-world applications. Traditional approaches to adversarial training and supervised detection rely on prior knowledge of attack types and access to labeled training data, which is often impractical. Existing unsupervised adversarial detection methods identify whether the target model works properly, but they suffer from poor accuracy owing to the use of the common cross-entropy training loss, which relies on unnecessary features and strengthens adversarial attacks. We propose new training losses that reduce reliance on useless features, together with a corresponding detection method that requires no prior knowledge of adversarial attacks. The detection rate (true positive rate) against all given white-box attacks is above 93.9\% except for attacks without limits (DF($\infty$)), while the false positive rate is barely 2.5\%. The proposed method works well across all tested attack types, and its false positive rates are even better than those of methods specialized for certain types.
|
Chien Cheng Chyou · Hung-Ting Su · Winston Hsu 🔗 |
-
|
Stabilizing GNN for Fairness via Lipschitz Bounds
(
Poster
)
>
link
The Lipschitz bound, a technique from robust statistics, limits the maximum changes in output with respect to the input, considering associated irrelevant biased factors. It provides an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. However, there has been no previous research investigating the Lipschitz bounds for Graph Neural Networks (GNNs), especially in the context of non-Euclidean data with inherent biases. This poses a challenge for constraining GNN output perturbations induced by input biases and ensuring fairness during training. This paper addresses this gap by formulating a Lipschitz bound for GNNs operating on attributed graphs, and analyzing how the Lipschitz constant can constrain output perturbations induced by biases for fairness training. The effectiveness of the Lipschitz bound is experimentally validated in limiting model output biases. Additionally, from a training dynamics perspective, we demonstrate how the theoretical Lipschitz bound can effectively guide GNN training to balance accuracy and fairness. |
Yaning Jia · Chunhui Zhang 🔗 |
-
|
Equal Long-term Benefit Rate: Adapting Static Fairness Notions to Sequential Decision Making
(
Poster
)
>
link
Decisions made by machine learning models may have lasting impacts over time, making long-term fairness a crucial consideration. It has been shown that when ignoring the long-term effect of decisions, naively imposing fairness criterion in static settings can actually exacerbate bias over time. To explicitly address biases in sequential decision-making, recent works formulate long-term fairness notions in Markov Decision Process (MDP) framework. They define the long-term bias to be the sum of static bias over each time step. However, we demonstrate that naively summing up the step-wise bias can cause a false sense of fairness since it fails to consider the importance difference of states during transition. In this work, we introduce a new long-term fairness notion called Equal Long-term Benefit Rate (ELBERT), which explicitly considers state importance and can preserve the semantics of static fairness principles in the sequential setting. Moreover, we show that the policy gradient of Long-term Benefit Rate can be analytically reduced to standard policy gradient. This makes standard policy optimization methods applicable for reducing the bias, leading to our proposed bias mitigation method ELBERT-PO. Experiments on three dynamical environments show that ELBERT-PO successfully reduces bias and maintains high utility. |
Yuancheng Xu · Chenghao Deng · Yanchao Sun · Ruijie Zheng · xiyao wang · Jieyu Zhao · Furong Huang 🔗 |
-
|
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
(
Poster
)
>
link
Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. The capabilities of these instruction-following LLMs provide strong economic incentives for dual-use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at a cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and addressing these attacks may require new approaches to mitigations. |
Daniel Kang · Xuechen Li · Ion Stoica · Carlos Guestrin · Matei Zaharia · Tatsunori Hashimoto 🔗 |
-
|
Certifying Ensembles: A General Certification Theory with S-Lipschitzness
(
Poster
)
>
link
Improving and guaranteeing the robustness of deep learning models has been a topic of intense research. Ensembling, which combines several classifiers to provide a better model, has been shown to be beneficial for generalisation, uncertainty estimation, calibration, and mitigating the effects of concept drift. However, the impact of ensembling on certified robustness is less well understood. In this work, we generalise Lipschitz continuity by introducing S-Lipschitz classifiers, which we use to analyse the theoretical robustness of ensembles. Our results give precise conditions under which ensembles of robust classifiers are more robust than any constituent classifier, as well as conditions under which they are less robust. |
Aleksandar Petrov · Francisco Eiras · Amartya Sanyal · Phil Torr · Adel Bibi 🔗 |
-
|
On the Limitations of Model Stealing with Uncertainty Quantification Models
(
Poster
)
>
link
Model stealing aims at inferring a victim model's functionality at a fraction of the original training cost. While the goal is clear, in practice the model's architecture, weight dimension, and original training data cannot be determined exactly, leading to mutual uncertainty during stealing. In this work, we explicitly tackle this uncertainty by generating multiple possible networks and combining their predictions to improve the quality of the stolen model. For this, we compare five popular uncertainty quantification models in a model stealing task. Surprisingly, our results indicate that the considered models only lead to marginal improvements in terms of label agreement (i.e., fidelity) to the stolen model. To find the cause of this, we inspect the diversity of the models' predictions by looking at the prediction variance as a function of training iterations. We realize that during training, the models tend to have similar predictions, indicating that the network diversity we wanted to leverage using uncertainty quantification models is not (high) enough for improvements on the model stealing task. |
David Pape · Sina Däubener · Thorsten Eisenhofer · Antonio Emanuele Cinà · Lea Schönherr 🔗 |
-
|
PAC-Bayesian Adversarially Robust Generalization Bounds for Deep Neural Networks
(
Poster
)
>
link
Deep neural networks (DNNs) are vulnerable to adversarial attacks. It is found empirically that adversarially robust generalization is crucial in establishing defense algorithms against adversarial attacks. Therefore, it is interesting to study the theoretical guarantee of robust generalization. This paper focuses on PAC-Bayes analysis (Neyshabur et al., 2017). The main challenge lies in extending the key ingredient, which is a weight perturbation bound in standard settings, to the robust settings. Existing attempts heavily rely on additional strong assumptions, leading to loose bounds. In this paper, we address this issue and provide a spectrally-normalized robust generalization bound for DNNs. Our bound is at least as tight as the standard generalization bound, differing only by a factor of the perturbation strength $\epsilon$. In comparison to existing robust generalization bounds, our bound offers two significant advantages: 1) it does not depend on additional assumptions, and 2) it is considerably tighter. We present a framework that enables us to derive more general results. Specifically, we extend the main result to 1) adversarial robustness against general non-$\ell_p$ attacks, and 2) other neural network architectures, such as ResNet.
|
Jiancong Xiao · Ruoyu Sun · Zhi-Quan Luo 🔗 |
-
|
Sentiment Perception Adversarial Attacks on Neural Machine Translation Systems
(
Poster
)
>
link
With the advent of deep learning methods, Neural Machine Translation (NMT) systems have become increasingly powerful. However, deep learning based systems are susceptible to adversarial attacks, where imperceptible changes to the input can cause undesirable changes at the output of the system. To date there has been little work investigating adversarial attacks on sequence-to-sequence systems, such as NMT models. Previous work in NMT has examined attacks with the aim of introducing target phrases in the output sequence. In this work, adversarial attacks for NMT systems are explored from an output perception perspective. Thus the aim of an attack is to change the perception of the output sequence, without altering the perception of the input sequence. For example, an adversary may distort the sentiment of translated reviews to have an exaggerated positive sentiment. In practice it is challenging to run extensive human perception experiments, so a proxy deep-learning classifier applied to the NMT output is used to measure perception changes. Experiments demonstrate that the sentiment perception of NMT systems' output sequences can be changed significantly with small imperceptible changes to input sequences. |
Vyas Raina · Mark Gales 🔗 |
-
|
(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
(
Poster
)
>
link
We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give \emph{estimates} that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. Our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100\% of the time. The bound is inspired by $\mathcal{H}\Delta\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous guarantees. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a "disagreement loss" which is theoretically justified and performs better in practice. Across a wide range of benchmarks, our method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines.
|
Elan Rosenfeld · Saurabh Garg 🔗 |
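The bound above hinges on training one classifier (a critic) to agree with a reference model on labeled source data while disagreeing with it on unlabeled shifted data. The sketch below shows that general recipe with a plain negated cross-entropy as the disagreement term; the paper's specific "disagreement loss" is not reproduced here, and the tensor shapes and module names are assumptions.

```python
import torch
import torch.nn.functional as F

def agree_disagree_objective(critic, model, x_src, y_src, x_tgt):
    """Train a critic for a disagreement-discrepancy style bound (sketch).

    The critic is pushed to agree with the (frozen) reference model on labeled
    source data and to disagree with it on unlabeled target data. A plain
    negated cross-entropy is used as the disagreement term here; the paper
    proposes a better-behaved disagreement loss.
    """
    with torch.no_grad():
        tgt_pseudo = model(x_tgt).argmax(dim=1)      # reference predictions on target
    agree = F.cross_entropy(critic(x_src), y_src)    # match ground truth on source
    disagree = -F.cross_entropy(critic(x_tgt), tgt_pseudo)  # push away from reference on target
    return agree + disagree
```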
-
|
Feature Partition Aggregation: A Fast Certified Defense Against a Union of $\ell_0$ Attacks
(
Poster
)
>
link
Sparse or $\ell_0$ adversarial attacks arbitrarily perturb an unknown subset of the features. $\ell_0$ robustness analysis is particularly well-suited for heterogeneous (tabular) data where features have different types or scales. State-of-the-art $\ell_0$ certified defenses are based on randomized smoothing and apply to evasion attacks only. This paper proposes feature partition aggregation (FPA) - a certified defense against the union of $\ell_0$ evasion, backdoor, and poisoning attacks. FPA generates its stronger robustness guarantees via an ensemble whose submodels are trained on disjoint feature sets. Compared to state-of-the-art $\ell_0$ defenses, FPA is up to $3,000\times$ faster and provides median robustness guarantees up to $4\times$ larger, meaning FPA provides the additional dimensions of robustness essentially for free.
|
Zayd S Hammoudeh · Daniel Lowd 🔗 |
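A minimal sketch of the feature-partition-and-vote idea: split the features into disjoint blocks, train one submodel per block, and read a certified number of tolerable perturbed features off the vote gap, since each perturbed feature can flip at most one vote. The logistic-regression submodels and the floor-of-half-the-gap certificate are illustrative simplifications, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fpa(X, y, n_blocks, seed=0):
    """Train one submodel per disjoint feature block (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(X.shape[1]), n_blocks)
    models = [LogisticRegression(max_iter=1000).fit(X[:, b], y) for b in blocks]
    return blocks, models

def predict_with_certificate(blocks, models, x):
    """Plurality vote with a rough l0 certificate.

    An l0 adversary must flip roughly half the vote gap between the top class
    and the runner-up, so about gap // 2 perturbed features are tolerated;
    exact tie-breaking is handled more carefully in the paper.
    """
    votes = np.array([m.predict(x[b].reshape(1, -1))[0] for b, m in zip(blocks, models)])
    labels, counts = np.unique(votes, return_counts=True)
    order = np.argsort(counts)[::-1]
    runner_up = counts[order[1]] if len(counts) > 1 else 0
    certified_radius = (counts[order[0]] - runner_up) // 2
    return labels[order[0]], certified_radius
```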
-
|
Near Optimal Adversarial Attack on UCB Bandits
(
Poster
)
>
link
I study a stochastic multi-arm bandit problem where rewards are subject to adversarial corruption. At each round, the learner chooses an arm, and a stochastic reward is generated. The adversary strategically adds corruption to the reward, and the learner is only able to observe the corrupted reward at each round. I propose a novel attack strategy that manipulates a learner employing the upper-confidence-bound (UCB) algorithm into pulling some non-optimal target arm $T - o(T)$ times with a cumulative cost that scales as $\widehat{O}(\sqrt{\log T})$, where $T$ is the number of rounds. I also prove the first lower bound on the cumulative attack cost. The lower bound matches the upper bound up to $O(\log \log T)$ factors, showing the proposed attack strategy to be near optimal.
|
Shiliang Zuo 🔗 |
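To make the attack pattern concrete, the sketch below corrupts the reward of every non-target arm so that its running mean stays below the target arm's mean by a shrinking margin, which keeps a UCB learner returning to the target arm. The margin schedule and the unbounded corruption are illustrative simplifications of the analyzed strategy.

```python
import numpy as np

class RewardCorruptingAdversary:
    """Corrupt rewards of non-target arms so a UCB learner favors the target arm.

    Illustrative sketch: the corruption drags each non-target arm's observed
    mean below the target arm's mean minus an assumed margin schedule; the
    analyzed attack additionally bounds the total corruption.
    """
    def __init__(self, n_arms, target_arm):
        self.target = target_arm
        self.sums = np.zeros(n_arms)
        self.counts = np.zeros(n_arms)

    def corrupt(self, arm, reward, t):
        self.counts[arm] += 1
        self.sums[arm] += reward
        if arm == self.target or self.counts[self.target] == 0:
            return reward  # never corrupt the target arm
        target_mean = self.sums[self.target] / self.counts[self.target]
        margin = np.sqrt(np.log(t + 2) / self.counts[arm])     # assumed schedule
        desired_mean = target_mean - 2 * margin
        # corrupted reward that makes the arm's running mean equal desired_mean
        corrupted = desired_mean * self.counts[arm] - (self.sums[arm] - reward)
        self.sums[arm] += corrupted - reward                    # track what the learner sees
        return corrupted
```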
-
|
Learning Exponential Families from Truncated Samples
(
Poster
)
>
link
Missing data problems have many manifestations across many scientific fields. A fundamental type of missing data problem arises when samples are \textit{truncated}, i.e., samples that lie in a subset of the support are not observed. Statistical estimation from truncated samples is a classical problem in statistics which dates back to Galton, Pearson, and Fisher. A recent line of work provides the first efficient estimation algorithms for the parameters of a Gaussian distribution and for linear regression with Gaussian noise. In this paper we generalize these results to log-concave exponential families. We provide an estimation algorithm that shows that \textit{extrapolation} is possible for a much larger class of distributions while it maintains a polynomial sample and time complexity. Our algorithm is based on Projected Stochastic Gradient Descent and is not only applicable in a more general setting but is also simpler and more efficient than recent algorithms. Our work also has interesting implications for learning general log-concave distributions and sampling given only access to truncated data. |
Jane Lee · Andre Wibisono · Manolis Zampetakis 🔗 |
-
|
Identifying Adversarially Attackable and Robust Samples
(
Poster
)
>
link
Adversarial attacks insert small, imperceptible perturbations to input samples that cause large, undesired changes to the output of deep learning models. Despite extensive research on generating adversarial attacks and building defense systems, there has been limited research on understanding adversarial attacks from an input-data perspective. This work introduces the notion of sample attackability, where we aim to identify samples that are most susceptible to adversarial attacks (attackable samples) and conversely also identify the least susceptible samples (robust samples). We propose a deep-learning-based detector to identify the adversarially attackable and robust samples in an unseen dataset for an unseen target model. Experiments on standard image classification datasets enable us to assess the portability of the deep attackability detector across a range of architectures. We find that the deep attackability detector performs better than simple model uncertainty-based measures for identifying the attackable/robust samples. This suggests that uncertainty is an inadequate proxy for measuring sample distance to a decision boundary. In addition to better understanding adversarial attack theory, it is found that the ability to identify the adversarially attackable and robust samples has implications for improving the efficiency of sample-selection tasks. |
Vyas Raina · Mark Gales 🔗 |
-
|
Toward Testing Deep Learning Library via Model Fuzzing
(
Poster
)
>
link
The increasing adoption of deep learning (DL) technologies in safety-critical industries has brought about a corresponding rise in security challenges. However, the security of DL frameworks (TensorFlow, PyTorch, PaddlePaddle), which serve as the foundation of various DL models, has not garnered the attention it rightfully deserves. Vulnerabilities in DL frameworks can cause significant security risks such as compromised model reliability and data leakage. In this research project, we address this challenge with a specifically designed model fuzzing method. First, we generate diverse models to test library implementations in the training and prediction phases using optimized mutation strategies. Furthermore, we use a seed performance score, including coverage, discovery time, and mutation count, to prioritize the selection of model seeds. Our algorithm also selects the optimal mutation strategy based on heuristics to expand inconsistencies. Finally, to evaluate the effectiveness of our scheme, we implement our test framework and conduct experiments on existing DL frameworks. The preliminary results demonstrate that this is a promising direction. |
Wei Kong · huayang cao · Tong Wang · Yuanping Nie · hu li · Xiaohui Kuang 🔗 |
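A compact sketch of the differential-testing loop described above: mutate a seed model specification, execute it on two framework backends, and record large output disagreements as findings. The `build_and_run` callable and the toy mutation operator are hypothetical placeholders for the framework-specific plumbing and the optimized mutation strategies.

```python
import random
import numpy as np

MUTATIONS = ["swap_layer_order", "change_activation", "perturb_hyperparams"]  # illustrative

def apply_mutation(spec, kind):
    """Toy mutation: tweak one numeric hyperparameter in the model spec."""
    mutated = dict(spec)
    if kind == "perturb_hyperparams":
        mutated["hidden_units"] = max(1, spec.get("hidden_units", 32) + random.choice([-8, 8]))
    # other mutation kinds are omitted in this sketch
    return mutated

def fuzz_frameworks(seed_specs, build_and_run, n_iters=100, tol=1e-3):
    """Differential fuzzing sketch for DL libraries.

    `build_and_run(spec, framework, x)` is a hypothetical helper that builds
    the model described by `spec` (a dict with an "input_shape" entry) in the
    given framework and returns its outputs on input x.
    """
    findings, queue = [], list(seed_specs)
    for _ in range(n_iters):
        spec = random.choice(queue)
        mutated = apply_mutation(spec, random.choice(MUTATIONS))
        x = np.random.randn(1, *mutated["input_shape"]).astype("float32")
        out_a = build_and_run(mutated, "framework_a", x)
        out_b = build_and_run(mutated, "framework_b", x)
        if not np.allclose(out_a, out_b, atol=tol):
            findings.append((mutated, float(np.max(np.abs(out_a - out_b)))))
        queue.append(mutated)  # real seed scheduling would use coverage and discovery time
    return findings
```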
-
|
Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey
(
Poster
)
>
link
Explainable artificial intelligence (XAI) methods are portrayed as a remedy for debugging and trusting statistical and deep learning models, as well as interpreting their predictions. However, recent advances in adversarial machine learning highlight the limitations and vulnerabilities of state-of-the-art explanations, putting their security and trustworthiness into question. The possibility of manipulating, fooling or fairwashing evidence of the model's reasoning has detrimental consequences when applied in high-stakes decision-making and knowledge discovery. This concise survey of over 50 papers summarizes research concerning adversarial attacks on explanations of machine learning models, as well as fairness metrics. We discuss how to defend against attacks and design robust interpretation methods. We contribute a list of existing insecurities in XAI and outline the emerging research directions in adversarial XAI (AdvXAI). |
Hubert Baniecki · Przemyslaw Biecek 🔗 |
-
|
Sharpness-Aware Minimization Alone can Improve Adversarial Robustness
(
Poster
)
>
link
Sharpness-Aware Minimization (SAM) is an effective method for improving generalization ability by regularizing loss sharpness. In this paper, we explore SAM in the context of adversarial robustness. We find that using only SAM can achieve superior adversarial robustness without sacrificing clean accuracy compared to standard training, which is an unexpected benefit. We also discuss the relation between SAM and adversarial training (AT), a popular method for improving the adversarial robustness of DNNs. In particular, we show that SAM and AT differ in terms of perturbation strength, leading to different accuracy and robustness trade-offs. We provide theoretical evidence for these claims in a simplified model. Finally, while AT suffers from decreased clean accuracy and computational overhead, we suggest that SAM can be regarded as a lightweight substitute for AT under certain requirements. Code is available at https://github.com/weizeming/SAM_AT. |
Zeming Wei · Jingyu Zhu · Yihao Zhang 🔗 |
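For readers unfamiliar with the mechanics, SAM first ascends a distance rho along the loss gradient, evaluates the gradient at those perturbed weights, restores the weights, and applies that gradient as the update. The PyTorch sketch below is a minimal single-step version and does not reproduce the paper's comparison with adversarial training.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step (minimal sketch).

    1) compute gradients and climb to approximately the worst nearby weights,
    2) compute gradients there, 3) restore the weights and apply the update.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(p=2) for p in model.parameters() if p.grad is not None]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                           # ascend to the sharp point
            eps.append(e)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()             # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                       # restore the original weights
    optimizer.step()                            # update with the SAM gradient
    optimizer.zero_grad()
    return loss.item()
```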
-
|
On feasibility of intent obfuscating attacks
(
Poster
)
>
link
Intent obfuscation is a common tactic in adversarial situations, enabling the attacker to both manipulate the target system and avoid culpability. Surprisingly, it has rarely been implemented in adversarial attacks on machine learning systems. We are the first to propose incorporating intent obfuscation in generating adversarial examples for object detectors: by perturbing another non-overlapping object to disrupt the target object, the attacker hides their intended target. We conduct a randomized experiment on 5 prominent detectors---YOLOv3, SSD, RetinaNet, Faster R-CNN, and Cascade R-CNN---using both targeted and untargeted attacks and achieve success on all models and attacks. We analyze the success factors characterizing intent obfuscating attacks, including target object confidence and perturb object sizes. We then demonstrate that the attacker can exploit these success factors to increase success rates for all models and attacks. Finally, we discuss known defenses and legal repercussions. |
ZhaoBin Li · Patrick Shafto 🔗 |
-
|
Adversarial Training with Generated Data in High-Dimensional Regression: An Asymptotic Study
(
Poster
)
>
link
In recent years, studies such as (Carmon et al., 2019; Gowal et al., 2021; Xing et al., 2022) have demonstrated that incorporating additional real or generated data with pseudo-labels can enhance adversarial training through a two-stage training approach. In this paper, we perform a theoretical analysis of the asymptotic behavior of this method in high-dimensional linear regression. While a double-descent phenomenon can be observed in ridgeless training, with an appropriate $\mathcal{L}_2$ regularization, the two-stage adversarial training achieves a better performance. Finally, we derive a shortcut cross-validation formula specifically tailored for the two-stage training method.
|
Yue Xing 🔗 |
-
|
Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance
(
Oral
)
>
link
The reliability of post-training quantization (PTQ) methods in the face of extreme cases such as distribution shift and data noise remains largely unexplored, despite the popularity of PTQ as a method for compressing deep neural networks (DNNs) without altering their original architecture or training procedures. This paper conducts an investigation on commonly-used PTQ methods, addressing research questions pertaining to the impact of calibration set distribution variations, calibration paradigm selection, and data augmentation or sampling strategies on the reliability of PTQ. Through a systematic evaluation process encompassing various tasks and commonly-used PTQ paradigms, it is evident that the majority of existing PTQ methods lack the necessary reliability for worst-case group performance, underscoring the imperative for more robust approaches. |
🔗 |
-
|
Establishing a Benchmark for Adversarial Robustness of Compressed Deep Learning Models after Pruning
(
Oral
)
>
link
The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource-constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bridges this gap, seeking to understand the effect of adversarial inputs crafted for base models on their pruned versions. To examine this relationship, we have developed a comprehensive benchmark across diverse adversarial attacks and popular DNN models. We uniquely focus on models not previously exposed to adversarial training and apply pruning schemes optimized for accuracy and performance. Our findings reveal that while the benefits of pruning -- enhanced generalizability, compression, and faster inference times -- are preserved, adversarial robustness remains comparable to the base model. This suggests that model compression, while offering its unique advantages, does not undermine adversarial robustness. |
🔗 |
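The benchmark protocol above amounts to: craft adversarial examples against the dense base model, prune a copy of the model, and compare clean accuracy with accuracy on the transferred adversarial inputs. A minimal sketch using global magnitude pruning is shown below; the `attack` callable stands in for whichever attack library is used.

```python
import copy
import torch
import torch.nn.utils.prune as prune

def benchmark_pruned_robustness(model, attack, loader, sparsity=0.5, device="cpu"):
    """Compare clean and transferred-adversarial accuracy of a pruned copy (sketch).

    `attack(model, x, y)` is assumed to return adversarial examples crafted
    against the *base* model, e.g. a PGD routine from an attack library.
    """
    pruned = copy.deepcopy(model)
    params = [(m, "weight") for m in pruned.modules()
              if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=sparsity)

    stats = {"clean": 0, "adv": 0, "total": 0}
    pruned.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack(model, x, y)                  # crafted on the dense base model
        with torch.no_grad():
            stats["clean"] += (pruned(x).argmax(1) == y).sum().item()
            stats["adv"] += (pruned(x_adv).argmax(1) == y).sum().item()
            stats["total"] += y.numel()
    return {k: v / stats["total"] for k, v in stats.items() if k != "total"}
```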
-
|
Robustness through Loss Consistency Regularization
(
Oral
)
>
link
While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. |
🔗 |
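Because the regularizer acts at the loss level rather than on intermediate features, it fits in a few lines: compute per-example losses on the original and augmented samples and penalize the gap between them. The squared-gap penalty below is an illustrative choice and may differ from the exact form used in the paper.

```python
import torch
import torch.nn.functional as F

def loss_level_consistency(model, x, y, x_aug, y_aug, lam=1.0):
    """Loss-level consistency regularization (illustrative sketch).

    Works for covariant augmentations because only the per-sample losses, not
    intermediate representations, are forced to agree; the augmented sample
    may carry a different label y_aug than the original.
    """
    loss = F.cross_entropy(model(x), y, reduction="none")
    loss_aug = F.cross_entropy(model(x_aug), y_aug, reduction="none")
    consistency = ((loss - loss_aug) ** 2).mean()       # assumed squared-gap penalty
    return 0.5 * (loss.mean() + loss_aug.mean()) + lam * consistency
```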
-
|
Expressivity of Graph Neural Networks Through the Lens of Adversarial Robustness
(
Oral
)
>
link
We perform the first adversarial robustness study into Graph Neural Networks (GNNs) that are provably more powerful than traditional Message Passing Neural Networks (MPNNs). In particular, we use adversarial robustness as a tool to uncover a significant gap between their theoretically possible and empirically achieved expressive power. To do so, we focus on the ability of GNNs to count specific subgraph patterns, which is an established measure of expressivity, and extend the concept of adversarial robustness to this task. Based on this, we develop efficient adversarial attacks for subgraph counting and show that more powerful GNNs fail to generalize even to small perturbations to the graph's structure. Expanding on this, we show that such architectures also fail to count substructures on out-of-distribution graphs. |
🔗 |
-
|
Introducing Vision into Large Language Models Expands Attack Surfaces and Failure Implications
(
Oral
)
>
link
Recently, there has been a surge of interest in introducing vision into Large Language Models (LLMs). The proliferation of large Visual Language Models (VLMs), such as Flamingo, BLIP-2, and GPT-4, signifies an exciting convergence of advancements in both visual and language foundation models. Yet, risks associated with this integrative approach are largely unexamined. We shed light on the security implications of this trend. First, we underscore that the continuous and redundant nature of the additional visual input space makes it a fertile ground for adversarial attacks. This unavoidably expands the attack surfaces of LLMs, thus complicating defenses. Specifically, we demonstrate that attackers can craft adversarial visual inputs to circumvent the safety mechanisms of LLMs, inducing biased behaviors of the models in the language domain. Second, we point out the broad functionality of LLMs, in turn, also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. By revealing these risks, we emphasize the urgent need for thorough risk assessment, robust defense strategies, and responsible deployment practices to ensure the secure and safe use of VLMs. |
🔗 |
-
|
The Future of Cyber Systems: Human-AI Reinforcement Learning with Adversarial Robustness
(
Oral
)
>
link
Integrating adversarial machine learning (AML) with cyber data representations that support reinforcement learning would unlock human-AI systems with a capacity to dynamically defend against novel attacks, robustly, at machine speed, and with human intelligence. All machine learning (ML) has an underpinning need for robustness to natural errors and malicious tampering. However, unlike many consumer/commercial models, all ML systems built for cyber will be operating in an inherently adversarial environment with skilled adversaries taking advantage of any flaw. This paper outlines the research challenges, integration points, and programmatic importance of such a system, while highlighting the social and scientific benefits of pursuing this ambitious program. |
🔗 |
-
|
Provably Robust Cost-Sensitive Learning via Randomized Smoothing
(
Oral
)
>
link
We focus on learning adversarially robust classifiers under a cost-sensitive scenario, where the potential harm of different class-wise adversarial transformations is encoded in a cost matrix. Existing methods are either empirical, and thus cannot certify cost-sensitive robustness, or suffer from inherent scalability issues. In this work, we study whether randomized smoothing, a more scalable robustness certification framework, can be leveraged to certify cost-sensitive robustness. We first show how to extend the vanilla randomized smoothing pipeline to provide rigorous cost-sensitive robustness guarantees for arbitrary binary cost matrices. However, when extending the standard smoothed classifier training method to cost-sensitive settings, the naive reweighting scheme does not achieve the desired performance due to the indirect optimization of the base classifier. Inspired by this observation, we propose a more direct training method with fine-grained certified radius optimization schemes designed for different data subgroups. Experiments on image benchmark datasets demonstrate that, without sacrificing overall accuracy, our method significantly improves certified cost-sensitive robustness. |
🔗 |
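As a reminder of the underlying machinery, standard randomized smoothing estimates the top-class probability under Gaussian noise and converts it into an L2 certified radius; a cost-sensitive variant only needs that radius to hold against class pairs marked costly in the cost matrix. The sketch below is a minimal Monte-Carlo certificate without the confidence-interval correction a rigorous implementation requires.

```python
import torch
from scipy.stats import norm

def smoothed_certificate(model, x, sigma=0.25, n=1000, device="cpu"):
    """Monte-Carlo randomized smoothing certificate (minimal sketch).

    Returns the smoothed prediction and an uncalibrated L2 radius
    sigma * Phi^{-1}(p_top). A rigorous version lower-bounds p_top with a
    confidence interval, and a cost-sensitive version only requires the
    radius to hold against classes with nonzero cost.
    """
    model.eval()
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape, device=device)
        votes = model(noisy).argmax(dim=1)
    counts = torch.bincount(votes)
    top = counts.argmax().item()
    p_top = counts[top].item() / n
    radius = sigma * norm.ppf(min(max(p_top, 1e-6), 1 - 1e-6))
    return top, max(radius, 0.0)
```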
-
|
Like Oil and Water: Group Robustness and Poisoning Defenses Don’t Mix
(
Oral
)
>
link
Group robustness has become a major concern in machine learning (ML) as conventional training paradigms were found to produce high error on minority groups. Without explicit group annotations, proposed solutions rely on heuristics that aim to identify and then amplify the minority samples during training. In our work, we first uncover a critical shortcoming of these heuristics: an inability to distinguish legitimate minority samples from poison samples in the training set. By amplifying poison samples as well, group robustness methods inadvertently boost the success rate of an adversary---e.g., from 0\% without amplification to over 97\% with it. Moreover, scrutinizing recent poisoning defenses both in centralized and federated learning, we observe that they rely on similar heuristics to identify which samples should be eliminated as poisons. In consequence, minority samples are eliminated along with poisons, which damages group robustness---e.g., from 55\% without the removal of the minority samples to 41\% with it. Finally, as they pursue opposing goals using similar heuristics, our attempts to conciliate group robustness and poisoning defenses come up short. We hope our work highlights how benchmark-driven ML scholarship can obscure the tensions between different metrics, potentially leading to harmful consequences. |
🔗 |
-
|
Provable Instance Specific Robustness via Linear Constraints
(
Oral
)
>
link
Deep Neural Networks (DNNs) trained for classification tasks are vulnerable to adversarial attacks. But not all the classes are equally vulnerable. Adversarial training does not make all classes or groups equally robust as well. For example, in classification tasks with long-tailed distributions, classes are asymmetrically affected during adversarial training, with lower robust accuracy for less frequent classes. In this regard, we propose a provable robustness method by leveraging the continuous piecewise-affine (CPA) nature of DNNs. Our method can impose linearity constraints on the decision boundary, as well as the DNN CPA partition, without requiring any adversarial training. Using such constraints, we show that the margin between the decision boundary and minority classes can be increased in a provable manner. We also present qualitative and quantitative validation of our method for class-specific robustness. |
🔗 |
-
|
Adversarial Training in Continuous-Time Models and Irregularly Sampled Time-Series: A First Look
(
Oral
)
>
link
This study presents the first steps of exploring the effects of adversarial training on continuous-time models and irregularly sampled time series data. Historically, these models and sampling techniques have been largely neglected in adversarial learning research, leading to a significant gap in our understanding of their performance under adversarial conditions. To address this, we conduct an empirical study of adversarial training techniques applied to time-continuous model architectures and sampling methods. Our findings suggest that while continuous-time models tend to outperform their discrete counterparts when trained conventionally, this performance advantage diminishes almost entirely when adversarial training is employed. This indicates that adversarial training may interfere with the time-continuous representation, effectively neutralizing the benefits typically associated with these models. We believe these first insights will be important for guiding further studies and advancements in the understanding of adversarial learning in continuous-time models. |
🔗 |
-
|
Few-shot Anomaly Detection via Personalization
(
Oral
)
>
link
Even with a plentiful amount of normal samples, anomaly detection has been considered a challenging machine learning task due to its one-class nature, i.e., the lack of anomalous samples at training time. It is only recently that a few-shot regime of anomaly detection, despite its wide applicability, became feasible, e.g., with help from large vision-language pre-trained models such as CLIP. In this paper, we explore the potential of large text-to-image generative models in performing few-shot anomaly detection. Specifically, recent text-to-image models have shown an unprecedented ability to generalize from few images to extract their common and unique concepts, and even encode them into a textual token to "personalize" the model: so-called textual inversion. Here, we question whether this personalization is specific enough to discriminate the given images from their potential anomalies, which are often, e.g., open-ended, local, and hard to detect. We observe that standard textual inversion is not enough for detecting anomalies accurately, and thus we propose a simple yet effective regularization scheme to enhance its specificity, derived from the zero-shot transferability of CLIP. We also propose a self-tuning scheme to further optimize the performance of our detection pipeline, leveraging synthetic data generated from the personalized generative model. Our experiments show that the proposed inversion scheme achieves state-of-the-art results on a wide range of few-shot anomaly detection benchmarks. |
🔗 |
-
|
Rethinking Label Poisoning for GNNs: Pitfalls and Attacks
(
Oral
)
>
link
Node labels for graphs are usually generated using an automated process, or crowd-sourced from human users. This opens up avenues for malicious users to compromise the training labels, making it unwise to blindly rely on them. While robustness against noisy labels is an active area of research, there are only a handful of papers in the literature that address this for graph-based data. Even more so, the effects of adversarial label perturbations are sparsely studied. A recent work revealed that the entire literature on label poisoning for GNNs is plagued by serious evaluation pitfalls and showed that existing attacks become ineffective once these shortcomings are fixed. In this work, we introduce two new simple yet effective attacks that are significantly stronger (up to $\sim8\%$) than the previous strongest attack. Our work demonstrates the need for more robust defense mechanisms, especially considering the \emph{transferability} of our attacks, where a strategy devised for one model can effectively contaminate numerous other models.
|
🔗 |
-
|
Shrink & Cert: Bi-level Optimization for Certified Robustness
(
Oral
)
>
link
In this paper, we advance the concept of shrinking weights to train certifiably robust models from the fresh perspective of gradient-based bi-level optimization. Lack of robustness against adversarial attacks remains a challenge in safety-critical applications. Many attempts in the literature only provide empirical verification of defenses against certain attacks and can be easily broken. Methods in other lines of work can only develop certified guarantees of model robustness in limited scenarios and are computationally expensive. We present a weight shrinkage formulation that is computationally inexpensive and can be solved as a simple first-order optimization problem. We show that a model trained with our method has lower Lipschitz bounds in each layer, which directly provides formal guarantees on certified robustness. We demonstrate that our approach, Shrink \& Cert (SaC), achieves provably robust networks which simultaneously give excellent standard and robust accuracy. We demonstrate the success of our approach on the CIFAR-10 and ImageNet datasets and compare it with existing robust training techniques. Code : \url{https://github.com/sagarverma/BiC} |
🔗 |
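The certificate rests on per-layer Lipschitz constants: for a sequential network with 1-Lipschitz activations, the product of the weight matrices' spectral norms upper-bounds the network's Lipschitz constant, so shrinking weights shrinks the bound. The helper below computes this crude bound for the linear layers of a PyTorch model; convolutions and skip connections need a dedicated operator-norm estimate.

```python
import torch

def naive_lipschitz_upper_bound(model):
    """Product of spectral norms of the weight matrices (crude upper bound).

    Valid as an upper bound for sequential models with 1-Lipschitz activations
    such as ReLU; it says nothing tight about skip connections or convolutions,
    which require a dedicated operator-norm computation.
    """
    bound = 1.0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()
    return bound
```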
-
|
Preventing Reward Hacking with Occupancy Measure Regularization
(
Oral
)
>
link
Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent’s state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment. |
🔗 |
-
|
Evading Black-box Classifiers Without Breaking Eggs
(
Oral
)
>
link
Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc). Queries to such systems carry a fundamentally *asymmetric cost*: queries detected as "bad" come at a higher cost because they trigger additional security filters, e.g., usage throttling or account suspension. Yet, we find that existing decision-based attacks issue a large number of "bad" queries, which likely renders them ineffective against security-critical systems. We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries. We thus pose it as an open problem to build black-box attacks that are more effective under realistic cost metrics.
|
🔗 |
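The asymmetric cost model is easy to instrument: wrap the black-box classifier so that queries flagged as "bad" are billed at a higher rate than benign ones, and report attack cost under that metric. The wrapper below is a small sketch with an assumed cost ratio; the flagging mechanism is supplied by the caller.

```python
class AsymmetricCostOracle:
    """Black-box classifier wrapper that tracks 'bad' and total query counts.

    `classify(x)` is assumed to return (label, is_flagged_bad); the cost ratio
    between flagged and benign queries is an illustrative parameter.
    """
    def __init__(self, classify, bad_query_cost=10.0, benign_query_cost=1.0):
        self.classify = classify
        self.bad_cost = bad_query_cost
        self.benign_cost = benign_query_cost
        self.n_bad = 0
        self.n_total = 0

    def __call__(self, x):
        label, flagged = self.classify(x)
        self.n_total += 1
        self.n_bad += int(flagged)
        return label

    @property
    def total_cost(self):
        benign = self.n_total - self.n_bad
        return self.n_bad * self.bad_cost + benign * self.benign_cost
```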
-
|
Deceptive Alignment Monitoring
(
Oral
)
>
link
As the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves. The threat that a model might behave in a seemingly reasonable manner, while secretly and subtly modifying its behavior for ulterior reasons is often referred to as deceptive alignment in the AI Safety & Alignment communities. Consequently, we call this new direction Deceptive Alignment Monitoring. In this work, we identify emerging directions in diverse machine learning subfields that we believe will become increasingly important and intertwined in the near future for deceptive alignment monitoring, and we argue that advances in these fields present both long-term challenges and new research opportunities. We conclude by advocating for greater involvement by the adversarial machine learning community in these emerging directions. |
🔗 |
-
|
Baselines for Identifying Watermarked Large Language Models
(
Oral
)
>
link
We consider the emerging problem of identifying the presence and use of watermarking schemes in widely used, publicly hosted, closed source large language models (LLMs). That is, rather than determine if a given text is generated by a watermarked language model, we seek to answer the question of if the model itself is watermarked. To do so, we introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce distributions that diverge qualitatively and identifiably from standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to watermarking scenario. |
🔗 |
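One of the baseline ideas above, comparing output token distributions, can be sketched as follows: sample next tokens from the suspect model and from an unmarked reference on the same prompts and measure the divergence of their empirical distributions. The `sample_next_tokens` helper and the use of a smoothed KL divergence are illustrative assumptions.

```python
import math
from collections import Counter

def distribution_divergence_score(sample_next_tokens, prompts, n_samples=200):
    """Score how far a suspect model's next-token distribution is from a reference.

    `sample_next_tokens(model_name, prompt, n)` is a hypothetical helper that
    returns n sampled next tokens; a watermarked model's token frequencies tend
    to diverge identifiably from an unmarked reference on the same prompts.
    """
    total_kl = 0.0
    for prompt in prompts:
        suspect = Counter(sample_next_tokens("suspect", prompt, n_samples))
        reference = Counter(sample_next_tokens("reference", prompt, n_samples))
        vocab = set(suspect) | set(reference)
        for tok in vocab:
            p = (suspect[tok] + 1) / (n_samples + len(vocab))     # add-one smoothing
            q = (reference[tok] + 1) / (n_samples + len(vocab))
            total_kl += p * math.log(p / q)
    return total_kl / len(prompts)   # higher score suggests a watermark
```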
-
|
Why do universal adversarial attacks work on large language models?: Geometry might be the answer
(
Oral
)
>
link
Transformer based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations. We believe this new geometric perspective on the underlying mechanism driving universal attacks could help us gain deeper insight into the internal workings and failure modes of LLMs, thus enabling their mitigation. |
🔗 |
-
|
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation
(
Oral
)
>
link
We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks. Its primary goal is advancing the understanding and mitigation of adversarial attacks. FACADE aims to generate probabilistic distributions over circuits, which provide critical insights into their contribution to changes in the manifold properties of pseudo-classes, or high-dimensional modes in activation space, yielding a powerful tool for uncovering and combating adversarial attacks. Our approach seeks to improve model robustness, enhance scalable model oversight, and demonstrates promising applications in real-world deployment settings. |
🔗 |
-
|
Robust Deep Learning via Layerwise Tilted Exponentials
(
Oral
)
>
link
State-of-the-art techniques for enhancing robustness of deep networks mostly rely on end-to-end training with suitable data augmentation. In this paper, we propose a complementary approach aimed at enhancing the signal-to-noise ratio at intermediate network layers, loosely motivated by the classical communication-theoretic model of signaling in Gaussian noise. We seek to learn neuronal weights which are matched to the layer inputs by supplementing end-to-end costs with a tilted exponential (TEXP) objective function which depends on the activations at the layer outputs. We show that TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. TEXP inference is accomplished by replacing batch norm by a tilted softmax enforcing competition across neurons, which can be interpreted as computation of posterior probabilities for the signaling hypotheses represented by each neuron. We show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise, other common corruptions and mild adversarial perturbations, without requiring data augmentation. Further gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with adversarial training. |
🔗 |
-
|
Learning Shared Safety Constraints from Multi-task Demonstrations
(
Oral
)
>
link
Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks. |
🔗 |
-
|
Teach GPT To Phish
(
Oral
)
>
link
Quantifying privacy risks in large language models (LLMs) is an important research question. We take a step towards answering this question by defining a real-world threat model wherein an entity seeks to augment an LLM with private data they possess via fine-tuning. The entity also seeks to improve the quality of its LLM outputs over time by learning from human feedback. We propose a novel |
🔗 |
-
|
How Can Neuroscience Help Us Build More Robust Deep Neural Networks?
(
Oral
)
>
link
Although Deep Neural Networks (DNNs) are often compared to biological visual systems, they are far less robust to natural and adversarial examples. In contrast, biological visual systems can reliably recognize different objects under a variety of settings. While recent innovations have closed the performance gap between biological and artificial vision systems to some extent, there are still many practical differences between the two. In this Blue Sky Ideas presentation, we will identify some key differences between standard DNNs and biological perceptual systems that may contribute to this lack of robustness. We will then discuss possible avenues for future work by identifying promising DNNs that are constructed with these biological computational motifs but have hardly been examined in terms of robustness. |
🔗 |
-
|
A physics-oriented method for attacking SAR images using salient regions
(
Oral
)
>
link
The use of deep neural networks in SAR target recognition makes it vulnerable to adversarial attacks. Previous studies have utilized optical image attacks, electromagnetic scattering parameter models, and structural parameter perturbation in generating SAR adversarial examples. The imaging process for SAR images in the physical world is dissimilar to that of optical images because SAR imaging is solely governed by imaging equations rather than the what-you-see-is-what-you-get principle. As a result, generating SAR adversarial samples in the physical world requires considering the changes in SAR imaging equations that happen after deploying physical devices. Thus, this study proposes a physical attack technique reliant on salient regions to add adversarial scatterers in the physical domain, masking the salient regions identified by classifiers in SAR images and subsequently degrading the classification capabilities of the classifiers. In contrast to previous algorithms, the proposed algorithm distinguishes itself through two key features: (1) SAR-BagNet is utilized to identify the salient regions of SAR targets recognized by classifiers, allowing for the exact position and size determination of the adversarial scatterers and enhancing interpretability; (2) dynamic step size optimization, based on the difference equation, continuously refines the electromagnetic parameters, structural parameters, and texture parameters of the adversarial scatterers, leading to a higher search efficiency. The simulation experiments demonstrated that the generated adversarial samples, after adding and modifying the design parameters of the adversarial scatterers in the initial physical model, reduced the classification accuracy of classifiers on the simulated images from 100% to 14.4%. These experimental results indicate that the proposed method has considerable potential for further exploration and research on physical-domain adversarial attacks in SAR. |
🔗 |
-
|
Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage
(
Oral
)
>
link
Machine learning models are increasingly utilized across impactful domains to predict individual outcomes. As such, many models provide algorithmic recourse to individuals who receive negative outcomes. However, recourse can be leveraged by adversaries to disclose private information. This work presents the first attempt at mitigating such attacks. We present two novel methods to generate differentially private recourse: Differentially Private Model ($\texttt{DPM}$) and Laplace Recourse ($\texttt{LR}$). Using logistic regression classifiers and real world and synthetic datasets, we find that $\texttt{DPM}$ and $\texttt{LR}$ perform well in reducing what an adversary can infer, especially at low $\texttt{FPR}$. When training dataset size is large enough, we find particular success in preventing privacy leakage while maintaining model and recourse accuracy with our novel $\texttt{LR}$ method.
|
🔗 |
-
|
Theoretically Principled Trade-off for Stateful Defenses against Query-Based Black-Box Attacks
(
Oral
)
>
link
Adversarial examples threaten the integrity of machine learning systems with alarming success rates even under constrained black-box conditions. Stateful defenses have emerged as an effective countermeasure, detecting potential attacks by maintaining a buffer of recent queries and detecting new queries that are too similar. However, these defenses fundamentally pose a trade-off between attack detection and false positive rates, and this trade-off is typically optimized by hand-picking feature extractors and similarity thresholds that empirically work well. There is little current understanding as to the formal limits of this trade-off and the exact properties of the feature extractors/underlying problem domain that influence it. This work aims to address this gap by offering a theoretical characterization of the trade-off between detection and false positive rates for stateful defenses. We provide upper bounds for detection rates of a general class of feature extractors and analyze the impact of this trade-off on the convergence of black-box attacks. We then support our theoretical findings with empirical evaluations across multiple datasets and stateful defenses. |
🔗 |
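The trade-off analyzed above is governed by a single moving part: how close a new query's features must be to something in the recent-query buffer before it is flagged. A minimal buffer-based detector is sketched below; the feature extractor and the threshold are exactly the knobs whose limits the paper characterizes.

```python
import numpy as np
from collections import deque

class StatefulQueryDetector:
    """Flag queries whose features are too close to any recent query (sketch).

    `feature_extractor(x)` maps an input to a vector; lowering `threshold`
    catches more attack sequences but also raises the false positive rate,
    which is the trade-off characterized in the paper.
    """
    def __init__(self, feature_extractor, threshold=0.1, buffer_size=1000):
        self.extract = feature_extractor
        self.threshold = threshold
        self.buffer = deque(maxlen=buffer_size)

    def check(self, x):
        feat = np.asarray(self.extract(x), dtype=np.float32)
        is_attack = any(np.linalg.norm(feat - prev) < self.threshold
                        for prev in self.buffer)
        self.buffer.append(feat)
        return is_attack
```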
-
|
DiffScene: Diffusion-Based Safety-Critical Scenario Generation for Autonomous Vehicles
(
Oral
)
>
link
The field of Autonomous Driving (AD) has witnessed significant progress in recent years. Among the various challenges faced, the safety evaluation of autonomous vehicles (AVs) stands out as a critical concern. Traditional evaluation methods are both costly and inefficient, often requiring extensive driving mileage in order to encounter rare safety-critical scenarios, which are distributed on the long tail of the complex real-world driving landscape. In this paper, we propose a unified approach, Diffusion-Based Safety-Critical Scenario Generation (DiffScene), to generate high-quality safety-critical scenarios which are both realistic and safety-critical for efficient AV evaluation. In particular, we propose a diffusion-based generation framework, leveraging the power of approximating the distribution of low-density spaces for diffusion models. We design several adversarial optimization objectives to guide the diffusion generation under predefined adversarial budgets. These objectives, such as safety-based objective, functionality-based objective, and constraint-based objective, ensure the generation of safety-critical scenarios while adhering to specific constraints. Extensive experimentation has been conducted to validate the efficacy of our approach. Compared with 6 SOTA baselines, DiffScene generates scenarios that are (1) more safety-critical under 3 metrics, (2) more realistic under 5 distance functions, and (3) more transferable to different AV algorithms. In addition, we demonstrate that training AV algorithms with scenarios generated by DiffScene leads to significantly higher performance in terms of the safety-critical metrics compared to baselines. These findings highlight the potential of DiffScene in addressing the challenges of AV safety evaluation, paving the way for more efficient and effective AV development. |
🔗 |
-
|
Improving Adversarial Training for Multiple Perturbations through the Lens of Uniform Stability
(
Oral
)
>
link
In adversarial training (AT), most existing works focus on AT with a single type of perturbation, such as the $\ell_\infty$ attacks. However, deep neural networks (DNNs) are vulnerable to different types of adversarial examples, necessitating the development of adversarial training for multiple perturbations (ATMP). Despite the benefits of ATMP, there exists a trade-off between different types of attacks. Furthermore, there is a lack of theoretical analyses of ATMP, which hinders its further development. To address these issues, we conduct a smoothness analysis of ATMP. Our analysis reveals that $\ell_1$, $\ell_2$, and $\ell_\infty$ adversaries contribute differently to the smoothness of the loss function in ATMP. Leveraging these smoothness properties, we investigate the improvement of ATMP through the lens of uniform stability. Through our research, we demonstrate that employing an adaptive smoothness-weighted learning rate leads to enhanced uniform stability bounds, thus improving adversarial training for multiple perturbations. We validate our findings through experiments on CIFAR-10 and CIFAR-100 datasets, where our approach achieves competitive performance against various mixtures of multiple perturbation attacks. This work contributes to a deeper understanding of ATMP and provides practical insights for improving the robustness of DNNs against diverse adversarial examples.
|
🔗 |
-
|
A Theoretical Perspective on the Robustness of Feature Extractors
(
Oral
)
>
link
Recent theoretical work on robustness to adversarial examples has derived lower bounds on how robust any model can be when the distribution and adversarial constraints are specified. However, these bounds do not account for the specific models used in practice, such as neural networks. In this paper, we develop a methodology to analyze the fundamental limits on the robustness of fixed feature extractors, which in turn provides bounds on the robustness of classifiers trained on top of them. The tightness of these bounds relies on the effectiveness of the method used to find collisions between pairs of perturbed examples at deeper layers. For linear feature extractors, we provide closed-form expressions for collision finding while for piece-wise linear feature extractors, we propose a bespoke algorithm based on the iterative solution of a convex program that provably finds collisions. We utilize our bounds to identify structural features of classifiers that lead to a lack of robustness and provide insights into the effectiveness of different training methods at obtaining robust feature extractors. |
🔗 |
-
|
Characterizing the Optimal $0-1$ Loss for Multi-class Classification with a Test-time Attacker
(
Oral
)
>
link
Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a fixed data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on robust loss in the presence of a test-time attacker for *multi-class classifiers on any discrete dataset*. We provide a general framework for finding the optimal $0-1$ loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. The prohibitive cost of this formulation in practice leads us to formulate other variants of the attacker-classifier game that more efficiently determine the range of the optimal loss. Our evaluation shows, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.
|
🔗 |
-
|
RODEO: Robust Out-of-distribution Detection via Exposing Adaptive Outliers
(
Oral
)
>
link
Detecting out-of-distribution (OOD) input samples at inference time is a key element in the trustworthy deployment of intelligent models. While there has been a tremendous improvement in various flavors of OOD detection in recent years, the detection performance under adversarial settings lags far behind the performance in the standard setting. In order to bridge this gap, we introduce RODEO in this paper, a data-centric approach that generates effective outliers for robust OOD detection. More specifically, we first show that targeting the classification of adversarially perturbed in- and out-of-distribution samples through outlier exposure (OE) could be an effective strategy for the mentioned purpose as long as the training outliers meet certain quality standards. We hypothesize that the outliers in the OE should possess several characteristics simultaneously to be effective in the adversarial training: diversity, and both conceptual differentiability and analogy to the inliers. These aspects seem to play a more critical role in the adversarial setup compared to the standard training. Next, we propose to take advantage of existing text-to-image generative models, conditioned on the inlier or normal samples, and text prompts that minimally edit the normal samples, and turn them into near-distribution outliers. This process helps to satisfy the three mentioned criteria for the generated outliers, and significantly boosts the performance of OE especially in the adversarial setting. We demonstrate the general effectiveness of this approach in various related problems including novelty/anomaly detection, Open-Set Recognition (OSR), and OOD detection. We also make a comprehensive comparison of our method against other adaptive OE techniques under the adversarial setting to showcase its effectiveness. |
🔗 |
-
|
Rethinking Robust Contrastive Learning from the Adversarial Perspective
(
Oral
)
>
link
To advance the understanding of robust deep learning, we delve into the effects of adversarial training on self-supervised and supervised contrastive learning, alongside supervised learning. Our analysis uncovers significant disparities between adversarial and clean representations in standard-trained networks, across various learning algorithms. Remarkably, adversarial training mitigates these disparities and fosters the convergence of representations toward a universal set, regardless of the learning scheme used. Additionally, we observe that increasing the similarity between adversarial and clean representations, particularly near the end of the network, enhances network robustness. These findings offer valuable insights for designing and training effective and robust deep learning networks. |
🔗 |
-
|
TMI! Finetuned Models Spill Secrets from Pretraining
(
Oral
)
>
link
Transfer learning has become an increasingly popular technique in machine learning as a way to leverage a pretrained model trained for related tasks. This paradigm has been especially popular for \emph{privacy preserving machine learning}, where the pretrained model is considered public, and only the data for finetuning is considered sensitive. However, there are reasons to believe that the data used for pretraining is still sensitive. In this work we study privacy leakage via membership-inference attacks, and we propose a new threat model where the adversary only has access to the finetuned model and would like to infer the membership of the pretraining data. To realize this threat model, we implement a novel metaclassifier-based attack, TMI. We evaluate TMI on both vision and natural language tasks across multiple transfer learning settings, including finetuning with differential privacy. Through our evaluation, we find that TMI can successfully infer membership of pretraining examples using query access to the finetuned model. |
🔗 |
-
|
A First Order Meta Stackelberg Method for Robust Federated Learning
(
Oral
)
>
link
Previous research has shown that federated learning (FL) systems are exposed to an array of security risks. Despite the proposal of several defensive strategies, they tend to be non-adaptive and specific to certain types of attacks, rendering them ineffective against unpredictable or adaptive threats. This work models adversarial federated learning as a Bayesian Stackelberg Markov game (BSMG) to capture the defender's incomplete information of various attack types. We propose meta-Stackelberg learning (meta-SL), a provably efficient meta-learning algorithm, to solve the equilibrium strategy in BSMG, leading to an adaptable FL defense. We demonstrate that meta-SL converges to the first-order $\varepsilon$-equilibrium point in $O(\varepsilon^{-2})$ gradient iterations, with $O(\varepsilon^{-4})$ samples needed per iteration, matching the state of the art. Empirical evidence indicates that our meta-Stackelberg framework performs exceptionally well against potent model poisoning and backdoor attacks of an uncertain nature.
|
🔗 |
-
|
Backdoor Attacks for In-Context Learning with Language Models
(
Oral
)
>
link
Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally, we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone. |
🔗 |
-
|
R-LPIPS: An Adversarially Robust Perceptual Similarity Metric
(
Oral
)
>
link
Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. |
🔗 |
-
|
Risk-Averse Predictions on Unseen Domains via Neural Style Smoothing
(
Oral
)
>
link
Achieving high accuracy on data from domains unseen during training is a fundamental challenge in machine learning. While state-of-the-art neural networks have achieved impressive performance on various tasks, their predictions are biased towards domain-dependent information (e.g., image styles) rather than domain-invariant information (e.g., image content). This makes them unreliable for deployment in risk-sensitive settings such as autonomous driving. In this work, we propose a novel inference procedure, Test-Time Neural Style Smoothing (TT-NSS), that produces risk-averse predictions using a ``style smoothed'' version of a classifier. Specifically, the style smoothed classifier classifies a test image as the most probable class predicted by the original classifier on random re-stylizations of the test image. TT-NSS uses a neural style transfer module to stylize the test image on the fly, requires only black-box access to the classifier, and, crucially, abstains when predictions of the original classifier on the stylized images lack consensus. We further propose a neural style smoothing-based training procedure that improves the prediction consistency and the performance of the style-smoothed classifier on non-abstained samples. Our experiments on the PACS dataset and its variations, in both single and multiple domain settings, highlight the effectiveness of our methods at producing risk-averse predictions on unseen domains. |
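A minimal sketch of the test-time style-smoothing idea described in this abstract. It assumes a user-supplied stylize(image) routine (e.g., a neural style transfer network) and black-box query access to classifier_fn; the names and the consensus threshold are illustrative assumptions, not the authors' implementation.

import numpy as np

def tt_nss_predict(image, classifier_fn, stylize, n_samples=32, consensus=0.6):
    """Classify `image` as the most frequent prediction over random re-stylizations.

    classifier_fn: maps an image to a predicted class id (black-box access).
    stylize:       returns a randomly re-stylized copy of the image (assumed helper).
    consensus:     minimum fraction of agreeing votes required; otherwise abstain.
    """
    votes = [classifier_fn(stylize(image)) for _ in range(n_samples)]
    classes, counts = np.unique(votes, return_counts=True)
    top = counts.argmax()
    if counts[top] / n_samples < consensus:
        return None  # abstain: stylized predictions lack consensus
    return int(classes[top])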
🔗 |
-
|
A Simple and Yet Fairly Effective Defense for Graph Neural Networks
(
Oral
)
>
link
Graph neural networks (GNNs) have become the standard approach for performing machine learning on graphs. However, concerns have been raised regarding their vulnerability to small adversarial perturbations. Existing defense methods suffer from high time complexity and can negatively impact the model's performance on clean graphs. In this paper, we propose NoisyGCN, a defense method that injects noise into the GCN architecture. We derive a mathematical upper bound linking GCN's robustness to noise injection, establishing our method's effectiveness. Through empirical evaluations on the node classification task, we demonstrate superior or comparable performance to existing methods while minimizing the added time complexity. |
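A minimal PyTorch sketch of the noise-injection idea the abstract describes; the placement and scale of the Gaussian noise are illustrative assumptions rather than the exact NoisyGCN design.

import torch
import torch.nn as nn

class NoisyGCNLayer(nn.Module):
    """Graph convolution layer that adds Gaussian noise to hidden features (sketch)."""

    def __init__(self, in_dim, out_dim, noise_std=0.1):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.noise_std = noise_std

    def forward(self, adj_norm, x):
        # adj_norm: normalized adjacency matrix (N x N); x: node features (N x in_dim)
        h = adj_norm @ self.linear(x)
        if self.training:
            h = h + self.noise_std * torch.randn_like(h)  # noise injection for robustness
        return torch.relu(h)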
🔗 |
-
|
Incentivizing Honesty among Competitors in Collaborative Learning
(
Oral
)
>
link
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, thus preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning. |
🔗 |
-
|
Towards Effective Data Poisoning for Imbalanced Classification
(
Oral
)
>
link
Targeted Clean-label Data Poisoning Attacks (TCDPA) aim to manipulate training samples in a label-consistent manner to gain malicious control over targeted samples' output during deployment. A prominent class of TCDPA methods, gradient-matching based data-poisoning methods, utilizes a small subset of training-class samples to match the poisoned gradient of a target sample. However, their effectiveness is limited when attacking imbalanced datasets because of gradient mismatch caused by training-time data balancing techniques such as Re-weighting and Re-sampling. In this paper, we propose two modifications that eliminate this gradient mismatch and thereby enhance the efficacy of gradient-matching-based TCDPA on imbalanced datasets. Our methods achieve notable improvements of up to 32% (Re-sampling) and 51% (Re-weighting) in terms of Attack Effect Success Rate on MNIST and CIFAR10. |
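A hedged sketch of the gradient-matching objective used by this family of attacks: poison examples are optimized so that their training gradient aligns (in cosine similarity) with the adversarial gradient of the target. The function names are hypothetical, and this shows only the basic objective, not the paper's modifications for imbalanced data.

import torch
import torch.nn.functional as F

def gradient_matching_loss(model, poison_x, poison_y, target_x, target_y_adv):
    """Negative cosine similarity between the poison-batch gradient and the
    adversarial target gradient (sketch of the gradient-matching objective).

    poison_x is the differentiable tensor of poisoned inputs being optimized
    (i.e., it should have requires_grad=True in the outer optimization loop).
    """
    params = [p for p in model.parameters() if p.requires_grad]

    target_loss = F.cross_entropy(model(target_x), target_y_adv)
    g_target = torch.autograd.grad(target_loss, params, retain_graph=True)

    poison_loss = F.cross_entropy(model(poison_x), poison_y)
    g_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    num = sum((gt * gp).sum() for gt, gp in zip(g_target, g_poison))
    den = (torch.sqrt(sum((gt ** 2).sum() for gt in g_target)) *
           torch.sqrt(sum((gp ** 2).sum() for gp in g_poison)) + 1e-12)
    return 1.0 - num / den  # minimized when the two gradients align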
🔗 |
-
|
Black Box Adversarial Prompting for Foundation Models
(
Oral
)
>
link
Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or generating high perplexity text. |
🔗 |
-
|
Exposing the Fake: Effective Diffusion-Generated Images Detection
(
Oral
)
>
link
Image synthesis has seen significant advancements with the advent of diffusion-based generative models like Denoising Diffusion Probabilistic Models (DDPM) and text-to-image diffusion models. Despite their efficacy, there is a dearth of research dedicated to detecting diffusion-generated images, which could pose potential security and privacy risks. This paper addresses this gap by proposing a novel detection method called Stepwise Error for Diffusion-generated Image Detection (SeDID). Comprising statistical-based SeDID and neural network-based SeDID, SeDID exploits the unique attributes of diffusion models, namely deterministic reverse and deterministic denoising computation errors. Our evaluations demonstrate SeDID's superior performance over existing methods when applied to diffusion models. Thus, our work makes a pivotal contribution to distinguishing diffusion model-generated images, marking a significant step in the domain of artificial intelligence security. |
🔗 |
-
|
AdversNLP: A Practical Guide to Assessing NLP Robustness Against Text Adversarial Attacks
(
Oral
)
>
link
The emergence of powerful language models in natural language processing (NLP) has sparked a wave of excitement about their potential to revolutionize decision-making. However, this excitement should be tempered by their vulnerability to adversarial attacks, which are carefully perturbed inputs able to fool the model into inaccurate decisions. In this paper, we present AdversNLP, a practical framework to assess the robustness of NLP applications against text-based adversaries. Our framework combines and extends the technical capabilities of established NLP adversarial attack tools (i.e., TextAttack) and tailors an audit guide to navigate the landscape of threats to NLP applications. AdversNLP illustrates best practices and vulnerabilities through customized attack recipes and presents evaluation metrics in the form of key performance indicators (KPIs). Our study demonstrates the severity of the threat posed by adversarial attacks and the need for more initiatives bridging the gap between research contributions and industrial applications. |
🔗 |
-
|
Proximal Compositional Optimization for Distributionally Robust Learning
(
Oral
)
>
link
Recently, compositional optimization (CO) has gained popularity because of its applications in distributionally robust optimization (DRO) and many other machine learning problems. Often, (non-smooth) regularization terms are added to an objective to impose some structure and/or improve the generalization performance of the learned model. However, when it comes to CO, there is a lack of efficient algorithms that can solve regularized CO problems. Moreover, current state-of-the-art methods for solving such problems rely on the computation of large batch gradients (with batch size depending on the solution accuracy), which is not feasible in most practical settings. To address these challenges, in this work, we consider a certain regularized version of the CO problem that often arises in DRO formulations and develop a proximal algorithm for solving it. We perform a Moreau envelope-based analysis and establish that, without the need to compute large batch gradients, the proposed algorithm achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity, which matches the vanilla SGD guarantees for solving non-CO problems. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.
|
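A minimal worked illustration of one proximal stochastic step for an $\ell_1$-regularized objective of the form smooth$(x) + \lambda\|x\|_1$: a stochastic gradient step on the smooth (compositional) part followed by soft-thresholding, the proximal operator of the $\ell_1$ norm. This is a generic sketch of the proximal mechanism, not the paper's full algorithm or its Moreau-envelope analysis.

import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_sgd_step(x, grad_smooth, lr, lam):
    """One proximal stochastic gradient step for F(x) = smooth(x) + lam * ||x||_1.

    grad_smooth: stochastic gradient of the smooth (compositional) part at x.
    """
    x = x - lr * grad_smooth            # gradient step on the smooth part
    return soft_threshold(x, lr * lam)  # proximal step handles the non-smooth term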
🔗 |
-
|
PIAT: Parameter Interpolation based Adversarial Training for Image Classification
(
Oral
)
>
link
Adversarial training has been demonstrated to be the most effective approach to defend against adversarial attacks. However, existing adversarial training methods show apparent oscillations and overfitting issues in the training process, degrading the defense efficacy. In this work, we propose a novel framework, termed Parameter Interpolation based Adversarial Training (PIAT), that makes full use of the historical information during training. Specifically, at the end of each epoch, PIAT tunes the model parameters as the interpolation of the parameters of the previous and current epochs. Besides, we suggest using the Normalized Mean Square Error (NMSE) to further improve robustness by aligning the relative, rather than absolute, magnitude of logits between clean and adversarial examples. Extensive experiments on several benchmark datasets and various networks show that our framework can prominently improve model robustness and reduce the generalization error. |
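A minimal PyTorch sketch of the two ingredients described above; the interpolation coefficient and the exact NMSE normalization shown here are illustrative assumptions. Here prev_state would be a snapshot {name: p.detach().clone()} of the parameters taken at the start of the epoch.

import torch
import torch.nn.functional as F

@torch.no_grad()
def interpolate_parameters(model, prev_state, beta=0.5):
    """At the end of an epoch, set parameters to an interpolation of the
    previous and current epoch's parameters (sketch of the PIAT update)."""
    for name, p in model.named_parameters():
        p.copy_(beta * prev_state[name] + (1.0 - beta) * p)

def nmse_loss(clean_logits, adv_logits, eps=1e-8):
    """Normalized mean squared error between clean and adversarial logits,
    aligning their relative rather than absolute magnitudes (sketch)."""
    c = clean_logits / (clean_logits.norm(dim=1, keepdim=True) + eps)
    a = adv_logits / (adv_logits.norm(dim=1, keepdim=True) + eps)
    return F.mse_loss(a, c)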
🔗 |
-
|
Mathematical Theory of Adversarial Deep Learning
(
Oral
)
>
link
In this Show-and-Tell Demos paper, progress on mathematical theories for adversarial deep learning is reported. First, achieving robust memorization for certain neural networks is shown to be an NP-hard problem. Furthermore, neural networks with $O(Nn)$ parameters are constructed for optimal robust memorization of any dataset with dimension $n$ and size $N$ in polynomial time. Second, adversarial training is formulated as a Stackelberg game and is shown to result in a network with optimal adversarial accuracy when the Carlini-Wagner margin loss is used. Finally, the bias classifier is introduced and is shown to be information-theoretically secure against the original-model gradient-based attack.
|
🔗 |
-
|
Adapting Robust Reinforcement Learning to Handle Temporally-Coupled Perturbations
(
Oral
)
>
link
Recent years have witnessed the development of robust training to defend against the vulnerability of RL policies. Existing threat models impose static constraints on perturbations at each timestep and overlook the temporal influence of past perturbations on the current ones, despite its crucial consideration in many real-world scenarios. We formally introduce temporally-coupled attacks to account for the temporal coupling between perturbations at consecutive time steps, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic response approach that treats the temporally-coupled robust RL problem as a partially-observable two-player game. By finding an approximate equilibrium in our approach, GRAD ensures the agent's robustness against the learned adversary. Empirical experiments on a variety of continuous control tasks demonstrate that our proposed approach exhibits significant robustness advantages compared to baselines against both standard and temporally-coupled attacks, in both the state and action spaces. |
🔗 |
-
|
Navigating Graph Robust Learning against All-Intensity Attacks
(
Oral
)
>
link
Graph Neural Networks have demonstrated exceptional performance in a variety of graph learning tasks, but their vulnerability to adversarial attacks remains a major concern. Accordingly, many defense methods have been developed to learn robust graph representations and mitigate the impact of adversarial attacks. However, most of the existing methods suffer from two major drawbacks: (i) their robustness degrades under higher-intensity attacks, and (ii) they cannot scale to large graphs. In light of this, we develop a novel graph defense method to address these limitations. Our method first applies a denoising module to recover a cleaner graph by removing edges associated with attacked nodes; it then utilizes a Mixture-of-Experts to select differentially private noises of different magnitudes to counteract node features attacked at different intensities. In addition, the overall design of our method avoids relying on heavy adjacency matrix computations such as SVD, enabling the framework's applicability to large graphs. |
🔗 |
-
|
Towards Out-of-Distribution Adversarial Robustness
(
Oral
)
>
link
Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4\% for the best existing baseline to 25.9\% on MNIST, and from 16.9\% to 23.5\% on CIFAR10.
|
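A sketch of how Risk Extrapolation can be applied with each attack type treated as a domain: the training loss is the mean per-attack risk plus a penalty on the variance of those risks. The helper names and the penalty weight are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F

def rex_adversarial_loss(model, x, y, attacks, beta=10.0):
    """REx over attack 'domains': mean per-attack risk plus a variance penalty.

    attacks: list of callables, each mapping (model, x, y) -> adversarial x.
    """
    risks = []
    for attack in attacks:
        x_adv = attack(model, x, y)
        risks.append(F.cross_entropy(model(x_adv), y))
    risks = torch.stack(risks)
    return risks.mean() + beta * risks.var()  # variance term equalizes robustness across attacks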
🔗 |
-
|
Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
(
Oral
)
>
link
Recent neural architecture search (NAS) frameworks have been successful in finding optimal architectures for given conditions (e.g., performance or latency). However, they search for optimal architectures in terms of their performance on clean images only, while robustness against various types of perturbations or corruptions is crucial in practice. Although several robust NAS frameworks tackle this issue by integrating adversarial training into one-shot NAS, they are limited in that they only consider robustness against adversarial attacks and require significant computational resources to discover optimal architectures for a single task, which makes them impractical in real-world scenarios. To address these challenges, we propose a novel lightweight robust zero-cost proxy that considers the consistency across features, parameters, and gradients of both clean and perturbed images at the initialization state. Our approach facilitates an efficient and rapid search for neural architectures capable of learning generalizable features that exhibit robustness across diverse perturbations. The experimental results demonstrate that our proxy can rapidly and efficiently search for neural architectures that are consistently robust against various perturbations on multiple benchmark datasets and diverse search spaces, largely outperforming existing clean zero-shot NAS and robust NAS with reduced search cost. |
🔗 |
-
|
Adversarial Robustness for Tabular Data through Cost and Utility Awareness
(
Oral
)
>
link
Many machine learning applications (credit scoring, fraud detection, etc.) use data in the tabular domains. Adversarial examples can be especially damaging for these applications. Yet, existing works on adversarial robustness mainly focus on machine-learning models in the image and text domains. We argue that due to the differences between tabular data and images or text, existing threat models are inappropriate for tabular domains. These models do not capture that cost can be more important than imperceptibility, nor that the adversary could ascribe different value to the utility obtained from deploying different adversarial examples. We show that due to these differences the attack and defense methods used for images and text cannot be directly applied to the tabular setup. We address these issues by proposing new cost and utility-aware threat models tailored to capabilities and constraints of attackers targeting tabular domains. We show that our approach is effective on two tabular datasets corresponding to applications for which attacks can have economic and social implications. |
🔗 |
-
|
Scoring Black-Box Models for Adversarial Robustness
(
Oral
)
>
link
Deep neural networks are susceptible to adversarial inputs, and various methods have been proposed to defend these models against adversarial attacks under different perturbation models. The robustness of models to adversarial attacks has been analyzed by first constructing adversarial inputs for the model and then testing the model's performance on the constructed adversarial inputs. Most of these attacks require white-box access to the model, need access to data labels, and finding adversarial inputs can be computationally expensive. We propose a simple scoring method for black-box models that indicates their robustness to adversarial inputs. We show that adversarially more robust models have a smaller $l_1$-norm of \textsc{Lime} weights and sharper explanations.
|
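A rough sketch of the scoring idea: fit a sparse local linear surrogate around an input (as LIME does) and use the $\ell_1$ norm of the surrogate weights as a robustness indicator, smaller norms suggesting a more robust model. The surrogate below is a hand-rolled Lasso fit rather than the LIME library, and the Gaussian sampling scheme is an assumption.

import numpy as np
from sklearn.linear_model import Lasso

def local_l1_score(predict_fn, x, n_samples=500, sigma=0.1, alpha=0.01):
    """L1 norm of local linear surrogate weights around x; per the scoring idea
    above, a lower value may indicate higher adversarial robustness.

    predict_fn: black-box model returning a scalar score for a batch of inputs.
    """
    x = np.asarray(x, dtype=float)
    perturbations = x + sigma * np.random.randn(n_samples, x.size)
    targets = predict_fn(perturbations)
    surrogate = Lasso(alpha=alpha).fit(perturbations - x, targets)
    return float(np.abs(surrogate.coef_).sum())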
🔗 |
-
|
When Can Linear Learners be Robust to Indiscriminate Poisoning Attacks?
(
Oral
)
>
link
We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error. Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners. For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum risk of the induced model at a given poisoning budget. Our results prove that linear learners can indeed be robust to indiscriminate poisoning if the class-wise data distributions are well-separated with low variance and the size of the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in empirical attack performance of the state-of-the-art poisoning attacks across benchmark datasets, making an important initial step towards understanding the underlying reasons some learning tasks are vulnerable to data poisoning attacks. |
🔗 |
-
|
Context-Aware Self-Adaptation for Domain Generalization
(
Oral
)
>
link
Domain generalization aims at developing suitable learning algorithms in source training domains such that the model learned can generalize well on a different, unseen testing domain. We present a novel two-stage approach called Context-Aware Self-Adaptation (CASA) for domain generalization. CASA simulates an approximate meta-generalization scenario and incorporates a self-adaptation module to adjust pre-trained meta-source models to the meta-target domains while maintaining their predictive capability on the meta-source domains. The core concept of self-adaptation involves leveraging contextual information, such as the mean of mini-batch features, as domain knowledge to automatically adapt a model trained in the first stage to new contexts in the second stage. Lastly, we utilize an ensemble of multiple meta-source models to perform inference on the testing domain. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on standard benchmarks. |
🔗 |
-
|
Label Noise: Correcting a Correction Loss
(
Oral
)
>
link
Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels. To address this issue, researchers have explored alternative loss functions that aim to be more robust. However, many of these alternatives are heuristic in nature and still vulnerable to overfitting or underfitting. In this work, we propose a more direct approach to tackling overfitting caused by label noise. We observe that the presence of label noise implies a lower bound on the noisy generalised risk. Building upon this observation, we propose imposing a lower bound on the empirical risk during training to mitigate overfitting. Our main contribution is providing theoretical results that yield explicit, easily computable bounds on the minimum achievable noisy risk for different loss functions. We empirically demonstrate that using these bounds significantly enhances robustness in various settings, with virtually no additional computational cost. |
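A sketch of how a known lower bound on the achievable noisy risk can be imposed during training: the mini-batch loss is not pushed below the bound, and the update direction flips when it falls under it (similar in spirit to the "flooding" trick). The bound value b would come from the paper's theoretical formulas; here it is simply a parameter.

def bounded_risk_loss(loss, b):
    """Keep the empirical (noisy) risk above a lower bound b during training.

    loss: scalar mini-batch loss (a torch tensor); b: theoretical lower bound.
    The absolute-value trick ascends when loss < b and descends when loss > b.
    """
    return (loss - b).abs() + b

# usage sketch: bounded_risk_loss(criterion(model(x), y_noisy), b).backward()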
🔗 |
-
|
Robust Semantic Segmentation: Strong Adversarial Attacks and Fast Training of Robust Models
(
Oral
)
>
link
While a large amount of work has focused on designing adversarial attacks against image classifiers, only a few methods exist to attack semantic segmentation models. We show that attacking segmentation models presents task-specific challenges, for which we propose novel solutions. Our final evaluation protocol outperforms existing methods, and shows that those can overestimate the robustness of the models. Additionally, so far adversarial training, the most successful way for obtaining robust image classifiers, could not be successfully applied to semantic segmentation. We argue that this is because the task to be learned is more challenging, and requires significantly higher computational effort than for image classification. As a remedy, we show that by taking advantage of recent advances in robust ImageNet classifiers, one can train adversarially robust segmentation models at limited computational cost by fine-tuning robust backbones. |
🔗 |
-
|
Model-tuning Via Prompts Makes NLP Models Adversarially Robust
(
Oral
)
>
link
In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations, such as word-level synonym substitutions. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than modifying the model (by appending an MLP head), MVP instead modifies the input (by appending a prompt template). Across three classification datasets, MVP improves performance against adversarial word-level synonym substitutions by an average of 8% over standard methods and even outperforms adversarial training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in robust accuracy while maintaining clean accuracy. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of MLP-FT can be attributed to the misalignment between pre-training and fine-tuning tasks, and the randomly initialized MLP parameters. |
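A minimal sketch of prompt-based classification with a masked language model, the mechanism MVP builds on: the input is wrapped in a template containing a mask token, and class scores are read off the mask position's logits for a small verbalizer vocabulary. The model name, template, and verbalizer below are placeholders, not the paper's exact choices.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
verbalizer = {"negative": " terrible", "positive": " great"}  # placeholder verbalizer

def prompt_classify(text):
    """Score each class by the MLM logit of its verbalizer token at the mask."""
    prompt = f"{text} It was {tokenizer.mask_token}."  # placeholder template
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    scores = {label: logits[tokenizer.encode(tok, add_special_tokens=False)[0]].item()
              for label, tok in verbalizer.items()}
    return max(scores, key=scores.get)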
🔗 |
-
|
Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
(
Oral
)
>
link
One of the remarkable properties of robust computer vision models is that their input-gradients are often aligned with human perception, referred to in the literature as perceptually-aligned gradients (PAGs). However, the underlying mechanisms behind these phenomena remain unknown. In this work, we provide a first explanation of PAGs via \emph{off-manifold robustness}, which states that models must be more robust off the data manifold than they are on-manifold. We first demonstrate theoretically that off-manifold robustness leads input gradients to lie approximately on the data manifold, explaining their perceptual alignment, and then confirm the same empirically for models trained with robustness regularizers. Quantifying the perceptual alignment of model gradients via their similarity with the gradients of generative models, we show that off-manifold robustness correlates well with perceptual alignment. Finally, based on the levels of on- and off-manifold robustness, we identify three different regimes of robustness that affect both perceptual alignment and model accuracy: weak robustness, Bayes-aligned robustness, and excessive robustness. |
🔗 |
-
|
Refined and Enriched Physics-based Captions For Unseen Dynamic Changes
(
Oral
)
>
link
Vision-Language Models (VLMs), e.g., CLIP trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be handled by transferring semantic knowledge from seen classes with the help of language models pre-trained only on text. Two-dimensional spatial relationships and a higher semantic level have been achieved. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP for refined and enriched qualitative and quantitative captions that come closer to what humans recognize by combining multiple DLs and VLMs. In particular, captions with physical scales and objects' surface properties are integrated via counting, visibility distance, and road conditions. Fine-tuned VLM models are also used, together with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental results on images with adversarial weather conditions, i.e., rain, snow, fog, landslide, and flooding, and traffic events, i.e., accidents, outperform state-of-the-art DLs and VLMs. A higher semantic level in captions for real-world scene descriptions is shown. |
🔗 |
-
|
Adaptive Certified Training: Towards Better Accuracy-Robustness Tradeoffs
(
Oral
)
>
link
As deep learning models continue to advance and are increasingly utilized in real-world systems, the issue of robustness remains a major challenge. Existing certified training methods produce models that achieve high provable robustness guarantees at certain perturbation levels. However, the main problem of such models is a dramatically low standard accuracy, i.e. accuracy on clean unperturbed data, that makes them impractical. In this work, we consider a more realistic perspective of maximizing the robustness of a model at certain levels of (high) standard accuracy. To this end, we propose a novel certified training method based on a key insight that training with adaptive certified radii helps to improve both the accuracy and robustness of the model, advancing state-of-the-art accuracy-robustness tradeoffs. We demonstrate the effectiveness of the proposed method on MNIST, CIFAR-10, and TinyImageNet datasets. Particularly, on CIFAR-10 and TinyImageNet, our method yields models with up to two times higher robustness, measured as an average certified radius of a test set, at the same levels of standard accuracy compared to baseline approaches. |
🔗 |
-
|
Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers
(
Oral
)
>
link
Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of temporal consistency makes them \textit{detectable} using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce \textit{perfect illusory attacks}, a novel form of adversarial attack on sequential decision-makers that is both effective and provably \textit{statistically undetectable}. We then propose a more versatile variant of these attacks, which results in observation transitions that are consistent with the state-transition function of the adversary-free environment and can be learned end-to-end. Compared to existing attacks, we empirically find this variant to be significantly harder to detect with automated methods, and a small study with human subjects (IRB approval under reference xxxxxx/xxxxx) suggests it is similarly harder for humans to detect. We propose that undetectability should be a central concern in the study of adversarial attacks on mixed-autonomy settings. |
🔗 |
-
|
Certified Calibration: Bounding Worst-Case Calibration under Adversarial Attacks
(
Oral
)
>
link
Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, certification methods have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. However, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier Score or the Expected Calibration Error. We show that attacks can significantly harm calibration, and thus propose certified calibration, providing worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds, via the solution of a mixed-integer program, on the Expected Calibration Error. |
🔗 |
-
|
Don't trust your eyes: on the (un)reliability of feature visualizations
(
Oral
)
>
link
How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding by theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include black-box neural networks. |
🔗 |
-
|
Classifier Robustness Enhancement Via Test-Time Transformation
(
Oral
)
>
link
It has been recently discovered that adversarially trained classifiers exhibit an intriguing property, referred to as perceptually aligned gradients (PAG). PAG implies that the gradients of such classifiers possess a meaningful structure, aligned with human perception. Adversarial training is currently the best-known way to achieve classification robustness under adversarial attacks. The PAG property, however, has yet to be leveraged for further improving classifier robustness. In this work, we introduce Classifier Robustness Enhancement Via Test-Time Transformation (TETRA) -- a novel defense method that utilizes PAG, enhancing the performance of trained robust classifiers. Our method operates in two phases. First, it modifies the input image via a designated targeted adversarial attack into each of the dataset's classes. Then, it classifies the input image based on the distance to each of the modified instances, with the assumption that the shortest distance relates to the true class. We show that the proposed method achieves state-of-the-art results and validate our claim through extensive experiments on a variety of defense methods, classifier architectures, and datasets. We also empirically demonstrate that TETRA can boost the accuracy of any differentiable adversarial training classifier across a variety of attacks, including ones unseen at training. Specifically, applying TETRA leads to substantial improvement of up to $+23\%$, $+20\%$, and $+26\%$ on CIFAR10, CIFAR100, and ImageNet, respectively.
|
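A rough PyTorch sketch of the two-phase procedure described above: a targeted attack (here a few PGD-like gradient steps, an assumed stand-in for the paper's designated attack) pulls the input toward each class, and the class whose modified instance is closest to the original input is predicted.

import torch
import torch.nn.functional as F

def tetra_predict(model, x, num_classes, steps=10, step_size=0.01):
    """Classify x by the class whose targeted modification moves x the least."""
    distances = []
    for c in range(num_classes):
        target = torch.full((x.shape[0],), c, dtype=torch.long, device=x.device)
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), target)
            grad, = torch.autograd.grad(loss, x_adv)
            x_adv = (x_adv - step_size * grad.sign()).detach()  # targeted step toward class c
        distances.append((x_adv - x).flatten(1).norm(dim=1))
    return torch.stack(distances, dim=1).argmin(dim=1)  # shortest distance -> predicted class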
🔗 |
-
|
CertViT: Certified Robustness of Pre-Trained Vision Transformers
(
Oral
)
>
link
Lipschitz bounded neural networks are certifiably robust and have a good trade-off between clean and certified accuracy. Existing Lipschitz bounding methods train from scratch and are limited to moderately sized networks (< 6M parameters). They require a fair amount of hyper-parameter tuning and are computationally prohibitive for large networks like Vision Transformers (5M to 660M parameters). Obtaining certified robustness of transformers is thus not feasible due to the non-scalability and inflexibility of current methods. This work presents CertViT, a two-step proximal-projection method to achieve certified robustness from pre-trained weights. The proximal step tries to lower the Lipschitz bound, and the projection step tries to maintain the clean accuracy of the pre-trained weights. We show that CertViT networks have better certified accuracy than state-of-the-art Lipschitz-trained networks. We apply CertViT to several variants of pre-trained vision transformers and show adversarial robustness using standard attacks. Code: \url{https://github.com/sagarverma/transformer-lipschitz} |
🔗 |
-
|
Transferable Adversarial Perturbations between Self-Supervised Speech Recognition Models
(
Oral
)
>
link
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition (ASR) system to output attacker-chosen text. To exploit ASR models in real-world, black-box settings, an adversary can leverage the \textit{transferability} property, i.e., that an adversarial sample produced for a proxy ASR can also fool a different remote ASR. Recent work has shown that transferability against large ASR models is extremely difficult. In this work, we show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are uniquely affected by transferability. We successfully demonstrate this phenomenon by evaluating state-of-the-art self-supervised ASR models like Wav2Vec2, HuBERT, Data2Vec and WavLM. We show that with relatively low-level additive noise achieving a 30 dB Signal-to-Noise Ratio, we can achieve target transferability with up to 80\% accuracy. We then use an ablation study to show that Self-Supervised Learning is a major cause of this phenomenon. Our results present a dual interest: they show that modern ASR architectures are uniquely vulnerable to adversarial security threats, and they help in understanding the specificities of SSL training paradigms. |
🔗 |
-
|
Tunable Dual-Objective GANs for Stable Training
(
Oral
)
>
link
In an effort to address the training instabilities of GANs, we introduce a class of dual-objective GANs with different value functions (objectives) for the generator (G) and discriminator (D). In particular, we model each objective using $\alpha$-loss, a tunable classification loss, to obtain $(\alpha_D,\alpha_G)$-GANs, parameterized by $(\alpha_D,\alpha_G)\in (0,\infty]^2$. For sufficiently large number of samples and capacities for G and D, we show that the resulting non-zero sum game simplifies to minimizing an $f$-divergence under appropriate conditions on $(\alpha_D,\alpha_G)$. We highlight the value of tuning $(\alpha_D,\alpha_G)$ in alleviating training instabilities for the synthetic 2D Gaussian mixture ring, the Celeb-A, and the LSUN Classroom datasets.
|
🔗 |
-
|
MLSMM: Machine Learning Security Maturity Model
(
Oral
)
>
link
Assessing the maturity of security practices during the development of Machine Learning (ML) based software components has not gotten as much attention as traditional software development. In this Blue Sky idea paper, we propose an initial Machine Learning Security Maturity Model (MLSMM) which organizes security practices along the ML development lifecycle and, for each, establishes three levels of maturity. We envision MLSMM as a step towards closer collaboration between industry and academia. |
🔗 |
-
|
Adversarial Training Should Be Cast as a Non-Zero-Sum Game
(
Oral
)
>
link
One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially-chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness, and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation naturally yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting. |
🔗 |
-
|
Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change
(
Oral
)
>
link
Adversarial robustness poses a critical challenge in the deployment of deep learning models for real-world applications. Traditional approaches to adversarial training and supervised detection rely on prior knowledge of attack types and access to labeled training data, which is often impractical. Existing unsupervised adversarial detection methods identify whether the target model works properly, but they suffer from poor accuracy owing to the use of the common cross-entropy training loss, which relies on unnecessary features and strengthens adversarial attacks. We propose new training losses that reduce reliance on such features, together with a corresponding detection method that requires no prior knowledge of adversarial attacks. The detection rate (true positive rate) against all given white-box attacks is above 93.9\% except for attacks without limits (DF($\infty$)), while the false positive rate is barely 2.5\%. The proposed method works well across all tested attack types, and its false positive rates are even better than those of methods specialized for certain attack types.
|
🔗 |
-
|
Stabilizing GNN for Fairness via Lipschitz Bounds
(
Oral
)
>
link
The Lipschitz bound, a technique from robust statistics, limits the maximum changes in output with respect to the input, considering associated irrelevant biased factors. It provides an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. However, there has been no previous research investigating the Lipschitz bounds for Graph Neural Networks (GNNs), especially in the context of non-Euclidean data with inherent biases. This poses a challenge for constraining GNN output perturbations induced by input biases and ensuring fairness during training. This paper addresses this gap by formulating a Lipschitz bound for GNNs operating on attributed graphs, and analyzing how the Lipschitz constant can constrain output perturbations induced by biases for fairness training. The effectiveness of the Lipschitz bound is experimentally validated in limiting model output biases. Additionally, from a training dynamics perspective, we demonstrate how the theoretical Lipschitz bound can effectively guide GNN training to balance accuracy and fairness. |
🔗 |
-
|
Equal Long-term Benefit Rate: Adapting Static Fairness Notions to Sequential Decision Making
(
Oral
)
>
link
Decisions made by machine learning models may have lasting impacts over time, making long-term fairness a crucial consideration. It has been shown that, when ignoring the long-term effect of decisions, naively imposing fairness criteria in static settings can actually exacerbate bias over time. To explicitly address biases in sequential decision-making, recent works formulate long-term fairness notions in the Markov Decision Process (MDP) framework. They define the long-term bias to be the sum of static bias over each time step. However, we demonstrate that naively summing up the step-wise bias can cause a false sense of fairness since it fails to consider the difference in importance of states during transition. In this work, we introduce a new long-term fairness notion called Equal Long-term Benefit Rate (ELBERT), which explicitly considers state importance and can preserve the semantics of static fairness principles in the sequential setting. Moreover, we show that the policy gradient of the Long-term Benefit Rate can be analytically reduced to a standard policy gradient. This makes standard policy optimization methods applicable for reducing the bias, leading to our proposed bias mitigation method ELBERT-PO. Experiments on three dynamical environments show that ELBERT-PO successfully reduces bias and maintains high utility. |
🔗 |
-
|
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
(
Oral
)
>
link
Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks of these models for malicious purposes. Dual use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security, and the capabilities of these instruction-following LLMs provide strong economic incentives for dual use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at a cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and that addressing these attacks may require new approaches to mitigations. |
🔗 |
-
|
Certifying Ensembles: A General Certification Theory with S-Lipschitzness
(
Oral
)
>
link
Improving and guaranteeing the robustness of deep learning models has been a topic of intense research. Ensembling, which combines several classifiers to provide a better model, has been shown to be beneficial for generalisation, uncertainty estimation, calibration, and mitigating the effects of concept drift. However, the impact of ensembling on certified robustness is less well understood. In this work, we generalise Lipschitz continuity by introducing S-Lipschitz classifiers, which we use to analyse the theoretical robustness of ensembles. Our results give precise conditions under which ensembles of robust classifiers are more robust than any constituent classifier, as well as conditions under which they are less robust. |
🔗 |
-
|
On the Limitations of Model Stealing with Uncertainty Quantification Models
(
Oral
)
>
link
Model stealing aims at inferring a victim model's functionality at a fraction of the original training cost. While the goal is clear, in practice the model's architecture, weight dimension, and original training data cannot be determined exactly, leading to mutual uncertainty during stealing. In this work, we explicitly tackle this uncertainty by generating multiple possible networks and combining their predictions to improve the quality of the stolen model. For this, we compare five popular uncertainty quantification models in a model stealing task. Surprisingly, our results indicate that the considered models only lead to marginal improvements in terms of label agreement (i.e., fidelity) to the stolen model. To find the cause of this, we inspect the diversity of the models' predictions by looking at the prediction variance as a function of training iterations. We realize that during training, the models tend to have similar predictions, indicating that the network diversity we wanted to leverage using uncertainty quantification models is not (high) enough for improvements on the model stealing task. |
🔗 |
-
|
The Challenge of Differentially Private Screening Rules
(
Oral
)
>
link
Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data science. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models. However, despite the increasing need for privacy-preserving models for data analysis, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we discover difficulties in the task of making a useful private screening rule due to the amount of noise added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself and not the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method is an open problem in the differential privacy literature.
|
🔗 |
-
|
PAC-Bayesian Adversarially Robust Generalization Bounds for Deep Neural Networks
(
Oral
)
>
link
Deep neural networks (DNNs) are vulnerable to adversarial attacks. It is found empirically that adversarially robust generalization is crucial in establishing defense algorithms against adversarial attacks. Therefore, it is interesting to study the theoretical guarantee of robust generalization. This paper focuses on PAC-Bayes analysis (Neyshabur et al., 2017). The main challenge lies in extending the key ingredient, which is a weight perturbation bound in standard settings, to the robust settings. Existing attempts heavily rely on additional strong assumptions, leading to loose bounds. In this paper, we address this issue and provide a spectrally-normalized robust generalization bound for DNNs. Our bound is at least as tight as the standard generalization bound, differing only by a factor of the perturbation strength $\epsilon$. In comparison to existing robust generalization bounds, our bound offers two significant advantages: 1) it does not depend on additional assumptions, and 2) it is considerably tighter. We present a framework that enables us to derive more general results. Specifically, we extend the main result to 1) adversarial robustness against general non-$\ell_p$ attacks, and 2) other neural network architectures, such as ResNet.
|
🔗 |
-
|
Sentiment Perception Adversarial Attacks on Neural Machine Translation Systems
(
Oral
)
>
link
With the advent of deep learning methods, Neural Machine Translation (NMT) systems have become increasingly powerful. However, deep learning based systems are susceptible to adversarial attacks, where imperceptible changes to the input can cause undesirable changes at the output of the system. To date there has been little work investigating adversarial attacks on sequence-to-sequence systems, such as NMT models. Previous work in NMT has examined attacks with the aim of introducing target phrases in the output sequence. In this work, adversarial attacks for NMT systems are explored from an output perception perspective. Thus the aim of an attack is to change the perception of the output sequence, without altering the perception of the input sequence. For example, an adversary may distort the sentiment of translated reviews to have an exaggerated positive sentiment. In practice it is challenging to run extensive human perception experiments, so a proxy deep-learning classifier applied to the NMT output is used to measure perception changes. Experiments demonstrate that the sentiment perception of NMT systems' output sequences can be changed significantly with small imperceptible changes to input sequences. |
🔗 |
-
|
(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
(
Oral
)
>
link
We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give \emph{estimates} that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. Our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100\% of the time. The bound is inspired by $\mathcal{H}\Delta\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous guarantees. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a ``disagreement loss'' which is theoretically justified and performs better in practice. Across a wide range of benchmarks, our method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines.
|
🔗 |
-
|
Feature Partition Aggregation: A Fast Certified Defense Against a Union of $\ell_0$ Attacks
(
Oral
)
>
link
Sparse or $\ell_0$ adversarial attacks arbitrarily perturb an unknown subset of the features. $\ell_0$ robustness analysis is particularly well-suited for heterogeneous (tabular) data where features have different types or scales. State-of-the-art $\ell_0$ certified defenses are based on randomized smoothing and apply to evasion attacks only. This paper proposes feature partition aggregation (FPA) - a certified defense against the union of $\ell_0$ evasion, backdoor, and poisoning attacks. FPA generates its stronger robustness guarantees via an ensemble whose submodels are trained on disjoint feature sets. Compared to state-of-the-art $\ell_0$ defenses, FPA is up to $3,000\times$ faster and provides median robustness guarantees up to $4\times$ larger, meaning FPA provides the additional dimensions of robustness essentially for free.
|
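A minimal sketch of the ensemble structure behind FPA: submodels are trained on disjoint feature subsets and predictions are aggregated by plurality vote, and the vote gap then yields a simple robustness count. The half-the-gap certificate below is a simplified illustration (assuming integer class labels), not the paper's exact guarantee.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fpa(X, y, n_submodels=5, seed=0):
    """Train submodels on disjoint feature partitions (sketch)."""
    rng = np.random.default_rng(seed)
    partition = np.array_split(rng.permutation(X.shape[1]), n_submodels)
    models = [LogisticRegression(max_iter=1000).fit(X[:, idx], y) for idx in partition]
    return models, partition

def fpa_predict(models, partition, x):
    """Plurality vote over submodels, plus a naive robustness gap (sketch)."""
    votes = np.array([m.predict(x[:, idx]) for m, idx in zip(models, partition)])  # (k, n)
    preds, gaps = [], []
    for col in votes.T:
        counts = np.bincount(col)
        order = np.sort(counts)
        preds.append(int(counts.argmax()))
        runner_up = order[-2] if len(order) > 1 else 0
        # each perturbed feature can flip at most one submodel's vote
        gaps.append(int(order[-1] - runner_up) // 2)
    return np.array(preds), np.array(gaps)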
🔗 |
-
|
Near Optimal Adversarial Attack on UCB Bandits
(
Oral
)
>
link
I study a stochastic multi-arm bandit problem where rewards are subject to adversarial corruption. I propose a novel attack strategy that manipulates a learner employing the upper-confidence-bound (UCB) algorithm into pulling some non-optimal target arm $T - o(T)$ times with a cumulative cost that scales as $\hat{O}(\sqrt{\log T})$, where $T$ is the number of rounds. I also prove the first lower bound on the cumulative attack cost. The lower bound matches the upper bound up to $O(\log \log T)$ factors, showing the proposed attack strategy to be near optimal.
|
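A rough sketch of the style of reward-corruption attack the abstract refers to, in the spirit of prior attacks on UCB (e.g., Jun et al., 2018): whenever the learner pulls a non-target arm, the observed reward is pushed down just enough that the arm's empirical mean stays below the target arm's mean by a margin. The margin schedule below is a placeholder assumption, not the paper's near-optimal strategy.

import numpy as np

def corrupt_reward(pulled_arm, reward, target_arm, emp_mean, emp_count, t):
    """Return a corrupted reward for the pulled arm (sketch of a UCB attack).

    emp_mean / emp_count: per-arm running statistics the attacker maintains,
    updated outside this function with the corrupted rewards it returns.
    """
    if pulled_arm == target_arm:
        return reward  # never corrupt the target arm
    n = emp_count[pulled_arm] + 1
    margin = np.sqrt(2 * np.log(t + 1) / n)      # placeholder confidence-style margin
    ceiling = emp_mean[target_arm] - 2 * margin  # keep this arm's mean below the target's
    # choose the corrupted reward so the arm's new empirical mean is at most `ceiling`
    return min(reward, n * ceiling - emp_count[pulled_arm] * emp_mean[pulled_arm])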
🔗 |
-
|
Learning Exponential Families from Truncated Samples
(
Oral
)
>
link
Missing data problems have many manifestations across many scientific fields. A fundamental type of missing data problem arises when samples are \textit{truncated}, i.e., samples that lie in a subset of the support are not observed. Statistical estimation from truncated samples is a classical problem in statistics which dates back to Galton, Pearson, and Fisher. A recent line of work provides the first efficient estimation algorithms for the parameters of a Gaussian distribution and for linear regression with Gaussian noise. In this paper we generalize these results to log-concave exponential families. We provide an estimation algorithm that shows that \textit{extrapolation} is possible for a much larger class of distributions while maintaining polynomial sample and time complexity. Our algorithm is based on Projected Stochastic Gradient Descent and is not only applicable in a more general setting but is also simpler and more efficient than recent algorithms. Our work also has interesting implications for learning general log-concave distributions and sampling given only access to truncated data. |
🔗 |
-
|
Identifying Adversarially Attackable and Robust Samples
(
Oral
)
>
link
Adversarial attacks insert small, imperceptible perturbations into input samples that cause large, undesired changes to the output of deep learning models. Despite extensive research on generating adversarial attacks and building defense systems, there has been limited research on understanding adversarial attacks from an input-data perspective. This work introduces the notion of sample attackability, where we aim to identify samples that are most susceptible to adversarial attacks (attackable samples) and, conversely, the least susceptible samples (robust samples). We propose a deep-learning-based detector to identify the adversarially attackable and robust samples in an unseen dataset for an unseen target model. Experiments on standard image classification datasets enable us to assess the portability of the deep attackability detector across a range of architectures. We find that the deep attackability detector performs better than simple model uncertainty-based measures for identifying the attackable/robust samples. This suggests that uncertainty is an inadequate proxy for measuring sample distance to a decision boundary. In addition to advancing our understanding of adversarial attacks, the ability to identify the adversarially attackable and robust samples has implications for improving the efficiency of sample-selection tasks. |
🔗 |
-
|
Toward Testing Deep Learning Library via Model Fuzzing
(
Oral
)
>
link
The irreversible tendency to empower industry with deep learning (DL) capabilities is raising new security challenges. A DL-based system will be vulnerable to serious attacks if the vulnerabilities of underlying DL frameworks (e.g., TensorFlow, PyTorch) are exploited. It is crucial to test DL frameworks to bridge the gap between security requirements and deployment urgency. A specifically designed model fuzzing method will be used in my research project to address this challenge. First, we generate diverse models to test library implementations in the training and prediction phases using optimized mutation strategies. Furthermore, we consider a seed performance score, including coverage, discovery time, and number of mutations, when selecting model seeds with higher priority. Our algorithm also selects the optimal mutation strategy based on heuristics to expand inconsistencies. Finally, to evaluate the effectiveness of our scheme, we implement our test framework and conduct experiments on PyTorch, TensorFlow, and Theano. The preliminary results demonstrate that this is a promising direction that is worth further research. |
🔗 |
-
|
Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey
(
Oral
)
>
link
Explainable artificial intelligence (XAI) methods are portrayed as a remedy for debugging and trusting statistical and deep learning models, as well as interpreting their predictions. However, recent advances in adversarial machine learning highlight the limitations and vulnerabilities of state-of-the-art explanations, putting their security and trustworthiness into question. The possibility of manipulating, fooling or fairwashing evidence of the model's reasoning has detrimental consequences when applied in high-stakes decision-making and knowledge discovery. This concise survey of over 50 papers summarizes research concerning adversarial attacks on explanations of machine learning models, as well as fairness metrics. We discuss how to defend against attacks and design robust interpretation methods. We contribute a list of existing insecurities in XAI and outline the emerging research directions in adversarial XAI (AdvXAI). |
🔗 |
-
|
Sharpness-Aware Minimization Alone can Improve Adversarial Robustness
(
Oral
)
>
link
Sharpness-Aware Minimization (SAM) is an effective method for improving generalization ability by regularizing loss sharpness. In this paper, we explore SAM in the context of adversarial robustness. We find that using only SAM can achieve superior adversarial robustness without sacrificing clean accuracy compared to standard training, which is an unexpected benefit. We also discuss the relation between SAM and adversarial training (AT), a popular method for improving the adversarial robustness of DNNs. In particular, we show that SAM and AT differ in terms of perturbation strength, leading to different accuracy and robustness trade-offs. We provide theoretical evidence for these claims in a simplified model. Finally, while AT suffers from decreased clean accuracy and computational overhead, we suggest that SAM can be regarded as a lightweight substitute for AT under certain requirements. Code is available at https://github.com/weizeming/SAM_AT. |
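A minimal PyTorch sketch of one SAM update (perturb the weights along the gradient direction within an L2 ball of radius rho, then descend using the gradient computed at the perturbed point); this is the generic SAM step, not the paper's exact training recipe, and rho is an illustrative default.

import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step (sketch)."""
    # 1) ascent step: perturb weights toward higher loss within an L2 ball of radius rho
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()
    # 2) descent step: gradient at the perturbed weights, applied to the original weights
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)  # restore the original weights before the optimizer update
    optimizer.step()
    optimizer.zero_grad()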
🔗 |
-
|
On feasibility of intent obfuscating attacks
(
Oral
)
>
link
Intent obfuscation is a common tactic in adversarial situations, enabling the attacker to both manipulate the target system and avoid culpability. Surprisingly, it has rarely been implemented in adversarial attacks on machine learning systems. We are the first to propose incorporating intent obfuscation in generating adversarial examples for object detectors: by perturbing another non-overlapping object to disrupt the target object, the attacker hides their intended target. We conduct a randomized experiment on 5 prominent detectors---YOLOv3, SSD, RetinaNet, Faster R-CNN, and Cascade R-CNN---using both targeted and untargeted attacks and achieve success on all models and attacks. We analyze the success factors characterizing intent obfuscating attacks, including target object confidence and perturb object sizes. We then demonstrate that the attacker can exploit these success factors to increase success rates for all models and attacks. Finally, we discuss known defenses and legal repercussions. |
🔗 |
-
|
Adversarial Training with Generated Data in High-Dimensional Regression: An Asymptotic Study
(
Oral
)
>
link
In recent years, studies such as \cite{carmon2019unlabeled,gowal2021improving,xing2022artificial} have demonstrated that incorporating additional real or generated data with pseudo-labels can enhance adversarial training through a two-stage training approach. In this paper, we perform a theoretical analysis of the asymptotic behavior of this method in high-dimensional linear regression. While a double-descent phenomenon can be observed in ridgeless training, with an appropriate $\mathcal{L}_2$ regularization, the two-stage adversarial training achieves a better performance. Finally, we derive a shortcut cross-validation formula specifically tailored for the two-stage training method.
|
🔗 |