Workshop
2nd ICML Workshop on New Frontiers in Adversarial Machine Learning
Sijia Liu · Pin-Yu Chen · Dongxiao Zhu · Eric Wong · Kathrin Grosse · Baharan Mirzasoleiman · Sanmi Koyejo
Ballroom A
Given the success of AdvML-inspired research, we propose a new edition of our ICML'22 workshop (AdvML-Frontiers'22): 'The 2nd Workshop on New Frontiers in AdvML' (AdvML-Frontiers'23). We aim for a high-quality international workshop, coupled with new scientific activities, networking opportunities, and enjoyable social events. Scientifically, we aim to identify the challenges and limitations of current AdvML methods and to explore new perspectives and constructive views for next-generation AdvML across the full theory/algorithm/application stack. As the sequel to AdvML-Frontiers'22, we will continue exploring the frontiers of AdvML in theoretical understanding, scalable algorithm and system designs, and scientific development that transcends traditional disciplinary boundaries. We will also add new features and programs in 2023. First, we will expand the existing research themes, particularly in light of the popularity of large foundation models (e.g., DALL-E 2, Stable Diffusion, and ChatGPT). Example topics include AdvML for prompt learning, counteracting AI-synthesized fake images and text, debugging ML from unified data-model perspectives, and 'green' AdvML for environmental sustainability. Second, we will organize a new session, AI Trust in Industry, inviting industry experts to introduce practical trends in AdvML, technological innovations, products, and societal impacts (e.g., AI's responsibility). Third, we will host Show-and-Tell Demos in the poster session to allow demonstrations of innovations by research and engineering groups from industry, academia, and government. Fourth, we will collaborate with 'Black in AI' (where co-organizer Dr. Sanmi Koyejo serves as president) to increase the presence and inclusion of Black people in the field of AdvML by creating spaces for sharing ideas and networking.
Schedule
Fri 11:50 a.m. - 12:00 p.m. | Opening
The opening remarks of the workshop.
Fri 12:00 p.m. - 12:30 p.m. | Una-May O'Reilly (Keynote)
Bio: Una-May O'Reilly is the leader of the ALFA Group at MIT-CSAIL. An AI and machine learning researcher for 20+ years, she is broadly interested in artificial adversarial intelligence -- the notion that competition has complex dynamics due to learning and adaptation signaled by experiential feedback. This interest directs her to the topic of security, where she has developed machine learning algorithms that variously consider the arms races of malware, network, and model attacks and the uses of adversarial inputs on deep learning models. Her passions are evolutionary computation and programming, which frequently lead her to investigate Genetic Programming, as well as the coevolutionary dynamics between populations of cooperative agents or adversaries, in settings as general as cybersecurity and machine learning.
Talk: Adversarial Intelligence Supported by Machine Learning
Abstract: My interest is in computationally replicating the behavior of adversaries who aim algorithms/code/scripts at vulnerable targets, and of the defenders who try to stop the threats. I typically consider networks as targets, but let's consider the most recent ML models -- foundation models. How do goals blur in the current context, where the community is trying to simultaneously address their safety and security?
Fri 12:30 p.m. - 1:00 p.m. | Lea Schönherr (Keynote)
Bio: Lea Schönherr has been a tenure-track faculty member at the CISPA Helmholtz Center for Information Security since 2022. She obtained her PhD from Ruhr-Universität Bochum, Germany, in 2021 and is a recipient of two fellowships from UbiCrypt (DFG Graduate School) and CASA (DFG Cluster of Excellence). Her research interests are in information security with a focus on adversarial machine learning and generative models to defend against real-world threats. She is particularly interested in language as an interface to machine learning models and in combining different domains such as audio, text, and images. She has published several papers on threat detection and defense for speech recognition systems and generative models.
Title: Brave New World: Challenges and Threats in Multimodal AI Agent Integrations
Abstract: AI agents are on the rise, becoming more integrated into our daily lives, and will soon be indispensable for countless downstream tasks, be it translation, text enhancement, summarization, or other assisting applications like code generation. As of today, the human-agent interface is no longer limited to plain text, and large language models (LLMs) can handle documents, videos, images, audio, and more. In addition, the generation of various multimodal outputs is becoming more advanced and realistic in appearance, allowing for more sophisticated communication with AI agents. In the future in particular, interactions with AI agents will rely on a more natural-feeling voice interface. In this presentation, we will take a closer look at the resulting challenges and security threats associated with integrated multimodal AI agents, which fall into two categories: malicious inputs used to jailbreak LLMs, and computer-generated output that is indistinguishable from human-generated content. In the first case, specially designed inputs are used to exploit an LLM or its embedding system, also referred to as prompt hacking. Existing attacks show that content filters of LLMs can easily be bypassed with specific inputs and that private information can be leaked. Additional input modalities, such as speech, allow for a much broader potential attack surface that needs to be investigated and protected. In the second case, generative models are used to produce fake content that is nearly impossible to distinguish from human-generated content. Such fake content is often used for fraud, manipulation, and impersonation, and realistic fake news is already possible using a variety of techniques. As these models continue to evolve, detecting these fraudulent activities will become increasingly difficult, while the attacks themselves will become easier to automate and require less expertise. This creates significant challenges for preventing fraud and the uncontrolled spread of fake news.
Fri 1:00 p.m. - 1:10 p.m. | Adversarial Training Should Be Cast as a Non-Zero-Sum Game (Oral)
One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the surrogate-based relaxation commonly employed in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation naturally yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting.
Alex Robey · Fabian Latorre · George J. Pappas · Hamed Hassani · Volkan Cevher
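For readers unfamiliar with the zero-sum paradigm the abstract critiques, below is a minimal PyTorch-style sketch of standard adversarial training with a PGD inner maximization of the surrogate loss. It illustrates the min-max setup being criticized, not the authors' bilevel formulation, and all hyperparameters and names are illustrative.
```python
# Sketch of the standard two-player zero-sum adversarial training step
# (PGD inner maximization of the surrogate loss). Illustrative only.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: find an l_inf-bounded perturbation maximizing the surrogate loss."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: take a gradient step on the adversarially perturbed batch."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```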
Fri 1:10 p.m. - 1:20 p.m. | Evading Black-box Classifiers Without Breaking Eggs (Oral)
Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc.). Queries to such systems carry a fundamentally *asymmetric cost*: queries detected as "bad" come at a higher cost because they trigger additional security filters, e.g., usage throttling or account suspension. Yet, we find that existing decision-based attacks issue a large number of "bad" queries, which likely renders them ineffective against security-critical systems. We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries. We thus pose it as an open problem to build black-box attacks that are more effective under realistic cost metrics.
Edoardo Debenedetti · Nicholas Carlini · Florian Tramer
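The asymmetric cost metric argued for in the abstract can be made concrete with a small accounting wrapper around the black-box oracle. This is a hypothetical illustration of the bookkeeping only, not the authors' attack code; `classifier` and `bad_label` are placeholder names.
```python
# Illustrative accounting of the asymmetric query cost: "bad" queries are tallied
# separately because they carry a higher real-world cost than ordinary queries.
class CostAwareOracle:
    """Counts total queries and the costlier 'bad'-labeled queries separately."""
    def __init__(self, classifier, bad_label=1):
        self.classifier = classifier   # black-box decision function: x -> label
        self.bad_label = bad_label     # label that triggers extra security filters
        self.total_queries = 0
        self.bad_queries = 0

    def query(self, x):
        self.total_queries += 1
        label = self.classifier(x)
        if label == self.bad_label:
            self.bad_queries += 1      # expensive: may cause throttling or account suspension
        return label
```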
Fri 1:20 p.m. - 1:30 p.m. | Tunable Dual-Objective GANs for Stable Training (Oral)
In an effort to address the training instabilities of GANs, we introduce a class of dual-objective GANs with different value functions (objectives) for the generator (G) and discriminator (D). In particular, we model each objective using $\alpha$-loss, a tunable classification loss, to obtain $(\alpha_D,\alpha_G)$-GANs, parameterized by $(\alpha_D,\alpha_G)\in (0,\infty]^2$. For a sufficiently large number of samples and capacities for G and D, we show that the resulting non-zero-sum game simplifies to minimizing an $f$-divergence under appropriate conditions on $(\alpha_D,\alpha_G)$. We highlight the value of tuning $(\alpha_D,\alpha_G)$ in alleviating training instabilities for the synthetic 2D Gaussian mixture ring, the Celeb-A, and the LSUN Classroom datasets.
Monica Welfert · Kyle Otstot · Gowtham Kurri · Lalitha Sankar
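As background, one common parameterization of the tunable $\alpha$-loss on the probability assigned to the true class is sketched below; it recovers cross-entropy as $\alpha \to 1$ and a soft 0-1 loss as $\alpha \to \infty$. The exact form used in the $(\alpha_D,\alpha_G)$-GAN objectives may differ, so treat this as an assumption-laden illustration.
```python
# One common form of the tunable alpha-loss: alpha/(alpha-1) * (1 - p^((alpha-1)/alpha)).
# Illustrative; the paper's exact parameterization may differ.
import math

def alpha_loss(p, alpha):
    """Tunable loss on the probability p assigned to the true class."""
    if math.isclose(alpha, 1.0):
        return -math.log(p)                 # cross-entropy limit (alpha -> 1)
    if math.isinf(alpha):
        return 1.0 - p                      # soft 0-1 loss limit (alpha -> infinity)
    return (alpha / (alpha - 1.0)) * (1.0 - p ** ((alpha - 1.0) / alpha))

# Smaller alpha penalizes confident mistakes more heavily; alpha = infinity is the gentlest.
print(alpha_loss(0.9, 1.0), alpha_loss(0.9, float("inf")))
```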
Fri 1:30 p.m. - 2:00 p.m. | Jihun Hamm (Keynote)
Bio: Dr. Jihun Hamm has been an Associate Professor of Computer Science at Tulane University since 2019. He received his PhD from the University of Pennsylvania in 2008, supervised by Dr. Daniel Lee. Dr. Hamm's research interest is in machine learning, from theory to applications. He has worked on the theory and practice of robust learning, adversarial learning, privacy and security, optimization, and deep learning. Dr. Hamm also has a background in biomedical engineering and has worked on machine learning applications in medical data analysis. His work in machine learning has been published in top venues such as ICML, NeurIPS, CVPR, JMLR, and IEEE-TPAMI, as well as in medical research venues such as MICCAI, MedIA, and IEEE-TMI. Among other awards, he has earned the Best Paper Award from MedIA, was a finalist for the MICCAI Young Scientist Publication Impact Award, and received a Google Faculty Research Award.
Title: Analyzing Transfer Learning Bounds through Distributional Robustness
Abstract: The success of transfer learning at improving performance, especially with the use of large pre-trained models, has made transfer learning an essential tool in the machine learning toolbox. However, the conditions under which performance transferability to downstream tasks is possible are not very well understood. In this talk, I will present several approaches to bounding the target-domain classification loss through the distribution shift between the source and the target domains. For domain adaptation/generalization problems, where the source and the target task are the same, distribution shift as measured by Wasserstein distance is sufficient to predict the loss bound. Furthermore, distributional robustness improves predictability (i.e., a low bound), which may come at the price of a performance decrease. For transfer learning, where the source and the target task are different, distributions cannot be compared directly. We therefore propose a simple approach that transforms the source distribution (and classifier) by changing the class prior, label, and feature spaces. This allows us to relate the loss of the downstream task (i.e., transferability) to that of the source task. Wasserstein distance again plays an important role in the bound. I will show empirical results using state-of-the-art pre-trained models and demonstrate how factors such as task relatedness, pretraining method, and model architecture affect transferability.
Fri 2:00 p.m. - 2:30 p.m. | Kamalika Chaudhuri (Keynote)
Bio: Kamalika Chaudhuri is a Professor in the Department of Computer Science and Engineering at the University of California San Diego, and a Research Scientist in the FAIR team at Meta AI. Her research interests are in the foundations of trustworthy machine learning, which includes problems such as learning from sensitive data while preserving privacy, learning under sampling bias, and learning in the presence of an adversary. She is particularly interested in privacy-preserving machine learning, which addresses how to learn good models and predictors from sensitive data while preserving the privacy of individuals.
Title: Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning
Abstract: Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintentionally memorize specific parts of individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as déjà vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that déjà vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of déjà vu memorization reveals previously unknown privacy risks in SSL models and suggests potential practical mitigation strategies.
Fri 2:30 p.m. - 4:00 p.m. | Posters
Fri 4:00 p.m. - 4:30 p.m. | Zhangyang "Atlas" Wang (Keynote)
Bio: Atlas Wang (https://vita-group.github.io/) teaches and researches at UT Austin ECE (primary), CS, and Oden CSEM. He usually declares his research interest as machine learning, but is never too sure what that means concretely. He has won some awards, but is mainly proud of just three things: (1) he has done some (hopefully) thought-provoking and practically meaningful work on sparsity, from inverse problems to deep learning; his recent favorites include "essential sparsity", "junk DNA hypothesis", and "heavy-hitter oracle"; (2) he co-founded the Conference on Parsimony and Learning (CPAL), known to its community as the new "conference for sparsity", and serves as its inaugural program chair; (3) he is fortunate enough to work with a sizable group of world-class students, who are all smarter than he is. He has graduated 10 Ph.D. students who are well placed, including two new assistant professors, and his students have altogether won seven PhD fellowships besides many other honors.
Title: On the Complicated Romance between Sparsity and Robustness
Abstract: Prior work has observed that appropriate sparsity (or pruning) can improve the empirical robustness of deep neural networks (NNs). In this talk, I will introduce our recent findings extending this line of research. We first demonstrate that sparsity can be injected into adversarial training, either statically or dynamically, to reduce the robust generalization gap, besides significantly saving training and inference FLOPs. We then show that pruning can also improve certified robustness for ReLU-based NNs at scale, under the complete verification setting. Lastly, we theoretically characterize the complicated relationship between neural network sparsity and generalization. It is revealed that, as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization. Meanwhile, there also exists a large pruning fraction such that, while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing.
Fri 4:30 p.m. - 5:00 p.m. | Stacy Fay Hobson (Keynote)
Bio: Dr. Stacy Hobson is a Research Scientist at IBM Research and is the Director of the Responsible and Inclusive Technologies research group. Her group's research focuses on anticipating and understanding the impacts of technology on society and promoting tech practices that minimize harms, biases and other negative outcomes. Stacy's research has spanned multiple areas including topics such as addressing social inequities through technology, AI transparency, and data sharing platforms for governmental crisis management. Stacy has authored more than 20 peer-reviewed publications and holds 15 US patents. Stacy earned a Bachelor of Science degree in Computer Science from South Carolina State University, a Master of Science degree in Computer Science from Duke University and a PhD in Neuroscience and Cognitive Science from the University of Maryland at College Park.
Title: Addressing technology-mediated social harms
Abstract: Many technology efforts focus almost exclusively on the expected benefits that the resulting innovations may provide. Although there has been increased attention in past years on topics such as ethics, privacy, fairness and trust in AI, there still exists a wide gap between the aims of responsible innovation and what is occurring most often in practice. In this talk, I highlight the critical importance of proactively considering technology use in society, with focused attention on societal stakeholders, social impacts and socio-historical context, as the necessary foundation to anticipate and mitigate tech harms.
Fri 5:00 p.m. - 5:10 p.m. | Visual Adversarial Examples Jailbreak Aligned Large Language Models (Oral)
The growing interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) like Flamingo and GPT-4, is steering a convergence of vision and language foundation models. Yet, risks associated with this integration are largely unexamined. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the additional visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. To our surprise, we discover that a single visual adversarial example can universally jailbreak an aligned model, inducing it to heed a wide range of harmful instructions and generate harmful content far beyond merely imitating the derogatory corpus used to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. More broadly, our findings connect the long-studied fundamental adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend towards multimodality in frontier foundation models.
Xiangyu Qi · Kaixuan Huang · Ashwinee Panda · Mengdi Wang · Prateek Mittal
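At a high level, the attack recipe described in the abstract can be sketched as bounded optimization of an image perturbation against a corpus of attacker-chosen target outputs. The helper `vlm_nll` (negative log-likelihood of the target text given an image and prompt) is hypothetical, and this sketch is not the authors' implementation.
```python
# PGD-style optimization of a visual adversarial example against a vision-language model.
# `vlm_nll` is a hypothetical, model-specific helper; hyperparameters are illustrative.
import torch

def visual_jailbreak(vlm_nll, image, prompts, targets, eps=16/255, alpha=1/255, steps=500):
    """Optimize an l_inf-bounded image perturbation to make target outputs likely."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = sum(vlm_nll(image + delta, p, t) for p, t in zip(prompts, targets))
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()                       # lower the NLL of the target corpus
            delta.clamp_(-eps, eps)                            # keep the perturbation small
            delta.copy_((image + delta).clamp(0, 1) - image)   # keep pixel values valid
    return (image + delta).detach()
```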
Fri 5:10 p.m. - 5:20 p.m. | Learning Shared Safety Constraints from Multi-task Demonstrations (Oral)
Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks.
Konwoo Kim · Gokul Swamy · Zuxin Liu · Ding Zhao · Sanjiban Choudhury · Steven Wu
Fri 5:20 p.m. - 5:25 p.m. | MLSMM: Machine Learning Security Maturity Model (Bluesky Oral)
Assessing the maturity of security practices during the development of Machine Learning (ML) based software components has not gotten as much attention as it has in traditional software development. In this Blue Sky idea paper, we propose an initial Machine Learning Security Maturity Model (MLSMM), which organizes security practices along the ML development lifecycle and, for each, establishes three levels of maturity. We envision MLSMM as a step towards closer collaboration between industry and academia.
Felix Jedrzejewski · Davide Fucci · Oleksandr Adamov
Fri 5:25 p.m. - 5:30 p.m. | Deceptive Alignment Monitoring (Bluesky Oral)
As the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves. The threat that a model might behave in a seemingly reasonable manner, while secretly and subtly modifying its behavior for ulterior reasons, is often referred to as deceptive alignment in the AI Safety & Alignment communities. Consequently, we call this new direction Deceptive Alignment Monitoring. In this work, we identify emerging directions in diverse machine learning subfields that we believe will become increasingly important and intertwined in the near future for deceptive alignment monitoring, and we argue that advances in these fields present both long-term challenges and new research opportunities. We conclude by advocating for greater involvement by the adversarial machine learning community in these emerging directions.
Andres Carranza · Dhruv Pai · Rylan Schaeffer · Arnuv Tandon · Sanmi Koyejo
Fri 5:30 p.m. - 6:00 p.m. | Aditi Raghunathan (Keynote)
Bio: Aditi Raghunathan is an Assistant Professor at Carnegie Mellon University. She is interested in building robust ML systems with guarantees for trustworthy real-world deployment. Previously, she was a postdoctoral researcher at Berkeley AI Research, and received her PhD from Stanford University in 2021. Her research has been recognized by the Schmidt AI2050 Early Career Fellowship, the Arthur Samuel Best Thesis Award at Stanford, a Google PhD fellowship in machine learning, and an Open Philanthropy AI fellowship.
Title: Beyond Adversaries: Robustness to Distribution Shifts in the Wild
Abstract: Machine learning systems often fail catastrophically under the presence of distribution shift -- when the test distribution differs in some systematic way from the training distribution. Such shifts can sometimes be captured via an adversarial threat model, but in many cases, there is no convenient threat model that appropriately captures the "real-world" distribution shift. In this talk, we will first discuss how to measure the robustness to such distribution shifts despite the apparent lack of structure. Next, we discuss how to improve robustness to such shifts. The past few years have seen the rise of large models trained on broad data at scale that can be adapted to several downstream tasks (e.g. BERT, GPT, DALL-E). Via theory and experiments, we will see how such models open up new avenues but also require new techniques for improving robustness.
Fri 6:00 p.m. - 6:30 p.m. | Zico Kolter (Keynote)
Bio: Zico Kolter is an Associate Professor in the Computer Science Department at Carnegie Mellon University, and also serves as chief scientist of AI research for the Bosch Center for Artificial Intelligence. His work spans the intersection of machine learning and optimization, with a large focus on developing more robust and rigorous methods in deep learning. In addition, he has worked in a number of application areas, highlighted by work on sustainability and smart energy systems. He is a recipient of the DARPA Young Faculty Award, a Sloan Fellowship, and best paper awards at NeurIPS, ICML (honorable mention), AISTATS (test of time), IJCAI, KDD, and PESGM.
Title: Adversarial Attacks on Aligned LLMs
Abstract: In this talk, I'll discuss our recent work on generating adversarial attacks against public LLM tools, such as ChatGPT and Bard. Using combined gradient-based and greedy search on open-source LLMs, we find adversarial suffix strings that cause these models to ignore their "safety alignment" and answer potentially harmful user queries. And most surprisingly, we find that these adversarial prompts transfer amazingly well to closed-source, publicly-available models. I'll discuss the methodology and results of this attack, as well as what this may mean for the future of LLM robustness.
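A highly simplified sketch of the combined gradient-based and greedy search described in the talk is given below: token-embedding gradients propose candidate substitutions for positions in the adversarial suffix, and a greedy step keeps the substitution that most lowers the target loss. The callables `loss_fn` and `grad_fn` are placeholders for model-specific code, not a real library API.
```python
# One step of a gradient-guided greedy suffix search (illustrative; not the authors' code).
import torch

def greedy_suffix_step(suffix_ids, loss_fn, grad_fn, top_k=256, n_trials=64):
    """One search step: grad_fn ranks token substitutions, loss_fn evaluates candidates."""
    grad = grad_fn(suffix_ids)                        # [suffix_len, vocab_size] substitution scores
    candidates = (-grad).topk(top_k, dim=-1).indices  # most promising replacements per position
    best, best_loss = suffix_ids, loss_fn(suffix_ids)
    for _ in range(n_trials):
        pos = int(torch.randint(suffix_ids.numel(), (1,)))
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, int(torch.randint(top_k, (1,)))]
        loss = loss_fn(trial)
        if loss < best_loss:                          # keep the swap that best lowers the target loss
            best, best_loss = trial, loss
    return best, best_loss
```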
Fri 6:30 p.m. - 6:35 p.m. | How Can Neuroscience Help Us Build More Robust Deep Neural Networks? (Bluesky Oral)
Although Deep Neural Networks (DNNs) are often compared to biological visual systems, they are far less robust to natural and adversarial examples. In contrast, biological visual systems can reliably recognize different objects under a variety of settings. While recent innovations have closed the performance gap between biological and artificial vision systems to some extent, there are still many practical differences between the two. In this Blue Sky Ideas presentation, we will identify some key differences between standard DNNs and biological perceptual systems that may contribute to this lack of robustness. We will then present recent work on biologically-plausible, robust DNNs that are derived from and can be easily implemented on physical systems/neuromorphic hardware.
Sayanton Dibbo · Siddharth Mansingh · Jocelyn Rego · Garrett T Kenyon · Juston Moore · Michael Teti
Fri 6:35 p.m. - 6:40 p.m. | The Future of Cyber Systems: Human-AI Reinforcement Learning with Adversarial Robustness (Bluesky Oral)
Integrating adversarial machine learning (AML) with cyber data representations that support reinforcement learning would unlock human-AI systems with the capacity to dynamically defend against novel attacks, robustly, at machine speed, and with human intelligence. All machine learning (ML) has an underpinning need for robustness to natural errors and malicious tampering. However, unlike many consumer/commercial models, all ML systems built for cyber will be operating in an inherently adversarial environment, with skilled adversaries taking advantage of any flaw. This paper outlines the research challenges, integration points, and programmatic importance of such a system, while highlighting the social and scientific benefits of pursuing this ambitious program.
Nicole Nichols
Fri 6:40 p.m. - 6:45 p.m. | Announcement of AdvML Rising Star Award
Fri 6:45 p.m. - 7:00 p.m. | Tianlong Chen (Award Presentation)
Talk: How Does an Appropriate Sparsity Benefit Robustness?
Fri 7:00 p.m. - 7:15 p.m. | Vikash Sehwag (Award Presentation)
Talk: Uncovering and Mitigating Privacy Leakage in Large-scale Generative Models
Fri 7:15 p.m. - 8:00 p.m. | Posters
Fri 8:00 p.m. | Closing
Closing remarks.
The Challenge of Differentially Private Screening Rules (Poster)
Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data science. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models. However, despite the increasing need for privacy-preserving models for data analysis, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we discover difficulties in the task of making a useful private screening rule due to the amount of noise added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself and not the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method is an open problem in the differential privacy literature.
Amol Khanna · Fred Lu · Edward Raff
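For context, the sketch below shows a standard (non-private) sequential strong screening rule for the Lasso, plus the naive Laplace-noise perturbation one might add to its correlation statistic. The paper's point is precisely that such naive privatization is hard to make useful, so this is illustrative only and the rule's exact form is an assumption on my part.
```python
# Non-private strong screening rule for the Lasso, with an optional (naive) noisy variant.
import numpy as np

def strong_rule_screen(X, y, beta_prev, lam, lam_prev, noise_scale=0.0, rng=None):
    """Boolean mask of features to KEEP for the Lasso at regularization level lam."""
    rng = rng or np.random.default_rng()
    corr = np.abs(X.T @ (y - X @ beta_prev))       # correlation of each feature with the residual
    if noise_scale > 0:                            # naive Laplace perturbation (illustrative only)
        corr = corr + rng.laplace(scale=noise_scale, size=corr.shape)
    return corr >= 2 * lam - lam_prev              # strong rule: discard features strictly below
```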
Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance (Poster)
The reliability of post-training quantization (PTQ) methods in the face of extreme cases such as distribution shift and data noise remains largely unexplored, despite the popularity of PTQ as a method for compressing deep neural networks (DNNs) without altering their original architecture or training procedures. This paper conducts an investigation on commonly-used PTQ methods, addressing research questions pertaining to the impact of calibration set distribution variations, calibration paradigm selection, and data augmentation or sampling strategies on the reliability of PTQ. Through a systematic evaluation process encompassing various tasks and commonly-used PTQ paradigms, it is evident that the majority of existing PTQ methods lack the necessary reliability for worst-case group performance, underscoring the imperative for more robust approaches.
Zhihang Yuan · Jiawei Liu · Jiaxiang Wu · Dawei Yang · Qiang Wu · Guangyu Sun · Wenyu Liu · Xinggang Wang · Bingzhe Wu
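To make concrete what the calibration step under study does, here is a generic min-max post-training quantization sketch. It is not tied to any specific PTQ method evaluated in the paper; the calibration set is the input whose distributional sensitivity the benchmark probes.
```python
# Generic symmetric min-max PTQ calibration (illustrative only).
import numpy as np

def calibrate_and_quantize(calibration_activations, num_bits=8):
    """Estimate a dynamic range from calibration data, then simulate uniform quantization."""
    max_abs = float(np.max(np.abs(calibration_activations)))   # range estimate from calibration set
    scale = max_abs / (2 ** (num_bits - 1) - 1)

    def quantize(x):
        q = np.clip(np.round(x / scale), -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1)
        return q * scale                                        # dequantized (simulated) values

    return quantize
```
A shifted or noisy calibration set yields a different scale, which is one source of the worst-case reliability issues the benchmark examines.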
Benchmarking Adversarial Robustness of Compressed Deep Learning Models (Poster)
The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource-constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bridges this gap, seeking to understand the effect of adversarial inputs crafted for base models on their pruned versions. To examine this relationship, we have developed a comprehensive benchmark across diverse adversarial attacks and popular DNN models. We uniquely focus on models not previously exposed to adversarial training and apply pruning schemes optimized for accuracy and performance. Our findings reveal that while the benefits of pruning -- enhanced generalizability, compression, and faster inference times -- are preserved, adversarial robustness remains comparable to the base model. This suggests that model compression, while offering its unique advantages, does not undermine adversarial robustness.
Brijesh Vora · Kartik Patwari · Syed Mahbub Hafiz · Zubair Shafiq · Chen-Nee Chuah
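The central evaluation question of the abstract (whether adversarial examples crafted against a base model still fool its pruned version) can be phrased as a short loop; `attack` is any attack callable supplied by the caller, and all names are illustrative rather than the authors' benchmark code.
```python
# Transfer evaluation: craft adversarial examples on the base model, score the pruned model.
import torch

@torch.no_grad()
def transfer_robust_accuracy(base_model, pruned_model, attack, loader):
    """Accuracy of the pruned model on adversarial examples crafted against the base model."""
    correct, total = 0, 0
    for x, y in loader:
        with torch.enable_grad():
            x_adv = attack(base_model, x, y)        # attack sees only the base model
        pred = pruned_model(x_adv).argmax(dim=-1)   # evaluate transfer to the pruned model
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```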
Robustness through Data Augmentation Loss Consistency (Poster)
While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation.
Tianjian Huang · Shaunak Halbe · Chinnadhurai Sankar · Pooyan Amini · Satwik Kottur · Alborz Geramifard · Meisam Razaviyayn · Ahmad Beirami
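A hedged sketch of what loss-level consistency regularization in the spirit of DAIR might look like: the penalty compares per-example losses of the original and augmented samples rather than intermediate features, so it remains applicable when augmentation changes the label. The specific square-root penalty shown is an assumption for illustration, not necessarily the authors' exact objective.
```python
# Loss-level consistency regularization sketch (illustrative penalty form).
import torch
import torch.nn.functional as F

def dair_style_loss(model, x, y, x_aug, y_aug, lam=1.0):
    """ERM on both views plus a loss-level consistency penalty."""
    loss_clean = F.cross_entropy(model(x), y, reduction="none")
    loss_aug = F.cross_entropy(model(x_aug), y_aug, reduction="none")
    consistency = (loss_clean.sqrt() - loss_aug.sqrt()).pow(2).mean()   # compares losses, not features
    return 0.5 * (loss_clean.mean() + loss_aug.mean()) + lam * consistency
```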
Expressivity of Graph Neural Networks Through the Lens of Adversarial Robustness (Poster)
We perform the first adversarial robustness study into Graph Neural Networks (GNNs) that are provably more powerful than traditional Message Passing Neural Networks (MPNNs). In particular, we use adversarial robustness as a tool to uncover a significant gap between their theoretically possible and empirically achieved expressive power. To do so, we focus on the ability of GNNs to count specific subgraph patterns, which is an established measure of expressivity, and extend the concept of adversarial robustness to this task. Based on this, we develop efficient adversarial attacks for subgraph counting and show that more powerful GNNs fail to generalize even to small perturbations to the graph's structure. Expanding on this, we show that such architectures also fail to count substructures on out-of-distribution graphs.
Francesco Campi · Lukas Gosch · Tom Wollschläger · Yan Scholten · Stephan Günnemann
Provably Robust Cost-Sensitive Learning via Randomized Smoothing (Poster)
We focus on learning adversarially robust classifiers under cost-sensitive scenarios, where the potential harm of different classwise adversarial transformations is encoded in a cost matrix. Existing methods are either empirical, and thus cannot certify robustness, or suffer from inherent scalability issues. In this work, we study whether randomized smoothing, a scalable robustness certification framework, can be leveraged to certify cost-sensitive robustness. We first show how to extend the vanilla certification pipeline to provide rigorous guarantees for cost-sensitive robustness. However, when adapting the standard randomized smoothing method to train for cost-sensitive robustness, we observe that the naive reweighting scheme does not achieve a desirable performance due to the indirect optimization of the base classifier. Inspired by this observation, we propose a more direct training method with fine-grained certified radius optimization schemes designed for different data subgroups. Experiments on image benchmarks demonstrate that our method significantly improves certified cost-sensitive robustness without sacrificing overall accuracy.
Yuan Xin · Michael Backes · Xiao Zhang
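For background, the sketch below follows the vanilla randomized smoothing certification pipeline that the paper extends: estimate the smoothed classifier's top-class probability under Gaussian noise and convert a lower confidence bound into a certified l2 radius. A crude Hoeffding bound stands in for the tighter confidence bounds used in practice; the cost-sensitive extension changes which class confusions the certificate must rule out.
```python
# Monte-Carlo certification of a Gaussian-smoothed classifier (generic, illustrative).
import numpy as np
from scipy.stats import norm

def certify_radius(predict, num_classes, x, sigma=0.25, n_samples=1000, alpha=0.001, rng=None):
    """Certify the smoothed version of `predict` (a hard classifier x -> label) at input x."""
    rng = rng or np.random.default_rng()
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(n_samples):
        votes[predict(x + sigma * rng.standard_normal(x.shape))] += 1
    top = int(votes.argmax())
    # Crude Hoeffding lower confidence bound on the top-class probability.
    p_lower = votes[top] / n_samples - np.sqrt(np.log(1 / alpha) / (2 * n_samples))
    if p_lower <= 0.5:
        return top, 0.0                      # abstain: no certificate at this confidence
    return top, sigma * norm.ppf(p_lower)    # certified l2 radius around x
```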
Like Oil and Water: Group Robustness and Poisoning Defenses Don't Mix (Poster)
Group robustness has become a major concern in machine learning (ML) as conventional training paradigms were found to produce high error on minority groups. Without explicit group annotations, proposed solutions rely on heuristics that aim to identify and then amplify the minority samples during training. In our work, we first uncover a critical shortcoming of these heuristics: an inability to distinguish legitimate minority samples from poison samples in the training set. By amplifying poison samples as well, group robustness methods inadvertently boost the success rate of an adversary -- e.g., from 0% without amplification to over 97% with it. Moreover, scrutinizing recent poisoning defenses both in centralized and federated learning, we observe that they rely on similar heuristics to identify which samples should be eliminated as poisons. In consequence, minority samples are eliminated along with poisons, which damages group robustness -- e.g., from 55% without the removal of the minority samples to 41% with it. Finally, as they pursue opposing goals using similar heuristics, our attempts to conciliate group robustness and poisoning defenses come up short. We hope our work highlights how benchmark-driven ML scholarship can obscure the tensions between different metrics, potentially leading to harmful consequences.
Michael-Andrei Panaitescu-Liess · Yigitcan Kaya · Tudor Dumitras
Provable Instance Specific Robustness via Linear Constraints (Poster)
Deep Neural Networks (DNNs) trained for classification tasks are vulnerable to adversarial attacks, but not all classes are equally vulnerable, and adversarial training does not make all classes or groups equally robust either. For example, in classification tasks with long-tailed distributions, classes are asymmetrically affected during adversarial training, with lower robust accuracy for less frequent classes. In this regard, we propose a provable robustness method by leveraging the continuous piecewise-affine (CPA) nature of DNNs. Our method can impose linearity constraints on the decision boundary, as well as on the DNN CPA partition, without requiring any adversarial training. Using such constraints, we show that the margin between the decision boundary and minority classes can be increased in a provable manner. We also present qualitative and quantitative validation of our method for class-specific robustness.
Ahmed Imtiaz Humayun · Josue Casco-Rodriguez · Randall Balestriero · Richard Baraniuk
Adversarial Training in Continuous-Time Models and Irregularly Sampled Time-Series (Poster)
This study presents the first steps of exploring the effects of adversarial training on continuous-time models and irregularly sampled time series data. Historically, these models and sampling techniques have been largely neglected in adversarial learning research, leading to a significant gap in our understanding of their performance under adversarial conditions. To address this, we conducted an empirical study of adversarial training techniques applied to time-continuous model architectures and sampling methods. Our findings suggest that while standard continuous-time models tend to outperform their discrete counterparts (especially on irregularly sampled datasets), this performance advantage diminishes almost entirely when adversarial training is employed. This indicates that adversarial training may interfere with the time-continuous representation, effectively neutralizing the benefits typically associated with these models. We believe these insights will be critical in guiding further advancements in adversarial learning research for continuous-time models.
Alvin Li · Mathias Lechner · Alexander Amini · Daniela Rus
Few-shot Anomaly Detection via Personalization (Poster)
Even with plenty of normal samples, anomaly detection has been considered a challenging machine learning task due to its one-class nature, i.e., the lack of anomalous samples at training time. It is only recently that a few-shot regime of anomaly detection has become feasible in this regard, e.g., with help from large vision-language pre-trained models such as CLIP, despite its wide applicability. In this paper, we explore the potential of large text-to-image generative models in performing few-shot anomaly detection. Specifically, recent text-to-image models have shown an unprecedented ability to generalize from a few images to extract their common and unique concepts, and even to encode them into a textual token to "personalize" the model: so-called textual inversion. Here, we question whether this personalization is specific enough to discriminate the given images from their potential anomalies, which are often, e.g., open-ended, local, and hard to detect. We observe that standard textual inversion is not enough for detecting anomalies accurately, and thus we propose a simple yet effective regularization scheme to enhance its specificity, derived from the zero-shot transferability of CLIP. We also propose a self-tuning scheme to further optimize the performance of our detection pipeline, leveraging synthetic data generated from the personalized generative model. Our experiments show that the proposed inversion scheme achieves state-of-the-art results on a wide range of few-shot anomaly detection benchmarks.
Sangkyung Kwak · Jongheon Jeong · Hankook Lee · Woohyuck Kim · Jinwoo Shin
Rethinking Label Poisoning for GNNs: Pitfalls and Attacks (Poster)
Node labels for graphs are usually generated using an automated process or crowd-sourced from human users. This opens up avenues for malicious users to compromise the training labels, making it unwise to blindly rely on them. While robustness against noisy labels is an active area of research, there are only a handful of papers in the literature that address this for graph-based data, and the effects of adversarial label perturbations are even more sparsely studied. A recent work revealed that the entire literature on label poisoning for GNNs is plagued by serious evaluation pitfalls and showed that existing attacks become ineffective once these shortcomings are fixed. In this work, we introduce two new, simple yet effective attacks that are significantly stronger (up to $\sim8\%$) than the previous strongest attack. Our work demonstrates the need for more robust defense mechanisms, especially considering the \emph{transferability} of our attacks, where a strategy devised for one model can effectively contaminate numerous other models.
Vijay Lingam · Mohammad Sadegh Akhondzadeh · Aleksandar Bojchevski
Shrink & Cert: Bi-level Optimization for Certified Robustness (Poster)
In this paper, we advance the concept of shrinking weights to train certifiably robust models from the fresh perspective of gradient-based bi-level optimization. Lack of robustness against adversarial attacks remains a challenge in safety-critical applications. Many attempts in the literature only provide empirical verification of defenses against certain attacks and can be easily broken, while methods in other lines of work can only develop certified guarantees of model robustness in limited scenarios and are computationally expensive. We present a weight shrinkage formulation that is computationally inexpensive and can be solved as a simple first-order optimization problem. We show that a model trained with our method has lower Lipschitz bounds in each layer, which directly provides formal guarantees on certified robustness. We demonstrate that our approach, Shrink \& Cert (SaC), achieves provably robust networks which simultaneously give excellent standard and robust accuracy. We demonstrate the success of our approach on the CIFAR-10 and ImageNet datasets and compare it with existing robust training techniques. Code: \url{https://github.com/sagarverma/Shrink-and-Cert}
Kavya Gupta · Sagar Verma
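The layerwise Lipschitz quantity referenced in the abstract can be illustrated with the standard product-of-spectral-norms upper bound for feed-forward networks with 1-Lipschitz activations. This is the generic bound, not the authors' bi-level training procedure, and flattening convolution kernels into matrices is a crude approximation made here for brevity.
```python
# Generic (loose) Lipschitz upper bound: product of per-layer spectral norms.
import torch

def lipschitz_upper_bound(model):
    """Multiply the spectral norms of all weight tensors with at least 2 dimensions."""
    bound = 1.0
    for p in model.parameters():
        if p.dim() >= 2:                      # weight matrices and conv kernels
            w = p.flatten(start_dim=1)        # crude: treat a conv kernel as a matrix
            bound *= torch.linalg.matrix_norm(w, ord=2).item()
    return bound
```
Shrinking weights reduces each factor in this product, which is the intuition behind how lower per-layer Lipschitz bounds translate into certified robustness guarantees.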
Preventing Reward Hacking with Occupancy Measure Regularization (Poster)
Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent's state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment.
Cassidy Laidlaw · Shivam Singhal · Anca Dragan
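Schematically, the two regularizers contrasted in the abstract can be written as follows, with $J(\pi)$ the expected return, $D$ any divergence, $\lambda$ a regularization weight, and $\rho_\pi$ the discounted state-action occupancy measure. The notation is a paraphrase of the abstract, not the authors' exact objective.
```latex
% Action-distribution regularization compares policies state by state:
\max_\pi \; J(\pi) \;-\; \lambda \, \mathbb{E}_{s \sim \rho_\pi}\!\left[ D\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{safe}}(\cdot \mid s)\big) \right]
% Occupancy-measure regularization compares long-run state-action visitation:
\max_\pi \; J(\pi) \;-\; \lambda \, D\big(\rho_\pi \,\|\, \rho_{\pi_{\mathrm{safe}}}\big)
% where \rho_\pi(s,a) \propto \sum_t \gamma^t \Pr(s_t = s, a_t = a) under policy \pi.
```
The second form is sensitive to actions that rarely differ in distribution but drive the agent into very different states, which is the failure mode of the first form that the abstract highlights.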
Baselines for Identifying Watermarked Large Language Models (Poster)
We consider the emerging problem of identifying the presence of watermarking schemes in publicly hosted, closed-source large language models (LLMs). Rather than determine whether a given text is generated by a watermarked language model, we seek to answer the question of whether the model itself is watermarked. We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce token distributions that diverge qualitatively and identifiably from standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to the watermarking scenario.
Leonard Tang · Gavin Uberti · Tom Shlomi
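One of the simplest baselines consistent with the abstract is to compare the empirical output-token distribution of a suspect model against that of a reference unmarked model. The sketch below is a hypothetical illustration: the sampling interface, smoothing, and any threshold are all assumptions, not the authors' suite.
```python
# Compare empirical next-token distributions of a suspect and a reference model.
import numpy as np

def token_distribution(sample_next_token, prompts, vocab_size, samples_per_prompt=100):
    """Empirical next-token distribution of a model over a fixed prompt set."""
    counts = np.zeros(vocab_size)
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            counts[sample_next_token(prompt)] += 1     # sample_next_token returns a token id
    return (counts + 1) / (counts.sum() + vocab_size)  # Laplace smoothing

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

# A suspect model whose distribution diverges strongly and consistently from the
# unmarked reference (kl_divergence above some threshold) is flagged as likely watermarked.
```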
Why do universal adversarial attacks work on large language models?: Geometry might be the answer (Poster)
Transformer based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations. We believe this new geometric perspective on the underlying mechanism driving universal attacks could help us gain deeper insight into the internal workings and failure modes of LLMs, thus enabling their mitigation.
Varshini Subhash · Anna Bialas · Siddharth Swaroop · Weiwei Pan · Finale Doshi-Velez
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation (Poster)
We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks. Its primary goal is advancing the understanding and mitigation of adversarial attacks. FACADE aims to generate probabilistic distributions over circuits, which provide critical insights into their contribution to changes in the manifold properties of pseudo-classes, or high-dimensional modes in activation space, yielding a powerful tool for uncovering and combating adversarial attacks. Our approach seeks to improve model robustness and enhance scalable model oversight, and it demonstrates promising applications in real-world deployment settings.
Dhruv Pai · Andres Carranza · Rylan Schaeffer · Arnuv Tandon · Sanmi Koyejo
Robust Deep Learning via Layerwise Tilted Exponentials (Poster)
State-of-the-art techniques for enhancing robustness of deep networks mostly rely on empirical risk minimization. In this paper, we propose a complementary approach aimed at enhancing the signal-to-noise ratio at intermediate network layers, loosely motivated by the classical communication-theoretic model of signaling in a noisy channel. We seek to learn neuronal weights which are matched to the layer inputs by supplementing end-to-end costs with a tilted exponential (TEXP) objective function which depends on the activations at the layer outputs. We show that TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. TEXP inference is accomplished by replacing batch norm by a tilted softmax enforcing competition across neurons, which can be interpreted as computation of posterior probabilities for the signaling hypotheses represented by each neuron. We show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise, other common corruptions and mild adversarial perturbations, without requiring data augmentation. Further gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with adversarial training.
Bhagyashree Puranik · Ahmad Beirami · Yao Qin · Upamanyu Madhow
Teach GPT To Phish (Poster)
Quantifying privacy risks in large language models (LLMs) is an important research question. We take a step towards answering this question by defining a real-world threat model wherein an entity seeks to augment an LLM with private data they possess via fine-tuning. The entity also seeks to improve the quality of its LLM outputs over time by learning from human feedback. We propose a novel …
Ashwinee Panda · Zhengming Zhang · Yaoqing Yang · Prateek Mittal
Physics-oriented adversarial attacks on SAR image target recognition (Poster)
SAR target recognition algorithms based on deep neural networks are widely used in key tasks such as wartime reconnaissance and environmental monitoring, but the security of SAR systems is also vulnerable to adversarial examples. The imaging process for SAR images in the physical world is dissimilar to that of optical images, because SAR imaging is governed solely by imaging equations rather than the what-you-see-is-what-you-get principle. As a result, generating SAR adversarial examples in the physical world requires considering the changes in the SAR imaging equations that occur after deploying physical devices. Thus, this study proposes a physics-oriented adversarial attack on SAR image target recognition. The proposed algorithm distinguishes itself through two key features: (1) SAR-BagNet is utilized to identify the salient regions of SAR targets recognized by classifiers, allowing exact determination of the position and size of the adversarial scatterers and enhancing interpretability; (2) dynamic step-size optimization, based on the difference equation, continuously refines the electromagnetic, structural, and texture parameters of the adversarial scatterers, leading to higher search efficiency. In simulation experiments, the generated adversarial examples reduce the accuracy of the classifier on the simulated images from 100% to 14.4%, verifying the method proposed in this paper.
Jiahao Cui · wang Guo · Run Shao · tiandong Shi · Haifeng Li
Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage (Poster)
Machine learning models are increasingly utilized across impactful domains to predict individual outcomes. As such, many models provide algorithmic recourse to individuals who receive negative outcomes. However, recourse can be leveraged by adversaries to disclose private information. This work presents the first attempt at mitigating such attacks. We present two novel methods to generate differentially private recourse: Differentially Private Model ($\texttt{DPM}$) and Laplace Recourse ($\texttt{LR}$). Using logistic regression classifiers and real world and synthetic datasets, we find that $\texttt{DPM}$ and $\texttt{LR}$ perform well in reducing what an adversary can infer, especially at low $\texttt{FPR}$. When training dataset size is large enough, we find particular success in preventing privacy leakage while maintaining model and recourse accuracy with our novel $\texttt{LR}$ method.
Catherine Huang · Chelse Swoopes · Christina Xiao · Jiaqi Ma · Himabindu Lakkaraju
Theoretically Principled Trade-off for Stateful Defenses against Query-Based Black-Box Attacks (Poster)
Adversarial examples threaten the integrity of machine learning systems with alarming success rates even under constrained black-box conditions. Stateful defenses have emerged as an effective countermeasure, detecting potential attacks by maintaining a buffer of recent queries and detecting new queries that are too similar. However, these defenses fundamentally pose a trade-off between attack detection and false positive rates, and this trade-off is typically optimized by hand-picking feature extractors and similarity thresholds that empirically work well. There is little current understanding as to the formal limits of this trade-off and the exact properties of the feature extractors/underlying problem domain that influence it. This work aims to address this gap by offering a theoretical characterization of the trade-off between detection and false positive rates for stateful defenses. We provide upper bounds for detection rates of a general class of feature extractors and analyze the impact of this trade-off on the convergence of black-box attacks. We then support our theoretical findings with empirical evaluations across multiple datasets and stateful defenses.
Ashish Hooda · Neal Mangaokar · Ryan Feng · Kassem Fawaz · Somesh Jha · Atul Prakash
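The stateful defense being analyzed can be summarized in a few lines: keep a buffer of recent query embeddings and flag any new query whose nearest buffered neighbor falls within a similarity threshold. The feature extractor and threshold below are the hand-picked components whose detection/false-positive trade-off the paper characterizes; this is a generic sketch, not a specific published defense.
```python
# Generic stateful detector: similarity check against a buffer of recent queries.
import numpy as np
from collections import deque

class StatefulDetector:
    """Flags a query if it is too similar to any recently buffered query."""
    def __init__(self, feature_extractor, threshold=0.1, buffer_size=1000):
        self.extract = feature_extractor     # hand-picked feature extractor
        self.threshold = threshold           # hand-picked similarity threshold
        self.buffer = deque(maxlen=buffer_size)

    def is_attack_query(self, x):
        z = self.extract(x)
        flagged = any(np.linalg.norm(z - b) < self.threshold for b in self.buffer)
        self.buffer.append(z)
        return flagged
```
Raising the threshold catches more attack queries but also flags more benign near-duplicates, which is exactly the detection versus false-positive trade-off the paper bounds.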
-
|
DiffScene: Diffusion-Based Safety-Critical Scenario Generation for Autonomous Vehicles
(
Poster
)
>
link
The field of Autonomous Driving (AD) has witnessed significant progress in recent years. Among the various challenges faced, the safety evaluation of autonomous vehicles (AVs) stands out as a critical concern. Traditional evaluation methods are both costly and inefficient, often requiring extensive driving mileage in order to encounter rare safety-critical scenarios, which are distributed on the long tail of the complex real-world driving landscape. In this paper, we propose a unified approach, Diffusion-Based Safety-Critical Scenario Generation (DiffScene), to generate high-quality safety-critical scenarios which are both realistic and safety-critical for efficient AV evaluation. In particular, we propose a diffusion-based generation framework, leveraging the power of approximating the distribution of low-density spaces for diffusion models. We design several adversarial optimization objectives to guide the diffusion generation under predefined adversarial budgets. These objectives, such as safety-based objective, functionality-based objective, and constraint-based objective, ensure the generation of safety-critical scenarios while adhering to specific constraints. Extensive experimentation has been conducted to validate the efficacy of our approach. Compared with 6 SOTA baselines, DiffScene generates scenarios that are (1) more safety-critical under 3 metrics, (2) more realistic under 5 distance functions, and (3) more transferable to different AV algorithms. In addition, we demonstrate that training AV algorithms with scenarios generated by DiffScene leads to significantly higher performance in terms of the safety-critical metrics compared to baselines. These findings highlight the potential of DiffScene in addressing the challenges of AV safety evaluation, paving the way for more efficient and effective AV development. |
Chejian Xu · Ding Zhao · Alberto Sangiovanni Vincentelli · Bo Li 🔗 |
-
|
Improving Adversarial Training for Multiple Perturbations through the Lens of Uniform Stability
(
Poster
)
>
link
In adversarial training (AT), most existing works focus on AT with a single type of perturbation, such as the $\ell_\infty$ attacks. However, deep neural networks (DNNs) are vulnerable to different types of adversarial examples, necessitating the development of adversarial training for multiple perturbations (ATMP). Despite the benefits of ATMP, there exists a trade-off between different types of attacks. Furthermore, there is a lack of theoretical analyses of ATMP, which hinders its further development. To address these issues, we conduct a smoothness analysis of ATMP. Our analysis reveals that $\ell_1$, $\ell_2$, and $\ell_\infty$ adversaries contribute differently to the smoothness of the loss function in ATMP. Leveraging these smoothness properties, we investigate the improvement of ATMP through the lens of uniform stability. Through our research, we demonstrate that employing an adaptive smoothness-weighted learning rate leads to enhanced uniform stability bounds, thus improving adversarial training for multiple perturbations. We validate our findings through experiments on CIFAR-10 and CIFAR-100 datasets, where our approach achieves competitive performance against various mixtures of multiple perturbation attacks. This work contributes to a deeper understanding of ATMP and provides practical insights for improving the robustness of DNNs against diverse adversarial examples.
|
Jiancong Xiao · Zeyu Qin · Yanbo Fan · Baoyuan Wu · Jue Wang · Zhi-Quan Luo 🔗 |
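A hedged sketch of what a "smoothness-weighted learning rate" could look like in practice: scale the step size inversely with a per-attack smoothness constant. The constants below are illustrative placeholders; the paper derives how $\ell_1$, $\ell_2$, and $\ell_\infty$ adversaries affect smoothness, which these numbers do not reproduce.

# Illustrative placeholders only; not the paper's derived constants.
def smoothness_weighted_lr(base_lr, smoothness_by_attack, attack_type):
    beta = smoothness_by_attack[attack_type]    # larger beta => less smooth loss => smaller step
    return base_lr / beta

smoothness_by_attack = {"linf": 4.0, "l2": 2.0, "l1": 1.5}
for atk in ("l1", "l2", "linf"):
    print(atk, smoothness_weighted_lr(0.1, smoothness_by_attack, atk))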
-
|
A Theoretical Perspective on the Robustness of Feature Extractors
(
Poster
)
>
link
Recent theoretical work on robustness to adversarial examples has derived lower bounds on how robust any model can be when the distribution and adversarial constraints are specified. However, these bounds do not account for the specific models used in practice, such as neural networks. In this paper, we develop a methodology to analyze the fundamental limits on the robustness of fixed feature extractors, which in turn provides bounds on the robustness of classifiers trained on top of them. The tightness of these bounds relies on the effectiveness of the method used to find collisions between pairs of perturbed examples at deeper layers. For linear feature extractors, we provide closed-form expressions for collision finding while for piece-wise linear feature extractors, we propose a bespoke algorithm based on the iterative solution of a convex program that provably finds collisions. We utilize our bounds to identify structural features of classifiers that lead to a lack of robustness and provide insights into the effectiveness of different training methods at obtaining robust feature extractors. |
Arjun Nitin Bhagoji · Daniel Cullina · Ben Zhao 🔗 |
-
|
Characterizing the Optimal $0-1$ Loss for Multi-class Classification with a Test-time Attacker
(
Poster
)
>
link
Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a fixed data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on robust loss in the presence of a test-time attacker for *multi-class classifiers on any discrete dataset*. We provide a general framework for finding the optimal $0-1$ loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. The prohibitive cost of this formulation in practice leads us to formulate other variants of the attacker-classifier game that more efficiently determine the range of the optimal loss. Our evaluation provides, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.
|
Sophie Dai · Wenxin Ding · Arjun Nitin Bhagoji · Daniel Cullina · Ben Zhao · Heather Zheng · Prateek Mittal 🔗 |
-
|
RODEO: Robust Out-of-distribution Detection via Exposing Adaptive Outliers
(
Poster
)
>
link
Detecting out-of-distribution (OOD) input samples at the inference time is a key element in the trustworthy deployment of intelligent models. While there has been a tremendous improvement in various flavors of OOD detection in recent years, the detection performance under adversarial settings lags far behind the performance in the standard setting. In order to bridge this gap, we introduce RODEO in this paper, a data-centric approach that generates effective outliers for robust OOD detection. More specifically, we first show that targeting the classification of adversarially perturbed in- and out-of-distribution samples through outlier exposure (OE) could be an effective strategy for the mentioned purpose as long as the training outliers meet certain quality standards. We hypothesize that the outliers in the OE should possess several characteristics simultaneously to be effective in the adversarial training: diversity, and both conceptual differentiability and analogy to the inliers. These aspects seem to play a more critical role in the adversarial setup compared to the standard training. Next, we propose to take advantage of existing text-to-image generative models, conditioned on the inlier or normal samples, and text prompts that minimally edit the normal samples, and turn them into near-distribution outliers. This process helps to satisfy the three mentioned criteria for the generated outliers, and significantly boosts the performance of OE, especially in the adversarial setting. We demonstrate the general effectiveness of this approach in various related problems including novelty/anomaly detection, Open-Set Recognition (OSR), and OOD detection. We also make a comprehensive comparison of our method against other adaptive OE techniques under the adversarial setting to showcase its effectiveness. |
Hossein Mirzaei · Mohammad Jafari · Hamid Reza Dehbashi · Ali Ansari · Sepehr Ghobadi · Masoud Hadi · Arshia Soltani Moakhar · Mohammad Azizmalayeri · Mahdieh Soleymani Baghshah · Mohammad H Rohban 🔗 |
-
|
Rethinking Robust Contrastive Learning from the Adversarial Perspective
(
Poster
)
>
link
To advance the understanding of robust deep learning, we delve into the effects of adversarial training on self-supervised and supervised contrastive learning, alongside supervised learning. Our analysis uncovers significant disparities between adversarial and clean representations in standard-trained networks, across various learning algorithms. Remarkably, adversarial training mitigates these disparities and fosters the convergence of representations toward a universal set, regardless of the learning scheme used. Additionally, we observe that increasing the similarity between adversarial and clean representations, particularly near the end of the network, enhances network robustness. These findings offer valuable insights for designing and training effective and robust deep learning networks. |
Fatemeh Ghofrani · Mehdi Yaghouti · Pooyan Jamshidi 🔗 |
-
|
TMI! Finetuned Models Spill Secrets from Pretraining
(
Poster
)
>
link
Transfer learning has become an increasingly popular technique in machine learning as a way to leverage a pretrained model trained for related tasks. This paradigm has been especially popular for \emph{privacy preserving machine learning}, where the pretrained model is considered public, and only the data for finetuning is considered sensitive. However, there are reasons to believe that the data used for pretraining is still sensitive. In this work we study privacy leakage via membership-inference attacks, and we propose a new threat model where the adversary only has access to the finetuned model and would like to infer the membership of the pretraining data. To realize this threat model, we implement a novel metaclassifier-based attack, TMI. We evaluate TMI on both vision and natural language tasks across multiple transfer learning settings, including finetuning with differential privacy. Through our evaluation, we find that TMI can successfully infer membership of pretraining examples using query access to the finetuned model. |
John Abascal · Stanley Wu · Alina Oprea · Jonathan Ullman 🔗 |
-
|
A First Order Meta Stackelberg Method for Robust Federated Learning
(
Poster
)
>
link
Previous research has shown that federated learning (FL) systems are exposed to an array of security risks. Despite the proposal of several defensive strategies, they tend to be non-adaptive and specific to certain types of attacks, rendering them ineffective against unpredictable or adaptive threats. This work models adversarial federated learning as a Bayesian Stackelberg Markov game (BSMG) to capture the defender's incomplete information of various attack types. We propose meta-Stackelberg learning (meta-SL), a provably efficient meta-learning algorithm, to solve the equilibrium strategy in BSMG, leading to an adaptable FL defense. We demonstrate that meta-SL converges to the first-order $\varepsilon$-equilibrium point in $O(\varepsilon^{-2})$ gradient iterations, with $O(\varepsilon^{-4})$ samples needed per iteration, matching the state of the art. Empirical evidence indicates that our meta-Stackelberg framework performs exceptionally well against potent model poisoning and backdoor attacks of an uncertain nature.
|
Yunian Pan · Tao Li · Henger Li · Tianyi Xu · Quanyan Zhu · Zizhan Zheng 🔗 |
-
|
Backdoor Attacks for In-Context Learning with Language Models
(
Poster
)
>
link
Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally, we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone. |
Nikhil Kandpal · Matthew Jagielski · Florian Tramer · Nicholas Carlini 🔗 |
-
|
R-LPIPS: An Adversarially Robust Perceptual Similarity Metric
(
Poster
)
>
link
Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. |
Sara Ghazanfari · Siddharth Garg · Prashanth Krishnamurthy · Farshad Khorrami · Alexandre Araujo 🔗 |
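A sketch of the general recipe behind a robust perceptual metric as described above: compute an LPIPS-style feature distance, but extract the deep features with an adversarially trained backbone. The backbone choice, the layers used, and the absence of learned per-channel weights are simplifications for illustration; in practice one would load adversarially trained weights.

import torch
import torchvision.models as models

robust_backbone = models.resnet18(weights=None)   # load adversarially trained weights in practice
robust_backbone.eval()

def features(x, layers=("layer1", "layer2", "layer3")):
    feats, out = [], x
    for name, module in robust_backbone.named_children():
        out = module(out)
        if name in layers:
            feats.append(out)
        if name == "layer3":
            break
    return feats

def perceptual_distance(x1, x2):
    with torch.no_grad():
        d = 0.0
        for f1, f2 in zip(features(x1), features(x2)):
            f1 = f1 / (f1.norm(dim=1, keepdim=True) + 1e-8)   # channel-wise unit-normalization
            f2 = f2 / (f2.norm(dim=1, keepdim=True) + 1e-8)
            d = d + ((f1 - f2) ** 2).mean()
        return d

x1, x2 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(perceptual_distance(x1, x2))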
-
|
Risk-Averse Predictions on Unseen Domains via Neural Style Smoothing
(
Poster
)
>
link
Achieving high accuracy on data from domains unseen during training is a fundamental challenge in machine learning. While state-of-the-art neural networks have achieved impressive performance on various tasks, their predictions are biased towards domain-dependent information (e.g., image styles) rather than domain-invariant information (e.g., image content). This makes them unreliable for deployment in risk-sensitive settings such as autonomous driving. In this work, we propose a novel inference procedure, Test-Time Neural Style Smoothing (TT-NSS), that produces risk-averse predictions using a ``style smoothed'' version of a classifier. Specifically, the style smoothed classifier classifies a test image as the most probable class predicted by the original classifier on random re-stylizations of the test image. TT-NSS uses a neural style transfer module to stylize the test image on the fly, requires black-box access to the classifier, and crucially, abstains when predictions of the original classifier on the stylized images lack consensus. We further propose a neural style smoothing-based training procedure that improves the prediction consistency and the performance of the style-smoothed classifier on non-abstained samples. Our experiments on the PACS dataset and its variations, in both single- and multiple-domain settings, highlight the effectiveness of our methods at producing risk-averse predictions on unseen domains. |
Akshay Mehra · Yunbei Zhang · Bhavya Kailkhura · Jihun Hamm 🔗 |
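A minimal sketch of the test-time style-smoothing procedure described above: classify random re-stylizations of the test image, take the majority vote, and abstain when the vote is not decisive. The `stylize` function is a stand-in for a neural style transfer module, and the consensus threshold is an illustrative choice.

import torch

def style_smoothed_predict(classifier, stylize, x, n_samples=32, consensus=0.6):
    votes = []
    with torch.no_grad():
        for _ in range(n_samples):
            x_styled = stylize(x)                          # random re-stylization of the input
            votes.append(classifier(x_styled).argmax(dim=1))
    votes = torch.cat(votes)
    top_class = votes.mode().values.item()
    agreement = (votes == top_class).float().mean().item()
    return top_class if agreement >= consensus else None   # None => abstain

# toy usage with stand-ins for the classifier and the style module
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
stylize = lambda x: x + 0.1 * torch.randn_like(x)          # placeholder "style" perturbation
x = torch.rand(1, 3, 32, 32)
print(style_smoothed_predict(classifier, stylize, x))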
-
|
A Simple and Yet Fairly Effective Defense for Graph Neural Networks
(
Poster
)
>
link
Graph neural networks (GNNs) have become the standard approach for performing machine learning on graphs. However, concerns have been raised regarding their vulnerability to small adversarial perturbations. Existing defense methods suffer from high time complexity and can negatively impact the model's performance on clean graphs. In this paper, we propose NoisyGCN, a defense method that injects noise into the GCN architecture. We derive a mathematical upper bound linking GCN's robustness to noise injection, establishing our method's effectiveness. Through empirical evaluations on the node classification task, we demonstrate superior or comparable performance to existing methods while minimizing the added time complexity. |
Sofiane ENNADIR · Yassine Abbahaddou · Michalis Vazirgiannis · Henrik Boström 🔗 |
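A hedged sketch of the noise-injection mechanism the abstract describes: add Gaussian noise to the hidden representations of a GCN layer. The layer placement, noise scale, and the choice to inject noise only during training are assumptions for illustration, not necessarily NoisyGCN's exact design.

import torch
import torch.nn as nn

class NoisyGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, noise_std=0.1):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.noise_std = noise_std

    def forward(self, x, adj_norm):
        h = adj_norm @ self.lin(x)                              # standard GCN propagation: \hat{A} X W
        if self.training and self.noise_std > 0:
            h = h + self.noise_std * torch.randn_like(h)        # noise injection for robustness
        return torch.relu(h)

# toy usage: 4 nodes, 8 input features, row-normalized adjacency with self-loops
x = torch.rand(4, 8)
adj = torch.eye(4) + torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
adj_norm = adj / adj.sum(dim=1, keepdim=True)
layer = NoisyGCNLayer(8, 16)
print(layer(x, adj_norm).shape)    # torch.Size([4, 16])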
-
|
Incentivizing Honesty among Competitors in Collaborative Learning
(
Poster
)
>
link
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, thus preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning. |
Florian Dorner · Nikola Konstantinov · Georgi Pashaliev · Martin Vechev 🔗 |
-
|
Towards Effective Data Poisoning for Imbalanced Classification
(
Poster
)
>
link
Targeted Clean-label Data Poisoning Attacks (TCDPA) aim to manipulate training samples in a label-consistent manner to gain malicious control over targeted samples' output during deployment. A prominent class of TCDPA methods, gradient-matching based data-poisoning methods, utilizes a small subset of training class samples to match the poisoned gradient of a target sample. However, their effectiveness is limited when attacking imbalanced datasets because of gradient mismatch caused by training-time data-balancing techniques such as Re-weighting and Re-sampling. In this paper, we propose two modifications that eliminate this gradient mismatch and thereby enhance the efficacy of gradient-matching-based TCDPA on imbalanced datasets. Our methods achieve notable improvements of up to 32% (Re-sampling) and 51% (Re-weighting) in terms of Attack Effect Success Rate on MNIST and CIFAR10. |
Snigdha Sushil Mishra · Hao He · Hao Wang 🔗 |
-
|
Black Box Adversarial Prompting for Foundation Models
(
Poster
)
>
link
Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or generating high perplexity text. |
Natalie Maus · Patrick Chao · Eric Wong · Jacob Gardner 🔗 |
-
|
Exposing the Fake: Effective Diffusion-Generated Images Detection
(
Poster
)
>
link
Image synthesis has seen significant advancements with the advent of diffusion-based generative models like Denoising Diffusion Probabilistic Models (DDPM) and text-to-image diffusion models. Despite their efficacy, there is a dearth of research dedicated to detecting diffusion-generated images, which could pose potential security and privacy risks. This paper addresses this gap by proposing a novel detection method called Stepwise Error for Diffusion-generated Image Detection (SeDID). Comprising statistical-based SeDID and neural network-based SeDID, SeDID exploits the unique attributes of diffusion models, namely deterministic reverse and deterministic denoising computation errors. Our evaluations demonstrate SeDID's superior performance over existing methods when applied to diffusion models. Thus, our work makes a pivotal contribution to distinguishing diffusion model-generated images, marking a significant step in the domain of artificial intelligence security. |
RuiPeng Ma · Jinhao Duan · Fei Kong · Xiaoshuang Shi · Kaidi Xu 🔗 |
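A hedged sketch of the statistical flavor of the detection idea described above: diffuse an image to an intermediate timestep, denoise it back with the diffusion model, and use the reconstruction error as a detection score, on the intuition that generated images are reconstructed more faithfully than real ones. `diffuse` and `denoise` are stand-ins for a real DDPM's forward and reverse computations; the timestep and threshold are illustrative.

import torch

def detection_score(x, diffuse, denoise, t=250):
    with torch.no_grad():
        x_t = diffuse(x, t)               # forward (noising) process up to step t
        x_hat = denoise(x_t, t)           # deterministic reverse/denoising back to step 0
        return ((x - x_hat) ** 2).mean().item()   # small error => more likely diffusion-generated

def is_diffusion_generated(x, diffuse, denoise, threshold=0.01):
    return detection_score(x, diffuse, denoise) < threshold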
-
|
AdversNLP: A Practical Guide to Assessing NLP Robustness Against Text Adversarial Attacks
(
Poster
)
>
link
The emergence of powerful language models in natural language processing (NLP) has sparked a wave of excitement for their potential to revolutionize decision-making. However, this excitement should be tempered by their vulnerability to adversarial attacks, which are carefully perturbed inputs able to fool the model into inaccurate decisions. In this paper, we present AdversNLP, a practical framework to assess the robustness of NLP applications against text-based adversaries. Our framework combines and extends upon the technical capabilities of established NLP adversarial attacking tools (i.e., TextAttack) and tailors an audit guide to navigate the landscape of threats to NLP applications. AdversNLP illustrates best practices and vulnerabilities through customized attack recipes, and presents evaluation metrics in the form of key performance indicators (KPIs). Our study demonstrates the severity of the threat posed by adversarial attacks and the need for more initiatives bridging the gap between research contributions and industrial applications. |
Othmane BELMOUKADAM 🔗 |
-
|
Proximal Compositional Optimization for Distributionally Robust Learning
(
Poster
)
>
link
Recently, compositional optimization (CO) has gained popularity because of its applications in distributionally robust optimization (DRO) and many other machine learning problems. Often (non-smooth) regularization terms are added to an objective to impose some structure and/or improve the generalization performance of the learned model. However, when it comes to CO, there is a lack of efficient algorithms that can solve regularized CO problems. Moreover, current state-of-the-art methods to solve such problems rely on the computation of large batch gradients (depending on the solution accuracy), which is not feasible for most practical settings. To address these challenges, in this work, we consider a certain regularized version of the CO problem that often arises in DRO formulations and develop a proximal algorithm for solving the problem. We perform a Moreau envelope-based analysis and establish that, without the need to compute large batch gradients, our proposed algorithm achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity, which matches the vanilla SGD guarantees for solving non-CO problems. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.
|
Prashant Khanduri · Chengyin Li · RAFI IBN SULTAN · Yao Qiang · Joerg Kliewer · Dongxiao Zhu 🔗 |
-
|
PIAT: Parameter Interpolation based Adversarial Training for Image Classification
(
Poster
)
>
link
Adversarial training has been demonstrated to be the most effective approach to defend against adversarial attacks. However, existing adversarial training methods show apparent oscillations and overfitting issues in the training process, degrading the defense efficacy. In this work, we propose a novel framework, termed Parameter Interpolation based Adversarial Training (PIAT), that makes full use of the historical information during training. Specifically, at the end of each epoch, PIAT tunes the model parameters as the interpolation of the parameters of the previous and current epochs. Besides, we suggest using the Normalized Mean Square Error (NMSE) to further improve the robustness by aligning the relative magnitude of logits between clean and adversarial examples, rather than the absolute magnitude. Extensive experiments on several benchmark datasets and various networks show that our framework can prominently improve the model robustness and reduce the generalization error. |
Kun He · Xin Liu · Yichen Yang · Zhou Qin · Weigao Wen · Hui Xue' · John Hopcroft 🔗 |
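A minimal sketch of the parameter-interpolation step described above: at the end of each epoch, set the weights to a convex combination of the previous epoch's weights and the current ones. The interpolation coefficient is a placeholder.

import copy
import torch

def interpolate_parameters(model, prev_state_dict, alpha=0.5):
    """theta <- alpha * theta_prev + (1 - alpha) * theta_current, in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.copy_(alpha * prev_state_dict[name] + (1 - alpha) * param)

# usage sketch inside an adversarial training loop:
# prev_state = copy.deepcopy(model.state_dict())
# ... run one epoch of adversarial training ...
# interpolate_parameters(model, prev_state, alpha=0.5)
# prev_state = copy.deepcopy(model.state_dict())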
-
|
Mathematical Theory of Adversarial Deep Learning
(
Poster
)
>
link
In this Show-and-Tell Demos paper, progress on mathematical theories for adversarial deep learning is reported. Firstly, achieving robust memorization for certain neural networks is shown to be an NP-hard problem. Furthermore, neural networks with $O(Nn)$ parameters are constructed for optimal robust memorization of any dataset with dimension $n$ and size $N$ in polynomial time. Secondly, adversarial training is formulated as a Stackelberg game and is shown to result in a network with optimal adversarial accuracy when the Carlini-Wagner margin loss is used. Finally, the bias classifier is introduced and is shown to be information-theoretically secure against the original-model gradient-based attack.
|
Xiao-Shan Gao · Lijia Yu · Shuang Liu 🔗 |
-
|
Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
(
Poster
)
>
link
Robust reinforcement learning (RL) seeks to train policies that can perform well under environment perturbations or adversarial attacks. Existing approaches typically assume that the space of possible perturbations remains the same across timesteps. However, in many settings, the space of possible perturbations at a given timestep depends on past perturbations. We formally introduce temporally-coupled perturbations, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially-observable two-player zero-sum game. By finding an approximate equilibrium in this game, GRAD ensures the agent's robustness against temporally-coupled perturbations. Empirical experiments on a variety of continuous control tasks demonstrate that our proposed approach exhibits significant robustness advantages compared to baselines against both standard and temporally-coupled attacks, in both state and action spaces. |
Yongyuan Liang · Yanchao Sun · Ruijie Zheng · Xiangyu Liu · Tuomas Sandholm · Furong Huang · Stephen Mcaleer 🔗 |
-
|
Navigating Graph Robust Learning against All-Intensity Attacks
(
Poster
)
>
link
Graph Neural Networks have demonstrated exceptional performance in a variety of graph learning tasks, but their vulnerability to adversarial attacks remains a major concern. Accordingly, many defense methods have been developed to learn robust graph representations and mitigate the impact of adversarial attacks. However, most of the existing methods suffer from two major drawbacks: (i) their robustness degrades under higher-intensity attacks, and (ii) they cannot scale to large graphs. In light of this, we develop a novel graph defense method to address these limitations. Our method first applies a denoising module to recover a cleaner graph by removing edges associated with attacked nodes; then, it utilizes Mixture-of-Experts to select differentially private noises of different magnitudes to counteract the node features attacked at different intensities. In addition, the overall design of our method avoids relying on heavy adjacency matrix computations like SVD, thus enabling the framework's applicability on large graphs. |
Xiangchi Yuan · Chunhui Zhang · Yijun Tian · Chuxu Zhang 🔗 |
-
|
Towards Out-of-Distribution Adversarial Robustness
(
Poster
)
>
link
Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain, and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4\% with the best existing baseline to 25.9\% on MNIST, and from 16.9\% to 23.5\% on CIFAR10.
|
Adam Ibrahim · Charles Guille-Escuret · Ioannis Mitliagkas · Irina Rish · David Krueger · Pouya Bashivan 🔗 |
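A hedged sketch of treating each attack type as a domain and applying a REx-style penalty, as described above: minimize the mean per-attack adversarial risk plus a penalty on the variance of those risks. The attack implementations and the penalty weight are placeholders for illustration.

import torch

def rex_adversarial_loss(model, loss_fn, x, y, attacks, beta=10.0):
    risks = []
    for attack in attacks:                        # each attack: (model, x, y) -> x_adv
        x_adv = attack(model, x, y)
        risks.append(loss_fn(model(x_adv), y))
    risks = torch.stack(risks)
    return risks.mean() + beta * risks.var()      # V-REx-style objective over attack "domains"

# usage sketch: attacks = [pgd_linf, pgd_l2, pgd_l1]; loss = rex_adversarial_loss(model, F.cross_entropy, x, y, attacks)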
-
|
Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
(
Poster
)
>
link
Recent neural architecture search (NAS) frameworks have been successful in finding optimal architectures for given conditions (e.g., performance or latency). However, they search for optimal architectures in terms of their performance on clean images only, while robustness against various types of perturbations or corruptions is crucial in practice. Although several robust NAS frameworks tackle this issue by integrating adversarial training into one-shot NAS, they are limited in that they only consider robustness against adversarial attacks and require significant computational resources to discover optimal architectures for a single task, which makes them impractical in real-world scenarios. To address these challenges, we propose a novel lightweight robust zero-cost proxy that considers the consistency across features, parameters, and gradients of both clean and perturbed images at the initialization state. Our approach facilitates an efficient and rapid search for neural architectures capable of learning generalizable features that exhibit robustness across diverse perturbations. The experimental results demonstrate that our proxy can rapidly and efficiently search for neural architectures that are consistently robust against various perturbations on multiple benchmark datasets and diverse search spaces, largely outperforming existing clean zero-shot NAS and robust NAS with reduced search cost. |
Hyeonjeong Ha · Minseon Kim · Sung Ju Hwang 🔗 |
-
|
Adversarial Robustness for Tabular Data through Cost and Utility Awareness
(
Poster
)
>
link
Many machine learning applications (credit scoring, fraud detection, etc.) use data in the tabular domains. Adversarial examples can be especially damaging for these applications. Yet, existing works on adversarial robustness mainly focus on machine-learning models in the image and text domains. We argue that due to the differences between tabular data and images or text, existing threat models are inappropriate for tabular domains. These models do not capture that cost can be more important than imperceptibility, nor that the adversary could ascribe different value to the utility obtained from deploying different adversarial examples. We show that due to these differences the attack and defense methods used for images and text cannot be directly applied to the tabular setup. We address these issues by proposing new cost and utility-aware threat models tailored to capabilities and constraints of attackers targeting tabular domains. We show that our approach is effective on two tabular datasets corresponding to applications for which attacks can have economic and social implications. |
Klim Kireev · Bogdan Kulynych · Carmela Troncoso 🔗 |
-
|
Scoring Black-Box Models for Adversarial Robustness
(
Poster
)
>
link
Deep neural networks are susceptible to adversarial inputs and various methods have been proposed to defend these models against adversarial attacks under different perturbation models. The robustness of models to adversarial attacks has been analyzed by first constructing adversarial inputs for the model, and then testing the model performance on the constructed adversarial inputs. Most of these attacks require white-box access to the model, need access to data labels, and can be computationally expensive to run. We propose a simple scoring method for black-box models which indicates their robustness to adversarial input. We show that adversarially more robust models have a smaller $l_1$-norm of LIME weights and sharper explanations.
|
Jian Vora · Pranay Reddy Samala 🔗 |
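A sketch of the proposed score under stated assumptions: explain a black-box model's predictions with LIME and use the $l_1$-norm of the explanation weights as a robustness indicator, a smaller norm suggesting a more robust model. The dataset, model, and averaging over a handful of points are illustrative; the `lime` package is required.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from lime.lime_tabular import LimeTabularExplainer

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

explainer = LimeTabularExplainer(X, discretize_continuous=False, mode="classification")

def lime_l1_score(predict_proba, X_ref, n_points=20):
    norms = []
    for x in X_ref[:n_points]:
        exp = explainer.explain_instance(x, predict_proba, num_features=X_ref.shape[1])
        weights = np.array([w for _, w in exp.as_list()])
        norms.append(np.abs(weights).sum())        # l1-norm of the LIME weights
    return float(np.mean(norms))                   # smaller score => (empirically) more robust

print(lime_l1_score(model.predict_proba, X))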
-
|
When Can Linear Learners be Robust to Indiscriminate Poisoning Attacks?
(
Poster
)
>
link
We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error. Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners. For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum risk of the induced model at a given poisoning budget. Our results prove that linear learners can indeed be robust to indiscriminate poisoning if the class-wise data distributions are well-separated with low variance and the size of the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in empirical attack performance of the state-of-the-art poisoning attacks across benchmark datasets, making an important initial step towards understanding the underlying reasons some learning tasks are vulnerable to data poisoning attacks. |
Fnu Suya · Xiao Zhang · Yuan Tian · David Evans 🔗 |
-
|
Context-Aware Self-Adaptation for Domain Generalization
(
Poster
)
>
link
Domain generalization aims at developing suitable learning algorithms in source training domains such that the model learned can generalize well on a different unseen testing domain. We present a novel two-stage approach called Context-Aware Self-Adaptation (CASA) for domain generalization. CASA simulates an approximate meta-generalization scenario and incorporates a self-adaptation module to adjust pre-trained meta-source models to the meta-target domains while maintaining their predictive capability on the meta-source domains. The core concept of self-adaptation involves leveraging contextual information, such as the mean of mini-batch features, as domain knowledge to automatically adapt a model trained in the first stage to new contexts in the second stage. Lastly, we utilize an ensemble of multiple meta-source models to perform inference on the testing domain. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on standard benchmarks. |
Hao Yan · Yuhong Guo 🔗 |
-
|
Label Noise: Correcting a Correction Loss
(
Poster
)
>
link
Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels. To address this issue, researchers have explored alternative loss functions that aim to be more robust. However, many of these alternatives are heuristic in nature and still vulnerable to overfitting or underfitting. In this work, we propose a more direct approach to tackling overfitting caused by label noise. We observe that the presence of label noise implies a lower bound on the noisy generalised risk. Building upon this observation, we propose imposing a lower bound on the empirical risk during training to mitigate overfitting. Our main contribution is providing theoretical results that yield explicit, easily computable bounds on the minimum achievable noisy risk for different loss functions. We empirically demonstrate that using these bounds significantly enhances robustness in various settings, with virtually no additional computational cost. |
William Toner · Amos Storkey 🔗 |
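A minimal sketch of the idea described above: impose a lower bound $b$ on the empirical risk so the optimizer stops descending once the (noisy) training loss reaches the minimum achievable noisy risk. This is a flooding-style correction; the bound value below is a placeholder, whereas the paper's contribution is computing it explicitly for different losses and noise levels.

import torch

def bounded_risk(loss, b):
    """Flooding-style correction: gradients push the loss toward b, not below it."""
    return (loss - b).abs() + b

# usage sketch inside a training step:
# loss = criterion(model(x), noisy_y)
# bounded_risk(loss, b=0.35).backward()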
-
|
Robust Semantic Segmentation: Strong Adversarial Attacks and Fast Training of Robust Models
(
Poster
)
>
link
While a large amount of work has focused on designing adversarial attacks against image classifiers, only a few methods exist to attack semantic segmentation models. We show that attacking segmentation models presents task-specific challenges, for which we propose novel solutions. Our final evaluation protocol outperforms existing methods, and shows that those can overestimate the robustness of the models. Additionally, so far adversarial training, the most successful way for obtaining robust image classifiers, could not be successfully applied to semantic segmentation. We argue that this is because the task to be learned is more challenging, and requires significantly higher computational effort than for image classification. As a remedy, we show that by taking advantage of recent advances in robust ImageNet classifiers, one can train adversarially robust segmentation models at limited computational cost by fine-tuning robust backbones. |
Francesco Croce · Naman Singh · Matthias Hein 🔗 |
-
|
Model-tuning Via Prompts Makes NLP Models Adversarially Robust
(
Poster
)
>
link
In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations, such as word-level synonym substitutions. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than modifying the model (by appending an MLP head), MVP instead modifies the input (by appending a prompt template). Across three classification datasets, MVP improves performance against adversarial word-level synonym substitutions by an average of 8% over standard methods and even outperforms adversarial training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in robust accuracy while maintaining clean accuracy. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of MLP-FT can be attributed to the misalignment between pre-training and fine-tuning tasks, and the randomly initialized MLP parameters. |
Mrigank Raman · Pratyush Maini · Zico Kolter · Zachary Lipton · Danish Pruthi 🔗 |
-
|
Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
(
Poster
)
>
link
One of the remarkable properties of robust computer vision models is that their input-gradients are often aligned with human perception, referred to in the literature as perceptually-aligned gradients (PAGs). However, the underlying mechanisms behind these phenomena remain unknown. In this work, we provide a first explanation of PAGs via \emph{off-manifold robustness}, which states that models must be more robust off the data manifold than they are on-manifold. We first demonstrate theoretically that off-manifold robustness leads input gradients to lie approximately on the data manifold, explaining their perceptual alignment, and then confirm the same empirically for models trained with robustness regularizers. Quantifying the perceptual alignment of model gradients via their similarity with the gradients of generative models, we show that off-manifold robustness correlates well with perceptual alignment. Finally, based on the levels of on- and off-manifold robustness, we identify three different regimes of robustness that affect both perceptual alignment and model accuracy: weak robustness, Bayes-aligned robustness, and excessive robustness. |
Suraj Srinivas · Sebastian Bordt · Himabindu Lakkaraju 🔗 |
-
|
Refined and Enriched Physics-based Captions for Unseen Dynamic Changes
(
Poster
)
>
link
Vision-Language Models (VLMs), i.e., models such as CLIP trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be dealt with by transferring semantic knowledge from seen classes with the help of language models pre-trained only on texts. Two-dimensional spatial relationships and a higher semantic level have been achieved. Moreover, Visual-Question-Answer (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP, which refines and enriches qualitative and quantitative captions to bring them closer to what humans recognize by combining multiple DLs and VLMs. In particular, captions with physical scales and objects' surface properties are integrated through counting, visibility distance, and road conditions. Fine-tuned VLM models are also used, along with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental results on images with adversarial weather conditions, i.e., rain, snow, fog, landslide, and flooding, and traffic events, i.e., accidents, outperform state-of-the-art DLs and VLMs, showing a higher semantic level in captions for real-world scene descriptions. |
Hidetomo Sakaino 🔗 |
-
|
Adaptive Certified Training: Towards Better Accuracy-Robustness Tradeoffs
(
Poster
)
>
link
As deep learning models continue to advance and are increasingly utilized in real-world systems, the issue of robustness remains a major challenge. Existing certified training methods produce models that achieve high provable robustness guarantees at certain perturbation levels. However, the main problem of such models is a dramatically low standard accuracy, i.e. accuracy on clean unperturbed data, that makes them impractical. In this work, we consider a more realistic perspective of maximizing the robustness of a model at certain levels of (high) standard accuracy. To this end, we propose a novel certified training method based on a key insight that training with adaptive certified radii helps to improve both the accuracy and robustness of the model, advancing state-of-the-art accuracy-robustness tradeoffs. We demonstrate the effectiveness of the proposed method on MNIST, CIFAR-10, and TinyImageNet datasets. Particularly, on CIFAR-10 and TinyImageNet, our method yields models with up to two times higher robustness, measured as an average certified radius of a test set, at the same levels of standard accuracy compared to baseline approaches. |
Zhakshylyk Nurlanov · Frank R Schmidt · Florian Bernard 🔗 |
-
|
Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers
(
Poster
)
>
link
Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of temporal consistency makes them \textit{detectable} using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce \textit{perfect illusory attacks}, a novel form of adversarial attack on sequential decision-makers that is both effective and provably \textit{statistically undetectable}. We then propose the more versatile $\epsilon$-illusory attacks, which result in observation transitions that are consistent with the state-transition function of the adversary-free environment and can be learned end-to-end. Compared to existing attacks, we empirically find $\epsilon$-illusory attacks to be significantly harder to detect with automated methods, and a small study with human subjects (IRB approval under reference xxxxxx/xxxxx) suggests they are similarly harder to detect for humans. We propose that undetectability should be a central concern in the study of adversarial attacks on mixed-autonomy settings. |
Tim Franzmeyer · Stephen Mcaleer · Joao Henriques · Jakob Foerster · Phil Torr · Adel Bibi · Christian Schroeder 🔗 |
-
|
Certified Calibration: Bounding Worst-Case Calibration under Adversarial Attacks
(
Poster
)
>
link
Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, certification methods have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. However, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier Score or the Expected Calibration Error. We show that attacks can significantly harm calibration, and thus propose certified calibration, providing worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds via the solution of a mixed-integer program on the Expected Calibration Error. |
Cornelius Emde · Francesco Pinto · Thomas Lukasiewicz · Phil Torr · Adel Bibi 🔗 |
-
|
Don't trust your eyes: on the (un)reliability of feature visualizations
(
Poster
)
>
link
How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding by theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include black-box neural networks. |
Robert Geirhos · Roland S. Zimmermann · Blair Bilodeau · Wieland Brendel · Been Kim 🔗 |
-
|
Classifier Robustness Enhancement Via Test-Time Transformation
(
Poster
)
>
link
It has been recently discovered that adversarially trained classifiers exhibit an intriguing property, referred to as perceptually aligned gradients (PAG). PAG implies that the gradients of such classifiers possess a meaningful structure, aligned with human perception. Adversarial training is currently the best-known way to achieve classification robustness under adversarial attacks. The PAG property, however, has yet to be leveraged for further improving classifier robustness. In this work, we introduce Classifier Robustness Enhancement Via Test-Time Transformation (TETRA) -- a novel defense method that utilizes PAG, enhancing the performance of trained robust classifiers. Our method operates in two phases. First, it modifies the input image via a designated targeted adversarial attack into each of the dataset's classes. Then, it classifies the input image based on the distance to each of the modified instances, with the assumption that the shortest distance relates to the true class. We show that the proposed method achieves state-of-the-art results and validate our claim through extensive experiments on a variety of defense methods, classifier architectures, and datasets. We also empirically demonstrate that TETRA can boost the accuracy of any differentiable adversarial training classifier across a variety of attacks, including ones unseen at training. Specifically, applying TETRA leads to substantial improvement of up to $+23\%$, $+20\%$, and $+26\%$ on CIFAR10, CIFAR100, and ImageNet, respectively.
|
Tsachi Blau · Roy Ganz · Chaim Baskin · Michael Elad · Alex Bronstein 🔗 |
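A hedged sketch of the two-phase procedure described above: push the input toward each class with a targeted attack on a robust classifier, then predict the class whose targeted perturbation required the smallest change. The targeted-PGD step count, step size, and distance measure are illustrative choices, not TETRA's exact settings.

import torch
import torch.nn.functional as F

def targeted_perturb(model, x, target, steps=20, step_size=0.01):
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv - step_size * grad.sign()).detach().requires_grad_(True)   # move toward the target class
    return x_adv.detach()

def tetra_style_predict(model, x, num_classes):
    distances = []
    for c in range(num_classes):
        target = torch.full((x.shape[0],), c, dtype=torch.long)
        x_c = targeted_perturb(model, x, target)
        distances.append((x_c - x).flatten(1).norm(dim=1))      # how far the input had to move
    return torch.stack(distances, dim=1).argmin(dim=1)           # closest class wins

# toy usage with a stand-in classifier
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(2, 3, 32, 32)
print(tetra_style_predict(model, x, num_classes=10))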
-
|
CertViT: Certified Robustness of Pre-Trained Vision Transformers
(
Poster
)
>
link
Lipschitz bounded neural networks are certifiably robust and have a good trade-off between clean and certified accuracy. Existing Lipschitz bounding methods train from scratch and are limited to moderately sized networks (< 6M parameters). They require a fair amount of hyper-parameter tuning and are computationally prohibitive for large networks like Vision Transformers (5M to 660M parameters). Obtaining certified robustness of transformers is not feasible due to the non-scalability and inflexibility of the current methods. This work presents CertViT, a two-step proximal-projection method to achieve certified robustness from pre-trained weights. The proximal step tries to lower the Lipschitz bound and the projection step tries to maintain the clean accuracy of pre-trained weights. We show that CertViT networks have better certified accuracy than state-of-the-art Lipschitz trained networks. We apply CertViT on several variants of pre-trained vision transformers and show adversarial robustness using standard attacks. Code : \url{https://github.com/sagarverma/transformer-lipschitz} |
Kavya Gupta · Sagar Verma 🔗 |
-
|
Transferable Adversarial Perturbations between Self-Supervised Speech Recognition Models
(
Poster
)
>
link
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition (ASR) system to output attacker-chosen text. To exploit ASR models in real-world, black-box settings, an adversary can leverage the \textit{transferability} property, i.e. that an adversarial sample produced for a proxy ASR can also fool a different remote ASR. Recent work has shown that transferability against large ASR models is extremely difficult. In this work, we show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are uniquely affected by transferability. We successfully demonstrate this phenomenon by evaluating state-of-the-art self-supervised ASR models like Wav2Vec2, HuBERT, Data2Vec and WavLM. We show that with relatively low-level additive noise achieving a 30 dB signal-to-noise ratio, we can achieve target transferability with up to 80\% accuracy. We then use an ablation study to show that Self-Supervised learning is a major cause of that phenomenon. Our results present a dual interest: they show that modern ASR architectures are uniquely vulnerable to adversarial security threats, and they help in understanding the specificities of SSL training paradigms. |
Raphaël Olivier · Hadi Abdullah · Bhiksha Raj 🔗 |
-
|
Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change
(
Poster
)
>
link
Adversarial robustness poses a critical challenge in the deployment of deep learning models for real-world applications. Traditional approaches to adversarial training and supervised detection rely on prior knowledge of attack types and access to labeled training data, which is often impractical. Existing unsupervised adversarial detection methods identify whether the target model works properly, but they suffer from poor accuracy owing to the use of the common cross-entropy training loss, which relies on unnecessary features and strengthens adversarial attacks. We propose new training losses that reduce reliance on useless features, together with a corresponding detection method that requires no prior knowledge of adversarial attacks. The detection rate (true positive rate) against all given white-box attacks is above 93.9\% except for attacks without limits (DF($\infty$)), while the false positive rate is barely 2.5\%. The proposed method works well across all tested attack types, and its false positive rates are even better than those of methods specialized for certain types.
|
Chien Cheng Chyou · Hung-Ting Su · Winston Hsu 🔗 |
-
|
Stabilizing GNN for Fairness via Lipschitz Bounds
(
Poster
)
>
link
The Lipschitz bound, a technique from robust statistics, limits the maximum changes in output with respect to the input, considering associated irrelevant biased factors. It provides an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. However, there has been no previous research investigating the Lipschitz bounds for Graph Neural Networks (GNNs), especially in the context of non-Euclidean data with inherent biases. This poses a challenge for constraining GNN output perturbations induced by input biases and ensuring fairness during training. This paper addresses this gap by formulating a Lipschitz bound for GNNs operating on attributed graphs, and analyzing how the Lipschitz constant can constrain output perturbations induced by biases for fairness training. The effectiveness of the Lipschitz bound is experimentally validated in limiting model output biases. Additionally, from a training dynamics perspective, we demonstrate how the theoretical Lipschitz bound can effectively guide GNN training to balance accuracy and fairness. |
Yaning Jia · Chunhui Zhang 🔗 |
-
|
Equal Long-term Benefit Rate: Adapting Static Fairness Notions to Sequential Decision Making
(
Poster
)
>
link
Decisions made by machine learning models may have lasting impacts over time, making long-term fairness a crucial consideration. It has been shown that when ignoring the long-term effect of decisions, naively imposing fairness criterion in static settings can actually exacerbate bias over time. To explicitly address biases in sequential decision-making, recent works formulate long-term fairness notions in Markov Decision Process (MDP) framework. They define the long-term bias to be the sum of static bias over each time step. However, we demonstrate that naively summing up the step-wise bias can cause a false sense of fairness since it fails to consider the importance difference of states during transition. In this work, we introduce a new long-term fairness notion called Equal Long-term Benefit Rate (ELBERT), which explicitly considers state importance and can preserve the semantics of static fairness principles in the sequential setting. Moreover, we show that the policy gradient of Long-term Benefit Rate can be analytically reduced to standard policy gradient. This makes standard policy optimization methods applicable for reducing the bias, leading to our proposed bias mitigation method ELBERT-PO. Experiments on three dynamical environments show that ELBERT-PO successfully reduces bias and maintains high utility. |
Yuancheng Xu · Chenghao Deng · Yanchao Sun · Ruijie Zheng · xiyao wang · Jieyu Zhao · Furong Huang 🔗 |
-
|
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
(
Poster
)
>
link
Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. The capabilities of these instruction-following LLMs provide strong economic incentives for dual-use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at a cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and addressing these attacks may require new approaches to mitigations. |
Daniel Kang · Xuechen Li · Ion Stoica · Carlos Guestrin · Matei Zaharia · Tatsunori Hashimoto 🔗 |
-
|
Certifying Ensembles: A General Certification Theory with S-Lipschitzness
(
Poster
)
>
link
Improving and guaranteeing the robustness of deep learning models has been a topic of intense research. Ensembling, which combines several classifiers to provide a better model, has been shown to be beneficial for generalisation, uncertainty estimation, calibration, and mitigating the effects of concept drift. However, the impact of ensembling on certified robustness is less well understood. In this work, we generalise Lipschitz continuity by introducing S-Lipschitz classifiers, which we use to analyse the theoretical robustness of ensembles. Our results give precise conditions under which ensembles of robust classifiers are more robust than any constituent classifier, as well as conditions under which they are less robust. |
Aleksandar Petrov · Francisco Eiras · Amartya Sanyal · Phil Torr · Adel Bibi 🔗 |
-
|
On the Limitations of Model Stealing with Uncertainty Quantification Models
(
Poster
)
>
link
Model stealing aims at inferring a victim model's functionality at a fraction of the original training cost. While the goal is clear, in practice the model's architecture, weight dimension, and original training data cannot be determined exactly, leading to mutual uncertainty during stealing. In this work, we explicitly tackle this uncertainty by generating multiple possible networks and combining their predictions to improve the quality of the stolen model. For this, we compare five popular uncertainty quantification models in a model stealing task. Surprisingly, our results indicate that the considered models only lead to marginal improvements in terms of label agreement (i.e., fidelity) to the stolen model. To find the cause of this, we inspect the diversity of the models' predictions by looking at the prediction variance as a function of training iterations. We realize that during training, the models tend to have similar predictions, indicating that the network diversity we wanted to leverage using uncertainty quantification models is not (high) enough for improvements on the model stealing task. |
David Pape · Sina Däubener · Thorsten Eisenhofer · Antonio Emanuele Cinà · Lea Schönherr 🔗 |
-
|
PAC-Bayesian Adversarially Robust Generalization Bounds for Deep Neural Networks
(
Poster
)
>
link
Deep neural networks (DNNs) are vulnerable to adversarial attacks. It is found empirically that adversarially robust generalization is crucial in establishing defense algorithms against adversarial attacks. Therefore, it is interesting to study the theoretical guarantee of robust generalization. This paper focuses on PAC-Bayes analysis (Neyshabur et al., 2017). The main challenge lies in extending the key ingredient, which is a weight perturbation bound in standard settings, to the robust settings. Existing attempts heavily rely on additional strong assumptions, leading to loose bounds. In this paper, we address this issue and provide a spectrally-normalized robust generalization bound for DNNs. Our bound is at least as tight as the standard generalization bound, differing only by a factor of the perturbation strength $\epsilon$. In comparison to existing robust generalization bounds, our bound offers two significant advantages: 1) it does not depend on additional assumptions, and 2) it is considerably tighter. We present a framework that enables us to derive more general results. Specifically, we extend the main result to 1) adversarial robustness against general non-$\ell_p$ attacks, and 2) other neural network architectures, such as ResNet.
|
Jiancong Xiao · Ruoyu Sun · Zhi-Quan Luo 🔗 |
-
|
Sentiment Perception Adversarial Attacks on Neural Machine Translation Systems
(
Poster
)
>
link
With the advent of deep learning methods, Neural Machine Translation (NMT) systems have become increasingly powerful. However, deep learning based systems are susceptible to adversarial attacks, where imperceptible changes to the input can cause undesirable changes at the output of the system. To date there has been little work investigating adversarial attacks on sequence-to-sequence systems, such as NMT models. Previous work in NMT has examined attacks with the aim of introducing target phrases in the output sequence. In this work, adversarial attacks for NMT systems are explored from an output perception perspective. Thus the aim of an attack is to change the perception of the output sequence, without altering the perception of the input sequence. For example, an adversary may distort the sentiment of translated reviews to have an exaggerated positive sentiment. In practice it is challenging to run extensive human perception experiments, so a proxy deep-learning classifier applied to the NMT output is used to measure perception changes. Experiments demonstrate that the sentiment perception of NMT systems' output sequences can be changed significantly with small imperceptible changes to input sequences. |
Vyas Raina · Mark Gales 🔗 |
-
|
(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
(
Poster
)
>
link
We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give \emph{estimates} that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. Our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100\% of the time. The bound is inspired by $\mathcal{H}\Delta\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous guarantees. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a "disagreement loss" which is theoretically justified and performs better in practice. Across a wide range of benchmarks, our method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines.
|
Elan Rosenfeld · Saurabh Garg 🔗 |
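The bound above hinges on training one classifier (a critic) to agree with a reference model on labeled source data while disagreeing with it on unlabeled shifted data. The sketch below shows that general recipe with a plain negated cross-entropy as the disagreement term; the paper's specific "disagreement loss" is not reproduced here, and the tensor shapes and module names are assumptions.

```python
import torch
import torch.nn.functional as F

def agree_disagree_objective(critic, model, x_src, y_src, x_tgt):
    """Train a critic for a disagreement-discrepancy style bound (sketch).

    The critic is pushed to agree with the (frozen) reference model on labeled
    source data and to disagree with it on unlabeled target data. A plain
    negated cross-entropy is used as the disagreement term here; the paper
    proposes a better-behaved disagreement loss.
    """
    with torch.no_grad():
        tgt_pseudo = model(x_tgt).argmax(dim=1)      # reference predictions on target
    agree = F.cross_entropy(critic(x_src), y_src)    # match ground truth on source
    disagree = -F.cross_entropy(critic(x_tgt), tgt_pseudo)  # push away from reference on target
    return agree + disagree
```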
-
|
Feature Partition Aggregation: A Fast Certified Defense Against a Union of $\ell_0$ Attacks
(
Poster
)
>
link
Sparse or $\ell_0$ adversarial attacks arbitrarily perturb an unknown subset of the features. $\ell_0$ robustness analysis is particularly well-suited for heterogeneous (tabular) data where features have different types or scales. State-of-the-art $\ell_0$ certified defenses are based on randomized smoothing and apply to evasion attacks only. This paper proposes feature partition aggregation (FPA) - a certified defense against the union of $\ell_0$ evasion, backdoor, and poisoning attacks. FPA generates its stronger robustness guarantees via an ensemble whose submodels are trained on disjoint feature sets. Compared to state-of-the-art $\ell_0$ defenses, FPA is up to $3,000\times$ faster and provides median robustness guarantees up to $4\times$ larger, meaning FPA provides the additional dimensions of robustness essentially for free.
|
Zayd S Hammoudeh · Daniel Lowd 🔗 |
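A minimal sketch of the feature-partition-and-vote idea: split the features into disjoint blocks, train one submodel per block, and read a certified number of tolerable perturbed features off the vote gap, since each perturbed feature can flip at most one vote. The logistic-regression submodels and the floor-of-half-the-gap certificate are illustrative simplifications, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fpa(X, y, n_blocks, seed=0):
    """Train one submodel per disjoint feature block (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(X.shape[1]), n_blocks)
    models = [LogisticRegression(max_iter=1000).fit(X[:, b], y) for b in blocks]
    return blocks, models

def predict_with_certificate(blocks, models, x):
    """Plurality vote with a rough l0 certificate.

    An l0 adversary must flip roughly half the vote gap between the top class
    and the runner-up, so about gap // 2 perturbed features are tolerated;
    exact tie-breaking is handled more carefully in the paper.
    """
    votes = np.array([m.predict(x[b].reshape(1, -1))[0] for b, m in zip(blocks, models)])
    labels, counts = np.unique(votes, return_counts=True)
    order = np.argsort(counts)[::-1]
    runner_up = counts[order[1]] if len(counts) > 1 else 0
    certified_radius = (counts[order[0]] - runner_up) // 2
    return labels[order[0]], certified_radius
```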
-
|
Near Optimal Adversarial Attack on UCB Bandits
(
Poster
)
>
link
I study a stochastic multi-arm bandit problem where rewards are subject to adversarial corruption. At each round, the learner chooses an arm, and a stochastic reward is generated. The adversary strategically adds corruption to the reward, and the learner is only able to observe the corrupted reward at each round. I propose a novel attack strategy that manipulates a learner employing the upper-confidence-bound (UCB) algorithm into pulling some non-optimal target arm $T - o(T)$ times with a cumulative cost that scales as $\widehat{O}(\sqrt{\log T})$, where $T$ is the number of rounds. I also prove the first lower bound on the cumulative attack cost. The lower bound matches the upper bound up to $O(\log \log T)$ factors, showing the proposed attack strategy to be near optimal.
|
Shiliang Zuo 🔗 |
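To make the attack pattern concrete, the sketch below corrupts the reward of every non-target arm so that its running mean stays below the target arm's mean by a shrinking margin, which keeps a UCB learner returning to the target arm. The margin schedule and the unbounded corruption are illustrative simplifications of the analyzed strategy.

```python
import numpy as np

class RewardCorruptingAdversary:
    """Corrupt rewards of non-target arms so a UCB learner favors the target arm.

    Illustrative sketch: the corruption drags each non-target arm's observed
    mean below the target arm's mean minus an assumed margin schedule; the
    analyzed attack additionally bounds the total corruption.
    """
    def __init__(self, n_arms, target_arm):
        self.target = target_arm
        self.sums = np.zeros(n_arms)
        self.counts = np.zeros(n_arms)

    def corrupt(self, arm, reward, t):
        self.counts[arm] += 1
        self.sums[arm] += reward
        if arm == self.target or self.counts[self.target] == 0:
            return reward  # never corrupt the target arm
        target_mean = self.sums[self.target] / self.counts[self.target]
        margin = np.sqrt(np.log(t + 2) / self.counts[arm])     # assumed schedule
        desired_mean = target_mean - 2 * margin
        # corrupted reward that makes the arm's running mean equal desired_mean
        corrupted = desired_mean * self.counts[arm] - (self.sums[arm] - reward)
        self.sums[arm] += corrupted - reward                    # track what the learner sees
        return corrupted
```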
-
|
Learning Exponential Families from Truncated Samples
(
Poster
)
>
link
Missing data problems have many manifestations across many scientific fields. A fundamental type of missing data problem arises when samples are \textit{truncated}, i.e., samples that lie in a subset of the support are not observed. Statistical estimation from truncated samples is a classical problem in statistics which dates back to Galton, Pearson, and Fisher. A recent line of work provides the first efficient estimation algorithms for the parameters of a Gaussian distribution and for linear regression with Gaussian noise. In this paper we generalize these results to log-concave exponential families. We provide an estimation algorithm that shows that \textit{extrapolation} is possible for a much larger class of distributions while it maintains a polynomial sample and time complexity. Our algorithm is based on Projected Stochastic Gradient Descent and is not only applicable in a more general setting but is also simpler and more efficient than recent algorithms. Our work also has interesting implications for learning general log-concave distributions and sampling given only access to truncated data. |
Jane Lee · Andre Wibisono · Manolis Zampetakis 🔗 |
-
|
Identifying Adversarially Attackable and Robust Samples
(
Poster
)
>
link
Adversarial attacks insert small, imperceptible perturbations to input samples that cause large, undesired changes to the output of deep learning models. Despite extensive research on generating adversarial attacks and building defense systems, there has been limited research on understanding adversarial attacks from an input-data perspective. This work introduces the notion of sample attackability, where we aim to identify samples that are most susceptible to adversarial attacks (attackable samples) and conversely also identify the least susceptible samples (robust samples). We propose a deep-learning-based detector to identify the adversarially attackable and robust samples in an unseen dataset for an unseen target model. Experiments on standard image classification datasets enable us to assess the portability of the deep attackability detector across a range of architectures. We find that the deep attackability detector performs better than simple model uncertainty-based measures for identifying the attackable/robust samples. This suggests that uncertainty is an inadequate proxy for measuring sample distance to a decision boundary. In addition to better understanding adversarial attack theory, it is found that the ability to identify the adversarially attackable and robust samples has implications for improving the efficiency of sample-selection tasks. |
Vyas Raina · Mark Gales 🔗 |
-
|
Toward Testing Deep Learning Library via Model Fuzzing
(
Poster
)
>
link
The increasing adoption of deep learning (DL) technologies in safety-critical industries has brought about a corresponding rise in security challenges. However, the security of DL frameworks (TensorFlow, PyTorch, PaddlePaddle), which serve as the foundation of various DL models, has not garnered the attention it rightfully deserves. Vulnerabilities in DL frameworks can cause significant security risks such as compromised model reliability and data leakage. In this research project, we address this challenge with a specifically designed model fuzzing method. First, we generate diverse models to test library implementations in the training and prediction phases using optimized mutation strategies. Furthermore, we use a seed performance score, including coverage, discovery time, and mutation count, to prioritize the selection of model seeds. Our algorithm also selects the optimal mutation strategy based on heuristics to expand inconsistencies. Finally, to evaluate the effectiveness of our scheme, we implement our test framework and conduct experiments on existing DL frameworks. The preliminary results demonstrate that this is a promising direction. |
Wei Kong · huayang cao · Tong Wang · Yuanping Nie · hu li · Xiaohui Kuang 🔗 |
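A compact sketch of the differential-testing loop described above: mutate a seed model specification, execute it on two framework backends, and record large output disagreements as findings. The `build_and_run` callable and the toy mutation operator are hypothetical placeholders for the framework-specific plumbing and the optimized mutation strategies.

```python
import random
import numpy as np

MUTATIONS = ["swap_layer_order", "change_activation", "perturb_hyperparams"]  # illustrative

def apply_mutation(spec, kind):
    """Toy mutation: tweak one numeric hyperparameter in the model spec."""
    mutated = dict(spec)
    if kind == "perturb_hyperparams":
        mutated["hidden_units"] = max(1, spec.get("hidden_units", 32) + random.choice([-8, 8]))
    # other mutation kinds are omitted in this sketch
    return mutated

def fuzz_frameworks(seed_specs, build_and_run, n_iters=100, tol=1e-3):
    """Differential fuzzing sketch for DL libraries.

    `build_and_run(spec, framework, x)` is a hypothetical helper that builds
    the model described by `spec` (a dict with an "input_shape" entry) in the
    given framework and returns its outputs on input x.
    """
    findings, queue = [], list(seed_specs)
    for _ in range(n_iters):
        spec = random.choice(queue)
        mutated = apply_mutation(spec, random.choice(MUTATIONS))
        x = np.random.randn(1, *mutated["input_shape"]).astype("float32")
        out_a = build_and_run(mutated, "framework_a", x)
        out_b = build_and_run(mutated, "framework_b", x)
        if not np.allclose(out_a, out_b, atol=tol):
            findings.append((mutated, float(np.max(np.abs(out_a - out_b)))))
        queue.append(mutated)  # real seed scheduling would use coverage and discovery time
    return findings
```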
-
|
Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey
(
Poster
)
>
link
Explainable artificial intelligence (XAI) methods are portrayed as a remedy for debugging and trusting statistical and deep learning models, as well as interpreting their predictions. However, recent advances in adversarial machine learning highlight the limitations and vulnerabilities of state-of-the-art explanations, putting their security and trustworthiness into question. The possibility of manipulating, fooling or fairwashing evidence of the model's reasoning has detrimental consequences when applied in high-stakes decision-making and knowledge discovery. This concise survey of over 50 papers summarizes research concerning adversarial attacks on explanations of machine learning models, as well as fairness metrics. We discuss how to defend against attacks and design robust interpretation methods. We contribute a list of existing insecurities in XAI and outline the emerging research directions in adversarial XAI (AdvXAI). |
Hubert Baniecki · Przemyslaw Biecek 🔗 |
-
|
Sharpness-Aware Minimization Alone can Improve Adversarial Robustness
(
Poster
)
>
link
Sharpness-Aware Minimization (SAM) is an effective method for improving generalization ability by regularizing loss sharpness. In this paper, we explore SAM in the context of adversarial robustness. We find that using only SAM can achieve superior adversarial robustness without sacrificing clean accuracy compared to standard training, which is an unexpected benefit. We also discuss the relation between SAM and adversarial training (AT), a popular method for improving the adversarial robustness of DNNs. In particular, we show that SAM and AT differ in terms of perturbation strength, leading to different accuracy and robustness trade-offs. We provide theoretical evidence for these claims in a simplified model. Finally, while AT suffers from decreased clean accuracy and computational overhead, we suggest that SAM can be regarded as a lightweight substitute for AT under certain requirements. Code is available at https://github.com/weizeming/SAM_AT. |
Zeming Wei · Jingyu Zhu · Yihao Zhang 🔗 |
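For readers unfamiliar with the mechanics, SAM first ascends a distance rho along the loss gradient, evaluates the gradient at those perturbed weights, restores the weights, and applies that gradient as the update. The PyTorch sketch below is a minimal single-step version and does not reproduce the paper's comparison with adversarial training.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step (minimal sketch).

    1) compute gradients and climb to approximately the worst nearby weights,
    2) compute gradients there, 3) restore the weights and apply the update.
    """
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(p=2) for p in model.parameters() if p.grad is not None]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                           # ascend to the sharp point
            eps.append(e)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()             # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                       # restore the original weights
    optimizer.step()                            # update with the SAM gradient
    optimizer.zero_grad()
    return loss.item()
```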
-
|
On feasibility of intent obfuscating attacks
(
Poster
)
>
link
Intent obfuscation is a common tactic in adversarial situations, enabling the attacker to both manipulate the target system and avoid culpability. Surprisingly, it has rarely been implemented in adversarial attacks on machine learning systems. We are the first to propose incorporating intent obfuscation in generating adversarial examples for object detectors: by perturbing another non-overlapping object to disrupt the target object, the attacker hides their intended target. We conduct a randomized experiment on 5 prominent detectors---YOLOv3, SSD, RetinaNet, Faster R-CNN, and Cascade R-CNN---using both targeted and untargeted attacks and achieve success on all models and attacks. We analyze the success factors characterizing intent obfuscating attacks, including target object confidence and perturb object sizes. We then demonstrate that the attacker can exploit these success factors to increase success rates for all models and attacks. Finally, we discuss known defenses and legal repercussions. |
ZhaoBin Li · Patrick Shafto 🔗 |
-
|
Adversarial Training with Generated Data in High-Dimensional Regression: An Asymptotic Study
(
Poster
)
>
link
In recent years, studies such as (Carmon et al., 2019; Gowal et al., 2021; Xing et al., 2022) have demonstrated that incorporating additional real or generated data with pseudo-labels can enhance adversarial training through a two-stage training approach. In this paper, we perform a theoretical analysis of the asymptotic behavior of this method in high-dimensional linear regression. While a double-descent phenomenon can be observed in ridgeless training, with an appropriate $\mathcal{L}_2$ regularization, the two-stage adversarial training achieves a better performance. Finally, we derive a shortcut cross-validation formula specifically tailored for the two-stage training method.
|
Yue Xing 🔗 |
-
|
Benchmarking the Reliability of Post-training Quantization: a Particular Focus on Worst-case Performance
(
Oral
)
>
link
The reliability of post-training quantization (PTQ) methods in the face of extreme cases such as distribution shift and data noise remains largely unexplored, despite the popularity of PTQ as a method for compressing deep neural networks (DNNs) without altering their original architecture or training procedures. This paper conducts an investigation on commonly-used PTQ methods, addressing research questions pertaining to the impact of calibration set distribution variations, calibration paradigm selection, and data augmentation or sampling strategies on the reliability of PTQ. Through a systematic evaluation process encompassing various tasks and commonly-used PTQ paradigms, it is evident that the majority of existing PTQ methods lack the necessary reliability for worst-case group performance, underscoring the imperative for more robust approaches. |
🔗 |
-
|
Establishing a Benchmark for Adversarial Robustness of Compressed Deep Learning Models after Pruning
(
Oral
)
>
link
The increasing size of Deep Neural Networks (DNNs) poses a pressing need for model compression, particularly when employed on resource-constrained devices. Concurrently, the susceptibility of DNNs to adversarial attacks presents another significant hurdle. Despite substantial research on both model compression and adversarial robustness, their joint examination remains underexplored. Our study bridges this gap, seeking to understand the effect of adversarial inputs crafted for base models on their pruned versions. To examine this relationship, we have developed a comprehensive benchmark across diverse adversarial attacks and popular DNN models. We uniquely focus on models not previously exposed to adversarial training and apply pruning schemes optimized for accuracy and performance. Our findings reveal that while the benefits of pruning -- enhanced generalizability, compression, and faster inference times -- are preserved, adversarial robustness remains comparable to the base model. This suggests that model compression, while offering its unique advantages, does not undermine adversarial robustness. |
🔗 |
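The benchmark protocol above amounts to: craft adversarial examples against the dense base model, prune a copy of the model, and compare clean accuracy with accuracy on the transferred adversarial inputs. A minimal sketch using global magnitude pruning is shown below; the `attack` callable stands in for whichever attack library is used.

```python
import copy
import torch
import torch.nn.utils.prune as prune

def benchmark_pruned_robustness(model, attack, loader, sparsity=0.5, device="cpu"):
    """Compare clean and transferred-adversarial accuracy of a pruned copy (sketch).

    `attack(model, x, y)` is assumed to return adversarial examples crafted
    against the *base* model, e.g. a PGD routine from an attack library.
    """
    pruned = copy.deepcopy(model)
    params = [(m, "weight") for m in pruned.modules()
              if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=sparsity)

    stats = {"clean": 0, "adv": 0, "total": 0}
    pruned.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack(model, x, y)                  # crafted on the dense base model
        with torch.no_grad():
            stats["clean"] += (pruned(x).argmax(1) == y).sum().item()
            stats["adv"] += (pruned(x_adv).argmax(1) == y).sum().item()
            stats["total"] += y.numel()
    return {k: v / stats["total"] for k, v in stats.items() if k != "total"}
```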
-
|
Robustness through Loss Consistency Regularization
(
Oral
)
>
link
While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. |
🔗 |
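Because the regularizer acts at the loss level rather than on intermediate features, it fits in a few lines: compute per-example losses on the original and augmented samples and penalize the gap between them. The squared-gap penalty below is an illustrative choice and may differ from the exact form used in the paper.

```python
import torch
import torch.nn.functional as F

def loss_level_consistency(model, x, y, x_aug, y_aug, lam=1.0):
    """Loss-level consistency regularization (illustrative sketch).

    Works for covariant augmentations because only the per-sample losses, not
    intermediate representations, are forced to agree; the augmented sample
    may carry a different label y_aug than the original.
    """
    loss = F.cross_entropy(model(x), y, reduction="none")
    loss_aug = F.cross_entropy(model(x_aug), y_aug, reduction="none")
    consistency = ((loss - loss_aug) ** 2).mean()       # assumed squared-gap penalty
    return 0.5 * (loss.mean() + loss_aug.mean()) + lam * consistency
```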
-
|
Expressivity of Graph Neural Networks Through the Lens of Adversarial Robustness
(
Oral
)
>
link
We perform the first adversarial robustness study into Graph Neural Networks (GNNs) that are provably more powerful than traditional Message Passing Neural Networks (MPNNs). In particular, we use adversarial robustness as a tool to uncover a significant gap between their theoretically possible and empirically achieved expressive power. To do so, we focus on the ability of GNNs to count specific subgraph patterns, which is an established measure of expressivity, and extend the concept of adversarial robustness to this task. Based on this, we develop efficient adversarial attacks for subgraph counting and show that more powerful GNNs fail to generalize even to small perturbations to the graph's structure. Expanding on this, we show that such architectures also fail to count substructures on out-of-distribution graphs. |
🔗 |
-
|
Introducing Vision into Large Language Models Expands Attack Surfaces and Failure Implications
(
Oral
)
>
link
Recently, there has been a surge of interest in introducing vision into Large Language Models (LLMs). The proliferation of large Visual Language Models (VLMs), such as Flamingo, BLIP-2, and GPT-4, signifies an exciting convergence of advancements in both visual and language foundation models. Yet, risks associated with this integrative approach are largely unexamined. We shed light on the security implications of this trend. First, we underscore that the continuous and redundant nature of the additional visual input space makes it a fertile ground for adversarial attacks. This unavoidably expands the attack surfaces of LLMs, thus complicating defenses. Specifically, we demonstrate that attackers can craft adversarial visual inputs to circumvent the safety mechanisms of LLMs, inducing biased behaviors of the models in the language domain. Second, we point out the broad functionality of LLMs, in turn, also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. By revealing these risks, we emphasize the urgent need for thorough risk assessment, robust defense strategies, and responsible deployment practices to ensure the secure and safe use of VLMs. |
🔗 |
-
|
The Future of Cyber Systems: Human-AI Reinforcement Learning with Adversarial Robustness
(
Oral
)
>
link
Integrating adversarial machine learning (AML) with cyber data representations that support reinforcement learning would unlock human-AI systems with a capacity to dynamically defend against novel attacks, robustly, at machine speed, and with human intelligence. All machine learning (ML) has an underpinning need for robustness to natural errors and malicious tampering. However, unlike many consumer/commercial models, all ML systems built for cyber will be operating in an inherently adversarial environment with skilled adversaries taking advantage of any flaw. This paper outlines the research challenges, integration points, and programmatic importance of such a system, while highlighting the social and scientific benefits of pursuing this ambitious program. |
🔗 |
-
|
Provably Robust Cost-Sensitive Learning via Randomized Smoothing
(
Oral
)
>
link
We focus on learning adversarially robust classifiers under a cost-sensitive scenario, where the potential harm of different class-wise adversarial transformations is encoded in a cost matrix. Existing methods are either empirical, and thus cannot certify cost-sensitive robustness, or suffer from inherent scalability issues. In this work, we study whether randomized smoothing, a more scalable robustness certification framework, can be leveraged to certify cost-sensitive robustness. We first show how to extend the vanilla randomized smoothing pipeline to provide rigorous cost-sensitive robustness guarantees for arbitrary binary cost matrices. However, when extending the standard smoothed classifier training method to cost-sensitive settings, the naive reweighting scheme does not achieve the desired performance due to the indirect optimization of the base classifier. Inspired by this observation, we propose a more direct training method with fine-grained certified radius optimization schemes designed for different data subgroups. Experiments on image benchmark datasets demonstrate that, without sacrificing overall accuracy, our method significantly improves certified cost-sensitive robustness. |
🔗 |
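As a reminder of the underlying machinery, standard randomized smoothing estimates the top-class probability under Gaussian noise and converts it into an L2 certified radius; a cost-sensitive variant only needs that radius to hold against class pairs marked costly in the cost matrix. The sketch below is a minimal Monte-Carlo certificate without the confidence-interval correction a rigorous implementation requires.

```python
import torch
from scipy.stats import norm

def smoothed_certificate(model, x, sigma=0.25, n=1000, device="cpu"):
    """Monte-Carlo randomized smoothing certificate (minimal sketch).

    Returns the smoothed prediction and an uncalibrated L2 radius
    sigma * Phi^{-1}(p_top). A rigorous version lower-bounds p_top with a
    confidence interval, and a cost-sensitive version only requires the
    radius to hold against classes with nonzero cost.
    """
    model.eval()
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape, device=device)
        votes = model(noisy).argmax(dim=1)
    counts = torch.bincount(votes)
    top = counts.argmax().item()
    p_top = counts[top].item() / n
    radius = sigma * norm.ppf(min(max(p_top, 1e-6), 1 - 1e-6))
    return top, max(radius, 0.0)
```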
-
|
Like Oil and Water: Group Robustness and Poisoning Defenses Don’t Mix
(
Oral
)
>
link
Group robustness has become a major concern in machine learning (ML) as conventional training paradigms were found to produce high error on minority groups. Without explicit group annotations, proposed solutions rely on heuristics that aim to identify and then amplify the minority samples during training. In our work, we first uncover a critical shortcoming of these heuristics: an inability to distinguish legitimate minority samples from poison samples in the training set. By amplifying poison samples as well, group robustness methods inadvertently boost the success rate of an adversary---e.g., from 0\% without amplification to over 97\% with it. Moreover, scrutinizing recent poisoning defenses both in centralized and federated learning, we observe that they rely on similar heuristics to identify which samples should be eliminated as poisons. In consequence, minority samples are eliminated along with poisons, which damages group robustness---e.g., from 55\% without the removal of the minority samples to 41\% with it. Finally, as they pursue opposing goals using similar heuristics, our attempts to conciliate group robustness and poisoning defenses come up short. We hope our work highlights how benchmark-driven ML scholarship can obscure the tensions between different metrics, potentially leading to harmful consequences. |
🔗 |
-
|
Provable Instance Specific Robustness via Linear Constraints
(
Oral
)
>
link
Deep Neural Networks (DNNs) trained for classification tasks are vulnerable to adversarial attacks. But not all the classes are equally vulnerable. Adversarial training does not make all classes or groups equally robust as well. For example, in classification tasks with long-tailed distributions, classes are asymmetrically affected during adversarial training, with lower robust accuracy for less frequent classes. In this regard, we propose a provable robustness method by leveraging the continuous piecewise-affine (CPA) nature of DNNs. Our method can impose linearity constraints on the decision boundary, as well as the DNN CPA partition, without requiring any adversarial training. Using such constraints, we show that the margin between the decision boundary and minority classes can be increased in a provable manner. We also present qualitative and quantitative validation of our method for class-specific robustness. |
🔗 |
-
|
Adversarial Training in Continuous-Time Models and Irregularly Sampled Time-Series: A First Look
(
Oral
)
>
link
This study presents the first steps of exploring the effects of adversarial training on continuous-time models and irregularly sampled time series data. Historically, these models and sampling techniques have been largely neglected in adversarial learning research, leading to a significant gap in our understanding of their performance under adversarial conditions. To address this, we conduct an empirical study of adversarial training techniques applied to time-continuous model architectures and sampling methods. Our findings suggest that while continuous-time models tend to outperform their discrete counterparts when trained conventionally, this performance advantage diminishes almost entirely when adversarial training is employed. This indicates that adversarial training may interfere with the time-continuous representation, effectively neutralizing the benefits typically associated with these models. We believe these first insights will be important for guiding further studies and advancements in the understanding of adversarial learning in continuous-time models. |
🔗 |
-
|
Few-shot Anomaly Detection via Personalization
(
Oral
)
>
link
Even with a plentiful amount of normal samples, anomaly detection has been considered a challenging machine learning task due to its one-class nature, i.e., the lack of anomalous samples at training time. It is only recently that a few-shot regime of anomaly detection, despite its wide applicability, became feasible, e.g., with help from large vision-language pre-trained models such as CLIP. In this paper, we explore the potential of large text-to-image generative models in performing few-shot anomaly detection. Specifically, recent text-to-image models have shown an unprecedented ability to generalize from few images to extract their common and unique concepts, and even encode them into a textual token to "personalize" the model: so-called textual inversion. Here, we question whether this personalization is specific enough to discriminate the given images from their potential anomalies, which are often, e.g., open-ended, local, and hard to detect. We observe that standard textual inversion is not enough for detecting anomalies accurately, and thus we propose a simple yet effective regularization scheme to enhance its specificity, derived from the zero-shot transferability of CLIP. We also propose a self-tuning scheme to further optimize the performance of our detection pipeline, leveraging synthetic data generated from the personalized generative model. Our experiments show that the proposed inversion scheme achieves state-of-the-art results on a wide range of few-shot anomaly detection benchmarks. |
🔗 |
-
|
Rethinking Label Poisoning for GNNs: Pitfalls and Attacks
(
Oral
)
>
link
Node labels for graphs are usually generated using an automated process, or crowd-sourced from human users. This opens up avenues for malicious users to compromise the training labels, making it unwise to blindly rely on them. While robustness against noisy labels is an active area of research, there are only a handful of papers in the literature that address this for graph-based data. Even more so, the effects of adversarial label perturbations are sparsely studied. A recent work revealed that the entire literature on label poisoning for GNNs is plagued by serious evaluation pitfalls and showed that existing attacks become ineffective once these shortcomings are fixed. In this work, we introduce two new simple yet effective attacks that are significantly stronger (up to $\sim8\%$) than the previous strongest attack. Our work demonstrates the need for more robust defense mechanisms, especially considering the \emph{transferability} of our attacks, where a strategy devised for one model can effectively contaminate numerous other models.
|
🔗 |
-
|
Shrink & Cert: Bi-level Optimization for Certified Robustness
(
Oral
)
>
link
In this paper, we advance the concept of shrinking weights to train certifiably robust models from the fresh perspective of gradient-based bi-level optimization. Lack of robustness against adversarial attacks remains a challenge in safety-critical applications. Many attempts in the literature only provide empirical verification of defenses against certain attacks and can be easily broken. Methods in other lines of work can only develop certified guarantees of model robustness in limited scenarios and are computationally expensive. We present a weight shrinkage formulation that is computationally inexpensive and can be solved as a simple first-order optimization problem. We show that a model trained with our method has lower Lipschitz bounds in each layer, which directly provides formal guarantees on certified robustness. We demonstrate that our approach, Shrink \& Cert (SaC), achieves provably robust networks which simultaneously give excellent standard and robust accuracy. We demonstrate the success of our approach on the CIFAR-10 and ImageNet datasets and compare it with existing robust training techniques. Code : \url{https://github.com/sagarverma/BiC} |
🔗 |
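The certificate rests on per-layer Lipschitz constants: for a sequential network with 1-Lipschitz activations, the product of the weight matrices' spectral norms upper-bounds the network's Lipschitz constant, so shrinking weights shrinks the bound. The helper below computes this crude bound for the linear layers of a PyTorch model; convolutions and skip connections need a dedicated operator-norm estimate.

```python
import torch

def naive_lipschitz_upper_bound(model):
    """Product of spectral norms of the weight matrices (crude upper bound).

    Valid as an upper bound for sequential models with 1-Lipschitz activations
    such as ReLU; it says nothing tight about skip connections or convolutions,
    which require a dedicated operator-norm computation.
    """
    bound = 1.0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()
    return bound
```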
-
|
Preventing Reward Hacking with Occupancy Measure Regularization
(
Oral
)
>
link
Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent’s state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment. |
🔗 |
-
|
Evading Black-box Classifiers Without Breaking Eggs
(
Oral
)
>
link
Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware, harmful content, etc). Queries to such systems carry a fundamentally *asymmetric cost*: queries detected as "bad" come at a higher cost because they trigger additional security filters, e.g., usage throttling or account suspension. Yet, we find that existing decision-based attacks issue a large number of "bad" queries, which likely renders them ineffective against security-critical systems. We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries. We thus pose it as an open problem to build black-box attacks that are more effective under realistic cost metrics.
|
🔗 |
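The asymmetric cost model is easy to instrument: wrap the black-box classifier so that queries flagged as "bad" are billed at a higher rate than benign ones, and report attack cost under that metric. The wrapper below is a small sketch with an assumed cost ratio; the flagging mechanism is supplied by the caller.

```python
class AsymmetricCostOracle:
    """Black-box classifier wrapper that tracks 'bad' and total query counts.

    `classify(x)` is assumed to return (label, is_flagged_bad); the cost ratio
    between flagged and benign queries is an illustrative parameter.
    """
    def __init__(self, classify, bad_query_cost=10.0, benign_query_cost=1.0):
        self.classify = classify
        self.bad_cost = bad_query_cost
        self.benign_cost = benign_query_cost
        self.n_bad = 0
        self.n_total = 0

    def __call__(self, x):
        label, flagged = self.classify(x)
        self.n_total += 1
        self.n_bad += int(flagged)
        return label

    @property
    def total_cost(self):
        benign = self.n_total - self.n_bad
        return self.n_bad * self.bad_cost + benign * self.benign_cost
```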
-
|
Deceptive Alignment Monitoring
(
Oral
)
>
link
As the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves. The threat that a model might behave in a seemingly reasonable manner, while secretly and subtly modifying its behavior for ulterior reasons is often referred to as deceptive alignment in the AI Safety & Alignment communities. Consequently, we call this new direction Deceptive Alignment Monitoring. In this work, we identify emerging directions in diverse machine learning subfields that we believe will become increasingly important and intertwined in the near future for deceptive alignment monitoring, and we argue that advances in these fields present both long-term challenges and new research opportunities. We conclude by advocating for greater involvement by the adversarial machine learning community in these emerging directions. |
🔗 |
-
|
Baselines for Identifying Watermarked Large Language Models
(
Oral
)
>
link
We consider the emerging problem of identifying the presence and use of watermarking schemes in widely used, publicly hosted, closed source large language models (LLMs). That is, rather than determine if a given text is generated by a watermarked language model, we seek to answer the question of if the model itself is watermarked. To do so, we introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce distributions that diverge qualitatively and identifiably from standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to watermarking scenario. |
🔗 |
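One of the baseline ideas above, comparing output token distributions, can be sketched as follows: sample next tokens from the suspect model and from an unmarked reference on the same prompts and measure the divergence of their empirical distributions. The `sample_next_tokens` helper and the use of a smoothed KL divergence are illustrative assumptions.

```python
import math
from collections import Counter

def distribution_divergence_score(sample_next_tokens, prompts, n_samples=200):
    """Score how far a suspect model's next-token distribution is from a reference.

    `sample_next_tokens(model_name, prompt, n)` is a hypothetical helper that
    returns n sampled next tokens; a watermarked model's token frequencies tend
    to diverge identifiably from an unmarked reference on the same prompts.
    """
    total_kl = 0.0
    for prompt in prompts:
        suspect = Counter(sample_next_tokens("suspect", prompt, n_samples))
        reference = Counter(sample_next_tokens("reference", prompt, n_samples))
        vocab = set(suspect) | set(reference)
        for tok in vocab:
            p = (suspect[tok] + 1) / (n_samples + len(vocab))     # add-one smoothing
            q = (reference[tok] + 1) / (n_samples + len(vocab))
            total_kl += p * math.log(p / q)
    return total_kl / len(prompts)   # higher score suggests a watermark
```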
-
|
Why do universal adversarial attacks work on large language models?: Geometry might be the answer
(
Oral
)
>
link
Transformer based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations. We believe this new geometric perspective on the underlying mechanism driving universal attacks could help us gain deeper insight into the internal workings and failure modes of LLMs, thus enabling their mitigation. |
🔗 |
-
|
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation
(
Oral
)
>
link
We present FACADE, a novel probabilistic and geometric framework designed for unsupervised mechanistic anomaly detection in deep neural networks. Its primary goal is advancing the understanding and mitigation of adversarial attacks. FACADE aims to generate probabilistic distributions over circuits, which provide critical insights into their contribution to changes in the manifold properties of pseudo-classes, or high-dimensional modes in activation space, yielding a powerful tool for uncovering and combating adversarial attacks. Our approach seeks to improve model robustness, enhance scalable model oversight, and demonstrates promising applications in real-world deployment settings. |
🔗 |
-
|
Robust Deep Learning via Layerwise Tilted Exponentials
(
Oral
)
>
link
State-of-the-art techniques for enhancing robustness of deep networks mostly rely on end-to-end training with suitable data augmentation. In this paper, we propose a complementary approach aimed at enhancing the signal-to-noise ratio at intermediate network layers, loosely motivated by the classical communication-theoretic model of signaling in Gaussian noise. We seek to learn neuronal weights which are matched to the layer inputs by supplementing end-to-end costs with a tilted exponential (TEXP) objective function which depends on the activations at the layer outputs. We show that TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. TEXP inference is accomplished by replacing batch norm by a tilted softmax enforcing competition across neurons, which can be interpreted as computation of posterior probabilities for the signaling hypotheses represented by each neuron. We show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise, other common corruptions and mild adversarial perturbations, without requiring data augmentation. Further gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with adversarial training. |
🔗 |
-
|
Learning Shared Safety Constraints from Multi-task Demonstrations
(
Oral
)
>
link
Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks. |
🔗 |
-
|
Teach GPT To Phish
(
Oral
)
>
link
Quantifying privacy risks in large language models (LLMs) is an important research question. We take a step towards answering this question by defining a real-world threat model wherein an entity seeks to augment an LLM with private data they possess via fine-tuning. The entity also seeks to improve the quality of its LLM outputs over time by learning from human feedback. We propose a novel |
🔗 |
-
|
How Can Neuroscience Help Us Build More Robust Deep Neural Networks?
(
Oral
)
>
link
Although Deep Neural Networks (DNNs) are often compared to biological visual systems, they are far less robust to natural and adversarial examples. In contrast, biological visual systems can reliably recognize different objects under a variety of settings. While recent innovations have closed the performance gap between biological and artificial vision systems to some extent, there are still many practical differences between the two. In this Blue Sky Ideas presentation, we will identify some key differences between standard DNNs and biological perceptual systems that may contribute to this lack of robustness. We will then discuss possible avenues for future work by identifying promising DNNs that are constructed with these biological computational motifs but have hardly been examined in terms of robustness. |
🔗 |
-
|
A physics-oriented method for attacking SAR images using salient regions
(
Oral
)
>
link
The use of deep neural networks in SAR target recognition makes it vulnerable to adversarial attacks. Previous studies have utilized optical image attacks, electromagnetic scattering parameter models, and structural parameter perturbation in generating SAR adversarial examples. The imaging process for SAR images in the physical world is dissimilar to that of optical images because SAR imaging is solely governed by imaging equations rather than the what-you-see-is-what-you-get principle. As a result, generating SAR adversarial samples in the physical world requires considering the changes in SAR imaging equations that happen after deploying physical devices. Thus, this study proposes a physical attack technique reliant on salient regions to add adversarial scatterers in the physical domain, masking the salient regions identified by classifiers in SAR images and subsequently degrading the classification capabilities of the classifiers. In contrast to previous algorithms, the proposed algorithm distinguishes itself through two key features: (1) SAR-BagNet is utilized to identify the salient regions of SAR targets recognized by classifiers, allowing for the exact position and size determination of the adversarial scatterers and enhancing interpretability; (2) dynamic step size optimization, based on the difference equation, continuously refines the electromagnetic parameters, structural parameters, and texture parameters of the adversarial scatterers, leading to a higher search efficiency. The simulation experiments demonstrated that the generated adversarial samples, after adding and modifying the design parameters of the adversarial scatterers in the initial physical model, reduced the classification accuracy of classifiers on the simulated images from 100% to 14.4%. These experimental results indicate that the proposed method has considerable potential for further exploration and research on physical-domain adversarial attacks in SAR. |
🔗 |
-
|
Accurate, Explainable, and Private Models: Providing Recourse While Minimizing Training Data Leakage
(
Oral
)
>
link
Machine learning models are increasingly utilized across impactful domains to predict individual outcomes. As such, many models provide algorithmic recourse to individuals who receive negative outcomes. However, recourse can be leveraged by adversaries to disclose private information. This work presents the first attempt at mitigating such attacks. We present two novel methods to generate differentially private recourse: Differentially Private Model ($\texttt{DPM}$) and Laplace Recourse ($\texttt{LR}$). Using logistic regression classifiers and real world and synthetic datasets, we find that $\texttt{DPM}$ and $\texttt{LR}$ perform well in reducing what an adversary can infer, especially at low $\texttt{FPR}$. When training dataset size is large enough, we find particular success in preventing privacy leakage while maintaining model and recourse accuracy with our novel $\texttt{LR}$ method.
|
🔗 |
-
|
Theoretically Principled Trade-off for Stateful Defenses against Query-Based Black-Box Attacks
(
Oral
)
>
link
Adversarial examples threaten the integrity of machine learning systems with alarming success rates even under constrained black-box conditions. Stateful defenses have emerged as an effective countermeasure, detecting potential attacks by maintaining a buffer of recent queries and detecting new queries that are too similar. However, these defenses fundamentally pose a trade-off between attack detection and false positive rates, and this trade-off is typically optimized by hand-picking feature extractors and similarity thresholds that empirically work well. There is little current understanding as to the formal limits of this trade-off and the exact properties of the feature extractors/underlying problem domain that influence it. This work aims to address this gap by offering a theoretical characterization of the trade-off between detection and false positive rates for stateful defenses. We provide upper bounds for detection rates of a general class of feature extractors and analyze the impact of this trade-off on the convergence of black-box attacks. We then support our theoretical findings with empirical evaluations across multiple datasets and stateful defenses. |
🔗 |
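The trade-off analyzed above is governed by a single moving part: how close a new query's features must be to something in the recent-query buffer before it is flagged. A minimal buffer-based detector is sketched below; the feature extractor and the threshold are exactly the knobs whose limits the paper characterizes.

```python
import numpy as np
from collections import deque

class StatefulQueryDetector:
    """Flag queries whose features are too close to any recent query (sketch).

    `feature_extractor(x)` maps an input to a vector; lowering `threshold`
    catches more attack sequences but also raises the false positive rate,
    which is the trade-off characterized in the paper.
    """
    def __init__(self, feature_extractor, threshold=0.1, buffer_size=1000):
        self.extract = feature_extractor
        self.threshold = threshold
        self.buffer = deque(maxlen=buffer_size)

    def check(self, x):
        feat = np.asarray(self.extract(x), dtype=np.float32)
        is_attack = any(np.linalg.norm(feat - prev) < self.threshold
                        for prev in self.buffer)
        self.buffer.append(feat)
        return is_attack
```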
-
|
DiffScene: Diffusion-Based Safety-Critical Scenario Generation for Autonomous Vehicles
(
Oral
)
>
link
The field of Autonomous Driving (AD) has witnessed significant progress in recent years. Among the various challenges faced, the safety evaluation of autonomous vehicles (AVs) stands out as a critical concern. Traditional evaluation methods are both costly and inefficient, often requiring extensive driving mileage in order to encounter rare safety-critical scenarios, which are distributed on the long tail of the complex real-world driving landscape. In this paper, we propose a unified approach, Diffusion-Based Safety-Critical Scenario Generation (DiffScene), to generate high-quality safety-critical scenarios which are both realistic and safety-critical for efficient AV evaluation. In particular, we propose a diffusion-based generation framework, leveraging the power of approximating the distribution of low-density spaces for diffusion models. We design several adversarial optimization objectives to guide the diffusion generation under predefined adversarial budgets. These objectives, such as safety-based objective, functionality-based objective, and constraint-based objective, ensure the generation of safety-critical scenarios while adhering to specific constraints. Extensive experimentation has been conducted to validate the efficacy of our approach. Compared with 6 SOTA baselines, DiffScene generates scenarios that are (1) more safety-critical under 3 metrics, (2) more realistic under 5 distance functions, and (3) more transferable to different AV algorithms. In addition, we demonstrate that training AV algorithms with scenarios generated by DiffScene leads to significantly higher performance in terms of the safety-critical metrics compared to baselines. These findings highlight the potential of DiffScene in addressing the challenges of AV safety evaluation, paving the way for more efficient and effective AV development. |
🔗 |
-
|
Improving Adversarial Training for Multiple Perturbations through the Lens of Uniform Stability
(
Oral
)
>
link
In adversarial training (AT), most existing works focus on AT with a single type of perturbation, such as the $\ell_\infty$ attacks. However, deep neural networks (DNNs) are vulnerable to different types of adversarial examples, necessitating the development of adversarial training for multiple perturbations (ATMP). Despite the benefits of ATMP, there exists a trade-off between different types of attacks. Furthermore, there is a lack of theoretical analyses of ATMP, which hinders its further development. To address these issues, we conduct a smoothness analysis of ATMP. Our analysis reveals that $\ell_1$, $\ell_2$, and $\ell_\infty$ adversaries contribute differently to the smoothness of the loss function in ATMP. Leveraging these smoothness properties, we investigate the improvement of ATMP through the lens of uniform stability. Through our research, we demonstrate that employing an adaptive smoothness-weighted learning rate leads to enhanced uniform stability bounds, thus improving adversarial training for multiple perturbations. We validate our findings through experiments on CIFAR-10 and CIFAR-100 datasets, where our approach achieves competitive performance against various mixtures of multiple perturbation attacks. This work contributes to a deeper understanding of ATMP and provides practical insights for improving the robustness of DNNs against diverse adversarial examples.
|
🔗 |
-
|
A Theoretical Perspective on the Robustness of Feature Extractors
(
Oral
)
>
link
Recent theoretical work on robustness to adversarial examples has derived lower bounds on how robust any model can be when the distribution and adversarial constraints are specified. However, these bounds do not account for the specific models used in practice, such as neural networks. In this paper, we develop a methodology to analyze the fundamental limits on the robustness of fixed feature extractors, which in turn provides bounds on the robustness of classifiers trained on top of them. The tightness of these bounds relies on the effectiveness of the method used to find collisions between pairs of perturbed examples at deeper layers. For linear feature extractors, we provide closed-form expressions for collision finding while for piece-wise linear feature extractors, we propose a bespoke algorithm based on the iterative solution of a convex program that provably finds collisions. We utilize our bounds to identify structural features of classifiers that lead to a lack of robustness and provide insights into the effectiveness of different training methods at obtaining robust feature extractors. |
🔗 |
-
|
Characterizing the Optimal $0-1$ Loss for Multi-class Classification with a Test-time Attacker
(
Oral
)
>
link
Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a fixed data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on robust loss in the presence of a test-time attacker for *multi-class classifiers on any discrete dataset*. We provide a general framework for finding the optimal $0-1$ loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. The prohibitive cost of this formulation in practice leads us to formulate other variants of the attacker-classifier game that more efficiently determine the range of the optimal loss. Our evaluation shows, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.
|
🔗 |
-
|
RODEO: Robust Out-of-distribution Detection via Exposing Adaptive Outliers
(
Oral
)
>
link
Detecting out-of-distribution (OOD) input samples at inference time is a key element in the trustworthy deployment of intelligent models. While there has been a tremendous improvement in various flavors of OOD detection in recent years, the detection performance under adversarial settings lags far behind the performance in the standard setting. In order to bridge this gap, we introduce RODEO in this paper, a data-centric approach that generates effective outliers for robust OOD detection. More specifically, we first show that targeting the classification of adversarially perturbed in- and out-of-distribution samples through outlier exposure (OE) could be an effective strategy for the mentioned purpose as long as the training outliers meet certain quality standards. We hypothesize that the outliers in the OE should possess several characteristics simultaneously to be effective in the adversarial training: diversity, and both conceptual differentiability and analogy to the inliers. These aspects seem to play a more critical role in the adversarial setup compared to the standard training. Next, we propose to take advantage of existing text-to-image generative models, conditioned on the inlier or normal samples, and text prompts that minimally edit the normal samples, and turn them into near-distribution outliers. This process helps to satisfy the three mentioned criteria for the generated outliers, and significantly boosts the performance of OE especially in the adversarial setting. We demonstrate the general effectiveness of this approach in various related problems including novelty/anomaly detection, Open-Set Recognition (OSR), and OOD detection. We also make a comprehensive comparison of our method against other adaptive OE techniques under the adversarial setting to showcase its effectiveness. |
🔗 |
-
|
Rethinking Robust Contrastive Learning from the Adversarial Perspective
(
Oral
)
>
link
To advance the understanding of robust deep learning, we delve into the effects of adversarial training on self-supervised and supervised contrastive learning, alongside supervised learning. Our analysis uncovers significant disparities between adversarial and clean representations in standard-trained networks, across various learning algorithms. Remarkably, adversarial training mitigates these disparities and fosters the convergence of representations toward a universal set, regardless of the learning scheme used. Additionally, we observe that increasing the similarity between adversarial and clean representations, particularly near the end of the network, enhances network robustness. These findings offer valuable insights for designing and training effective and robust deep learning networks. |
🔗 |
-
|
TMI! Finetuned Models Spill Secrets from Pretraining
(
Oral
)
>
link
Transfer learning has become an increasingly popular technique in machine learning as a way to leverage a pretrained model trained for related tasks. This paradigm has been especially popular for \emph{privacy preserving machine learning}, where the pretrained model is considered public, and only the data for finetuning is considered sensitive. However, there are reasons to believe that the data used for pretraining is still sensitive. In this work we study privacy leakage via membership-inference attacks, and we propose a new threat model where the adversary only has access to the finetuned model and would like to infer the membership of the pretraining data. To realize this threat model, we implement a novel metaclassifier-based attack, TMI. We evaluate TMI on both vision and natural language tasks across multiple transfer learning settings, including finetuning with differential privacy. Through our evaluation, we find that TMI can successfully infer membership of pretraining examples using query access to the finetuned model. |
🔗 |
-
|
A First Order Meta Stackelberg Method for Robust Federated Learning
(
Oral
)
>
link
Previous research has shown that federated learning (FL) systems are exposed to an array of security risks. Despite the proposal of several defensive strategies, they tend to be non-adaptive and specific to certain types of attacks, rendering them ineffective against unpredictable or adaptive threats. This work models adversarial federated learning as a Bayesian Stackelberg Markov game (BSMG) to capture the defender's incomplete information of various attack types. We propose meta-Stackelberg learning (meta-SL), a provably efficient meta-learning algorithm, to solve the equilibrium strategy in BSMG, leading to an adaptable FL defense. We demonstrate that meta-SL converges to the first-order $\varepsilon$-equilibrium point in $O(\varepsilon^{-2})$ gradient iterations, with $O(\varepsilon^{-4})$ samples needed per iteration, matching the state of the art. Empirical evidence indicates that our meta-Stackelberg framework performs exceptionally well against potent model poisoning and backdoor attacks of an uncertain nature.
|
🔗 |
-
|
Backdoor Attacks for In-Context Learning with Language Models
(
Oral
)
>
link
Because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally, we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone. |
🔗 |
-
|
R-LPIPS: An Adversarially Robust Perceptual Similarity Metric
(
Oral
)
>
link
Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. |
🔗 |
-
|
Risk-Averse Predictions on Unseen Domains via Neural Style Smoothing
(
Oral
)
>
link
Achieving high accuracy on data from domains unseen during training is a fundamental challenge in machine learning. While state-of-the-art neural networks have achieved impressive performance on various tasks, their predictions are biased towards domain-dependent information (e.g., image styles) rather than domain-invariant information (e.g., image content). This makes them unreliable for deployment in risk-sensitive settings such as autonomous driving. In this work, we propose a novel inference procedure, Test-Time Neural Style Smoothing (TT-NSS), that produces risk-averse predictions using a ``style smoothed'' version of a classifier. Specifically, the style smoothed classifier classifies a test image as the most probable class predicted by the original classifier on random re-stylizations of the test image. TT-NSS uses a neural style transfer module to stylize the test image on the fly, requires only black-box access to the classifier, and, crucially, abstains when predictions of the original classifier on the stylized images lack consensus. We further propose a neural style smoothing-based training procedure that improves the prediction consistency and the performance of the style-smoothed classifier on non-abstained samples. Our experiments on the PACS dataset and its variations, in both single and multiple domain settings, highlight the effectiveness of our methods at producing risk-averse predictions on unseen domains. |
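A minimal sketch of the test-time style-smoothing idea described in this abstract. It assumes a user-supplied stylize(image) routine (e.g., a neural style transfer network) and black-box query access to classifier_fn; the names and the consensus threshold are illustrative assumptions, not the authors' implementation.

import numpy as np

def tt_nss_predict(image, classifier_fn, stylize, n_samples=32, consensus=0.6):
    """Classify `image` as the most frequent prediction over random re-stylizations.

    classifier_fn: maps an image to a predicted class id (black-box access).
    stylize:       returns a randomly re-stylized copy of the image (assumed helper).
    consensus:     minimum fraction of agreeing votes required; otherwise abstain.
    """
    votes = [classifier_fn(stylize(image)) for _ in range(n_samples)]
    classes, counts = np.unique(votes, return_counts=True)
    top = counts.argmax()
    if counts[top] / n_samples < consensus:
        return None  # abstain: stylized predictions lack consensus
    return int(classes[top])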
🔗 |
-
|
A Simple and Yet Fairly Effective Defense for Graph Neural Networks
(
Oral
)
>
link
Graph neural networks (GNNs) have become the standard approach for performing machine learning on graphs. However, concerns have been raised regarding their vulnerability to small adversarial perturbations. Existing defense methods suffer from high time complexity and can negatively impact the model's performance on clean graphs. In this paper, we propose NoisyGCN, a defense method that injects noise into the GCN architecture. We derive a mathematical upper bound linking GCN's robustness to noise injection, establishing our method's effectiveness. Through empirical evaluations on the node classification task, we demonstrate superior or comparable performance to existing methods while minimizing the added time complexity. |
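A minimal PyTorch sketch of the noise-injection idea the abstract describes; the placement and scale of the Gaussian noise are illustrative assumptions rather than the exact NoisyGCN design.

import torch
import torch.nn as nn

class NoisyGCNLayer(nn.Module):
    """Graph convolution layer that adds Gaussian noise to hidden features (sketch)."""

    def __init__(self, in_dim, out_dim, noise_std=0.1):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.noise_std = noise_std

    def forward(self, adj_norm, x):
        # adj_norm: normalized adjacency matrix (N x N); x: node features (N x in_dim)
        h = adj_norm @ self.linear(x)
        if self.training:
            h = h + self.noise_std * torch.randn_like(h)  # noise injection for robustness
        return torch.relu(h)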
🔗 |
-
|
Incentivizing Honesty among Competitors in Collaborative Learning
(
Oral
)
>
link
Collaborative learning techniques have the potential to enable training machine learning models that are superior to models trained on a single entity’s data. However, in many cases, potential participants in such collaborative schemes are competitors on a downstream task, such as firms that each aim to attract customers by providing the best recommendations. This can incentivize dishonest updates that damage other participants' models, potentially undermining the benefits of collaboration. In this work, we formulate a game that models such interactions and study two learning tasks within this framework: single-round mean estimation and multi-round SGD on strongly-convex objectives. For a natural class of player actions, we show that rational clients are incentivized to strongly manipulate their updates, thus preventing learning. We then propose mechanisms that incentivize honest communication and ensure learning quality comparable to full cooperation. Our work shows that explicitly modeling the incentives and actions of dishonest clients, rather than assuming them malicious, can enable strong robustness guarantees for collaborative learning. |
🔗 |
-
|
Towards Effective Data Poisoning for Imbalanced Classification
(
Oral
)
>
link
Targeted Clean-label Data Poisoning Attacks (TCDPA) aim to manipulate training samples in a label-consistent manner to gain malicious control over targeted samples' output during deployment. A prominent class of TCDPA methods, gradient-matching based data-poisoning methods, utilizes a small subset of training-class samples to match the poisoned gradient of a target sample. However, their effectiveness is limited when attacking imbalanced datasets because of gradient mismatch caused by training-time data balancing techniques such as Re-weighting and Re-sampling. In this paper, we propose two modifications that eliminate this gradient mismatch and thereby enhance the efficacy of gradient-matching-based TCDPA on imbalanced datasets. Our methods achieve notable improvements of up to 32% (Re-sampling) and 51% (Re-weighting) in terms of Attack Effect Success Rate on MNIST and CIFAR10. |
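A hedged sketch of the gradient-matching objective used by this family of attacks: poison examples are optimized so that their training gradient aligns (in cosine similarity) with the adversarial gradient of the target. The function names are hypothetical, and this shows only the basic objective, not the paper's modifications for imbalanced data.

import torch
import torch.nn.functional as F

def gradient_matching_loss(model, poison_x, poison_y, target_x, target_y_adv):
    """Negative cosine similarity between the poison-batch gradient and the
    adversarial target gradient (sketch of the gradient-matching objective).

    poison_x is the differentiable tensor of poisoned inputs being optimized
    (i.e., it should have requires_grad=True in the outer optimization loop).
    """
    params = [p for p in model.parameters() if p.requires_grad]

    target_loss = F.cross_entropy(model(target_x), target_y_adv)
    g_target = torch.autograd.grad(target_loss, params, retain_graph=True)

    poison_loss = F.cross_entropy(model(poison_x), poison_y)
    g_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    num = sum((gt * gp).sum() for gt, gp in zip(g_target, g_poison))
    den = (torch.sqrt(sum((gt ** 2).sum() for gt in g_target)) *
           torch.sqrt(sum((gp ** 2).sum() for gp in g_poison)) + 1e-12)
    return 1.0 - num / den  # minimized when the two gradients align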
🔗 |
-
|
Black Box Adversarial Prompting for Foundation Models
(
Oral
)
>
link
Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or generating high perplexity text. |
🔗 |
-
|
Exposing the Fake: Effective Diffusion-Generated Images Detection
(
Oral
)
>
link
Image synthesis has seen significant advancements with the advent of diffusion-based generative models like Denoising Diffusion Probabilistic Models (DDPM) and text-to-image diffusion models. Despite their efficacy, there is a dearth of research dedicated to detecting diffusion-generated images, which could pose potential security and privacy risks. This paper addresses this gap by proposing a novel detection method called Stepwise Error for Diffusion-generated Image Detection (SeDID). Comprising statistical-based SeDID and neural network-based SeDID, SeDID exploits the unique attributes of diffusion models, namely deterministic reverse and deterministic denoising computation errors. Our evaluations demonstrate SeDID's superior performance over existing methods when applied to diffusion models. Thus, our work makes a pivotal contribution to distinguishing diffusion model-generated images, marking a significant step in the domain of artificial intelligence security. |
🔗 |
-
|
AdversNLP: A Practical Guide to Assessing NLP Robustness Against Text Adversarial Attacks
(
Oral
)
>
link
The emergence of powerful language models in natural language processing (NLP) has sparked a wave of excitement about their potential to revolutionize decision-making. However, this excitement should be tempered by their vulnerability to adversarial attacks, which are carefully perturbed inputs able to fool the model into inaccurate decisions. In this paper, we present AdversNLP, a practical framework to assess the robustness of NLP applications against text-based adversaries. Our framework combines and extends the technical capabilities of established NLP adversarial attack tools (i.e., TextAttack) and tailors an audit guide to navigate the landscape of threats to NLP applications. AdversNLP illustrates best practices and vulnerabilities through customized attack recipes and presents evaluation metrics in the form of key performance indicators (KPIs). Our study demonstrates the severity of the threat posed by adversarial attacks and the need for more initiatives bridging the gap between research contributions and industrial applications. |
🔗 |
-
|
Proximal Compositional Optimization for Distributionally Robust Learning
(
Oral
)
>
link
Recently, compositional optimization (CO) has gained popularity because of its applications in distributionally robust optimization (DRO) and many other machine learning problems. Often, (non-smooth) regularization terms are added to an objective to impose some structure and/or improve the generalization performance of the learned model. However, when it comes to CO, there is a lack of efficient algorithms that can solve regularized CO problems. Moreover, current state-of-the-art methods for solving such problems rely on the computation of large batch gradients (with batch size depending on the solution accuracy), which is not feasible in most practical settings. To address these challenges, in this work, we consider a certain regularized version of the CO problem that often arises in DRO formulations and develop a proximal algorithm for solving it. We perform a Moreau envelope-based analysis and establish that, without the need to compute large batch gradients, the proposed algorithm achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity, which matches the vanilla SGD guarantees for solving non-CO problems. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.
|
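A minimal worked illustration of one proximal stochastic step for an $\ell_1$-regularized objective of the form smooth$(x) + \lambda\|x\|_1$: a stochastic gradient step on the smooth (compositional) part followed by soft-thresholding, the proximal operator of the $\ell_1$ norm. This is a generic sketch of the proximal mechanism, not the paper's full algorithm or its Moreau-envelope analysis.

import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_sgd_step(x, grad_smooth, lr, lam):
    """One proximal stochastic gradient step for F(x) = smooth(x) + lam * ||x||_1.

    grad_smooth: stochastic gradient of the smooth (compositional) part at x.
    """
    x = x - lr * grad_smooth            # gradient step on the smooth part
    return soft_threshold(x, lr * lam)  # proximal step handles the non-smooth term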
🔗 |
-
|
PIAT: Parameter Interpolation based Adversarial Training for Image Classification
(
Oral
)
>
link
Adversarial training has been demonstrated to be the most effective approach to defend against adversarial attacks. However, existing adversarial training methods show apparent oscillations and overfitting issues in the training process, degrading the defense efficacy. In this work, we propose a novel framework, termed Parameter Interpolation based Adversarial Training (PIAT), that makes full use of the historical information during training. Specifically, at the end of each epoch, PIAT tunes the model parameters as the interpolation of the parameters of the previous and current epochs. Besides, we suggest using the Normalized Mean Square Error (NMSE) to further improve robustness by aligning the relative, rather than absolute, magnitude of logits between clean and adversarial examples. Extensive experiments on several benchmark datasets and various networks show that our framework can prominently improve model robustness and reduce the generalization error. |
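A minimal PyTorch sketch of the two ingredients described above; the interpolation coefficient and the exact NMSE normalization shown here are illustrative assumptions. Here prev_state would be a snapshot {name: p.detach().clone()} of the parameters taken at the start of the epoch.

import torch
import torch.nn.functional as F

@torch.no_grad()
def interpolate_parameters(model, prev_state, beta=0.5):
    """At the end of an epoch, set parameters to an interpolation of the
    previous and current epoch's parameters (sketch of the PIAT update)."""
    for name, p in model.named_parameters():
        p.copy_(beta * prev_state[name] + (1.0 - beta) * p)

def nmse_loss(clean_logits, adv_logits, eps=1e-8):
    """Normalized mean squared error between clean and adversarial logits,
    aligning their relative rather than absolute magnitudes (sketch)."""
    c = clean_logits / (clean_logits.norm(dim=1, keepdim=True) + eps)
    a = adv_logits / (adv_logits.norm(dim=1, keepdim=True) + eps)
    return F.mse_loss(a, c)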
🔗 |
-
|
Mathematical Theory of Adversarial Deep Learning
(
Oral
)
>
link
In this Show-and-Tell Demos paper, progress on mathematical theories for adversarial deep learning is reported. First, achieving robust memorization for certain neural networks is shown to be an NP-hard problem. Furthermore, neural networks with $O(Nn)$ parameters are constructed for optimal robust memorization of any dataset with dimension $n$ and size $N$ in polynomial time. Second, adversarial training is formulated as a Stackelberg game and is shown to result in a network with optimal adversarial accuracy when the Carlini-Wagner margin loss is used. Finally, the bias classifier is introduced and is shown to be information-theoretically secure against the original-model gradient-based attack.
|
🔗 |
-
|
Adapting Robust Reinforcement Learning to Handle Temporally-Coupled Perturbations
(
Oral
)
>
link
Recent years have witnessed the development of robust training to defend against the vulnerability of RL policies. Existing threat models impose static constraints on perturbations at each timestep and overlook the temporal influence of past perturbations on the current ones, despite its crucial consideration in many real-world scenarios. We formally introduce temporally-coupled attacks to account for the temporal coupling between perturbations at consecutive time steps, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic response approach that treats the temporally-coupled robust RL problem as a partially-observable two-player game. By finding an approximate equilibrium in our approach, GRAD ensures the agent's robustness against the learned adversary. Empirical experiments on a variety of continuous control tasks demonstrate that our proposed approach exhibits significant robustness advantages compared to baselines against both standard and temporally-coupled attacks, in both the state and action spaces. |
🔗 |
-
|
Navigating Graph Robust Learning against All-Intensity Attacks
(
Oral
)
>
link
Graph Neural Networks have demonstrated exceptional performance in a variety of graph learning tasks, but their vulnerability to adversarial attacks remains a major concern. Accordingly, many defense methods have been developed to learn robust graph representations and mitigate the impact of adversarial attacks. However, most of the existing methods suffer from two major drawbacks: (i) their robustness degrades under higher-intensity attacks, and (ii) they cannot scale to large graphs. In light of this, we develop a novel graph defense method to address these limitations. Our method first applies a denoising module to recover a cleaner graph by removing edges associated with attacked nodes; it then utilizes a Mixture-of-Experts to select differentially private noises of different magnitudes to counteract node features attacked at different intensities. In addition, the overall design of our method avoids relying on heavy adjacency matrix computations such as SVD, enabling the framework's applicability to large graphs. |
🔗 |
-
|
Towards Out-of-Distribution Adversarial Robustness
(
Oral
)
>
link
Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4\% for the best existing baseline to 25.9\% on MNIST, and from 16.9\% to 23.5\% on CIFAR10.
|
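A sketch of how Risk Extrapolation can be applied with each attack type treated as a domain: the training loss is the mean per-attack risk plus a penalty on the variance of those risks. The helper names and the penalty weight are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F

def rex_adversarial_loss(model, x, y, attacks, beta=10.0):
    """REx over attack 'domains': mean per-attack risk plus a variance penalty.

    attacks: list of callables, each mapping (model, x, y) -> adversarial x.
    """
    risks = []
    for attack in attacks:
        x_adv = attack(model, x, y)
        risks.append(F.cross_entropy(model(x_adv), y))
    risks = torch.stack(risks)
    return risks.mean() + beta * risks.var()  # variance term equalizes robustness across attacks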
🔗 |
-
|
Generalizable Lightweight Proxy for Robust NAS against Diverse Perturbations
(
Oral
)
>
link
Recent neural architecture search (NAS) frameworks have been successful in finding optimal architectures for given conditions (e.g., performance or latency). However, they search for optimal architectures in terms of their performance on clean images only, while robustness against various types of perturbations or corruptions is crucial in practice. Although several robust NAS frameworks tackle this issue by integrating adversarial training into one-shot NAS, they are limited in that they only consider robustness against adversarial attacks and require significant computational resources to discover optimal architectures for a single task, which makes them impractical in real-world scenarios. To address these challenges, we propose a novel lightweight robust zero-cost proxy that considers the consistency across features, parameters, and gradients of both clean and perturbed images at the initialization state. Our approach facilitates an efficient and rapid search for neural architectures capable of learning generalizable features that exhibit robustness across diverse perturbations. The experimental results demonstrate that our proxy can rapidly and efficiently search for neural architectures that are consistently robust against various perturbations on multiple benchmark datasets and diverse search spaces, largely outperforming existing clean zero-shot NAS and robust NAS with reduced search cost. |
🔗 |
-
|
Adversarial Robustness for Tabular Data through Cost and Utility Awareness
(
Oral
)
>
link
Many machine learning applications (credit scoring, fraud detection, etc.) use data in the tabular domains. Adversarial examples can be especially damaging for these applications. Yet, existing works on adversarial robustness mainly focus on machine-learning models in the image and text domains. We argue that due to the differences between tabular data and images or text, existing threat models are inappropriate for tabular domains. These models do not capture that cost can be more important than imperceptibility, nor that the adversary could ascribe different value to the utility obtained from deploying different adversarial examples. We show that due to these differences the attack and defense methods used for images and text cannot be directly applied to the tabular setup. We address these issues by proposing new cost and utility-aware threat models tailored to capabilities and constraints of attackers targeting tabular domains. We show that our approach is effective on two tabular datasets corresponding to applications for which attacks can have economic and social implications. |
🔗 |
-
|
Scoring Black-Box Models for Adversarial Robustness
(
Oral
)
>
link
Deep neural networks are susceptible to adversarial inputs, and various methods have been proposed to defend these models against adversarial attacks under different perturbation models. The robustness of models to adversarial attacks has been analyzed by first constructing adversarial inputs for the model and then testing the model's performance on the constructed adversarial inputs. Most of these attacks require white-box access to the model, need access to data labels, and finding adversarial inputs can be computationally expensive. We propose a simple scoring method for black-box models that indicates their robustness to adversarial inputs. We show that adversarially more robust models have a smaller $l_1$-norm of \textsc{Lime} weights and sharper explanations.
|
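A rough sketch of the scoring idea: fit a sparse local linear surrogate around an input (as LIME does) and use the $\ell_1$ norm of the surrogate weights as a robustness indicator, smaller norms suggesting a more robust model. The surrogate below is a hand-rolled Lasso fit rather than the LIME library, and the Gaussian sampling scheme is an assumption.

import numpy as np
from sklearn.linear_model import Lasso

def local_l1_score(predict_fn, x, n_samples=500, sigma=0.1, alpha=0.01):
    """L1 norm of local linear surrogate weights around x; per the scoring idea
    above, a lower value may indicate higher adversarial robustness.

    predict_fn: black-box model returning a scalar score for a batch of inputs.
    """
    x = np.asarray(x, dtype=float)
    perturbations = x + sigma * np.random.randn(n_samples, x.size)
    targets = predict_fn(perturbations)
    surrogate = Lasso(alpha=alpha).fit(perturbations - x, targets)
    return float(np.abs(surrogate.coef_).sum())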
🔗 |
-
|
When Can Linear Learners be Robust to Indiscriminate Poisoning Attacks?
(
Oral
)
>
link
We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error. Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners. For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum risk of the induced model at a given poisoning budget. Our results prove that linear learners can indeed be robust to indiscriminate poisoning if the class-wise data distributions are well-separated with low variance and the size of the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in empirical attack performance of the state-of-the-art poisoning attacks across benchmark datasets, making an important initial step towards understanding the underlying reasons some learning tasks are vulnerable to data poisoning attacks. |
🔗 |
-
|
Context-Aware Self-Adaptation for Domain Generalization
(
Oral
)
>
link
Domain generalization aims at developing suitable learning algorithms in source training domains such that the model learned can generalize well on a different, unseen testing domain. We present a novel two-stage approach called Context-Aware Self-Adaptation (CASA) for domain generalization. CASA simulates an approximate meta-generalization scenario and incorporates a self-adaptation module to adjust pre-trained meta-source models to the meta-target domains while maintaining their predictive capability on the meta-source domains. The core concept of self-adaptation involves leveraging contextual information, such as the mean of mini-batch features, as domain knowledge to automatically adapt a model trained in the first stage to new contexts in the second stage. Lastly, we utilize an ensemble of multiple meta-source models to perform inference on the testing domain. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on standard benchmarks. |
🔗 |
-
|
Label Noise: Correcting a Correction Loss
(
Oral
)
>
link
Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels. To address this issue, researchers have explored alternative loss functions that aim to be more robust. However, many of these alternatives are heuristic in nature and still vulnerable to overfitting or underfitting. In this work, we propose a more direct approach to tackling overfitting caused by label noise. We observe that the presence of label noise implies a lower bound on the noisy generalised risk. Building upon this observation, we propose imposing a lower bound on the empirical risk during training to mitigate overfitting. Our main contribution is providing theoretical results that yield explicit, easily computable bounds on the minimum achievable noisy risk for different loss functions. We empirically demonstrate that using these bounds significantly enhances robustness in various settings, with virtually no additional computational cost. |
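A sketch of how a known lower bound on the achievable noisy risk can be imposed during training: the mini-batch loss is not pushed below the bound, and the update direction flips when it falls under it (similar in spirit to the "flooding" trick). The bound value b would come from the paper's theoretical formulas; here it is simply a parameter.

def bounded_risk_loss(loss, b):
    """Keep the empirical (noisy) risk above a lower bound b during training.

    loss: scalar mini-batch loss (a torch tensor); b: theoretical lower bound.
    The absolute-value trick ascends when loss < b and descends when loss > b.
    """
    return (loss - b).abs() + b

# usage sketch: bounded_risk_loss(criterion(model(x), y_noisy), b).backward()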
🔗 |
-
|
Robust Semantic Segmentation: Strong Adversarial Attacks and Fast Training of Robust Models
(
Oral
)
>
link
While a large amount of work has focused on designing adversarial attacks against image classifiers, only a few methods exist to attack semantic segmentation models. We show that attacking segmentation models presents task-specific challenges, for which we propose novel solutions. Our final evaluation protocol outperforms existing methods, and shows that those can overestimate the robustness of the models. Additionally, so far adversarial training, the most successful way for obtaining robust image classifiers, could not be successfully applied to semantic segmentation. We argue that this is because the task to be learned is more challenging, and requires significantly higher computational effort than for image classification. As a remedy, we show that by taking advantage of recent advances in robust ImageNet classifiers, one can train adversarially robust segmentation models at limited computational cost by fine-tuning robust backbones. |
🔗 |
-
|
Model-tuning Via Prompts Makes NLP Models Adversarially Robust
(
Oral
)
>
link
In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations, such as word-level synonym substitutions. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than modifying the model (by appending an MLP head), MVP instead modifies the input (by appending a prompt template). Across three classification datasets, MVP improves performance against adversarial word-level synonym substitutions by an average of 8% over standard methods and even outperforms adversarial training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in robust accuracy while maintaining clean accuracy. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of MLP-FT can be attributed to the misalignment between pre-training and fine-tuning tasks, and the randomly initialized MLP parameters. |
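A minimal sketch of prompt-based classification with a masked language model, the mechanism MVP builds on: the input is wrapped in a template containing a mask token, and class scores are read off the mask position's logits for a small verbalizer vocabulary. The model name, template, and verbalizer below are placeholders, not the paper's exact choices.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
verbalizer = {"negative": " terrible", "positive": " great"}  # placeholder verbalizer

def prompt_classify(text):
    """Score each class by the MLM logit of its verbalizer token at the mask."""
    prompt = f"{text} It was {tokenizer.mask_token}."  # placeholder template
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    scores = {label: logits[tokenizer.encode(tok, add_special_tokens=False)[0]].item()
              for label, tok in verbalizer.items()}
    return max(scores, key=scores.get)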
🔗 |
-
|
Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness
(
Oral
)
>
link
One of the remarkable properties of robust computer vision models is that their input-gradients are often aligned with human perception, referred to in the literature as perceptually-aligned gradients (PAGs). However, the underlying mechanisms behind these phenomena remain unknown. In this work, we provide a first explanation of PAGs via \emph{off-manifold robustness}, which states that models must be more robust off the data manifold than they are on-manifold. We first demonstrate theoretically that off-manifold robustness leads input gradients to lie approximately on the data manifold, explaining their perceptual alignment, and then confirm the same empirically for models trained with robustness regularizers. Quantifying the perceptual alignment of model gradients via their similarity with the gradients of generative models, we show that off-manifold robustness correlates well with perceptual alignment. Finally, based on the levels of on- and off-manifold robustness, we identify three different regimes of robustness that affect both perceptual alignment and model accuracy: weak robustness, Bayes-aligned robustness, and excessive robustness. |
🔗 |
-
|
Refined and Enriched Physics-based Captions For Unseen Dynamic Changes
(
Oral
)
>
link
Vision-Language Models (VLMs), e.g., CLIP trained on image-text pairs, have boosted image-based Deep Learning (DL). Unseen images can be handled by transferring semantic knowledge from seen classes with the help of language models pre-trained only on text. Two-dimensional spatial relationships and a higher semantic level have been achieved. Moreover, Visual Question Answering (VQA) tools and open-vocabulary semantic segmentation provide more detailed scene descriptions, i.e., qualitative texts, in captions. However, the capability of VLMs still falls far short of human perception. This paper proposes PanopticCAP for refined and enriched qualitative and quantitative captions that come closer to what humans recognize by combining multiple DLs and VLMs. In particular, captions with physical scales and objects' surface properties are integrated via counting, visibility distance, and road conditions. Fine-tuned VLM models are also used, together with an iteratively refined caption model trained with a new physics-based contrastive loss function. Experimental results on images with adversarial weather conditions, i.e., rain, snow, fog, landslide, and flooding, and traffic events, i.e., accidents, outperform state-of-the-art DLs and VLMs. A higher semantic level in captions for real-world scene descriptions is shown. |
🔗 |
-
|
Adaptive Certified Training: Towards Better Accuracy-Robustness Tradeoffs
(
Oral
)
>
link
As deep learning models continue to advance and are increasingly utilized in real-world systems, the issue of robustness remains a major challenge. Existing certified training methods produce models that achieve high provable robustness guarantees at certain perturbation levels. However, the main problem of such models is a dramatically low standard accuracy, i.e. accuracy on clean unperturbed data, that makes them impractical. In this work, we consider a more realistic perspective of maximizing the robustness of a model at certain levels of (high) standard accuracy. To this end, we propose a novel certified training method based on a key insight that training with adaptive certified radii helps to improve both the accuracy and robustness of the model, advancing state-of-the-art accuracy-robustness tradeoffs. We demonstrate the effectiveness of the proposed method on MNIST, CIFAR-10, and TinyImageNet datasets. Particularly, on CIFAR-10 and TinyImageNet, our method yields models with up to two times higher robustness, measured as an average certified radius of a test set, at the same levels of standard accuracy compared to baseline approaches. |
🔗 |
-
|
Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers
(
Oral
)
>
link
Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of temporal consistency makes them \textit{detectable} using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce \textit{perfect illusory attacks}, a novel form of adversarial attack on sequential decision-makers that is both effective and provably \textit{statistically undetectable}. We then propose a more versatile variant of these attacks, which results in observation transitions that are consistent with the state-transition function of the adversary-free environment and can be learned end-to-end. Compared to existing attacks, we empirically find this variant to be significantly harder to detect with automated methods, and a small study with human subjects (IRB approval under reference xxxxxx/xxxxx) suggests it is similarly harder for humans to detect. We propose that undetectability should be a central concern in the study of adversarial attacks on mixed-autonomy settings. |
🔗 |
-
|
Certified Calibration: Bounding Worst-Case Calibration under Adversarial Attacks
(
Oral
)
>
link
Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, certification methods have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. However, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier Score or the Expected Calibration Error. We show that attacks can significantly harm calibration, and thus propose certified calibration, providing worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds, via the solution of a mixed-integer program, on the Expected Calibration Error. |
🔗 |
-
|
Don't trust your eyes: on the (un)reliability of feature visualizations
(
Oral
)
>
link
How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding by theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include black-box neural networks. |
🔗 |
-
|
Classifier Robustness Enhancement Via Test-Time Transformation
(
Oral
)
>
link
It has been recently discovered that adversarially trained classifiers exhibit an intriguing property, referred to as perceptually aligned gradients (PAG). PAG implies that the gradients of such classifiers possess a meaningful structure, aligned with human perception. Adversarial training is currently the best-known way to achieve classification robustness under adversarial attacks. The PAG property, however, has yet to be leveraged for further improving classifier robustness. In this work, we introduce Classifier Robustness Enhancement Via Test-Time Transformation (TETRA) -- a novel defense method that utilizes PAG, enhancing the performance of trained robust classifiers. Our method operates in two phases. First, it modifies the input image via a designated targeted adversarial attack into each of the dataset's classes. Then, it classifies the input image based on the distance to each of the modified instances, with the assumption that the shortest distance relates to the true class. We show that the proposed method achieves state-of-the-art results and validate our claim through extensive experiments on a variety of defense methods, classifier architectures, and datasets. We also empirically demonstrate that TETRA can boost the accuracy of any differentiable adversarial training classifier across a variety of attacks, including ones unseen at training. Specifically, applying TETRA leads to substantial improvement of up to $+23\%$, $+20\%$, and $+26\%$ on CIFAR10, CIFAR100, and ImageNet, respectively.
|
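A rough PyTorch sketch of the two-phase procedure described above: a targeted attack (here a few PGD-like gradient steps, an assumed stand-in for the paper's designated attack) pulls the input toward each class, and the class whose modified instance is closest to the original input is predicted.

import torch
import torch.nn.functional as F

def tetra_predict(model, x, num_classes, steps=10, step_size=0.01):
    """Classify x by the class whose targeted modification moves x the least."""
    distances = []
    for c in range(num_classes):
        target = torch.full((x.shape[0],), c, dtype=torch.long, device=x.device)
        x_adv = x.clone().detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), target)
            grad, = torch.autograd.grad(loss, x_adv)
            x_adv = (x_adv - step_size * grad.sign()).detach()  # targeted step toward class c
        distances.append((x_adv - x).flatten(1).norm(dim=1))
    return torch.stack(distances, dim=1).argmin(dim=1)  # shortest distance -> predicted class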
🔗 |
-
|
CertViT: Certified Robustness of Pre-Trained Vision Transformers
(
Oral
)
>
link
Lipschitz bounded neural networks are certifiably robust and have a good trade-off between clean and certified accuracy. Existing Lipschitz bounding methods train from scratch and are limited to moderately sized networks (< 6M parameters). They require a fair amount of hyper-parameter tuning and are computationally prohibitive for large networks like Vision Transformers (5M to 660M parameters). Obtaining certified robustness of transformers is thus not feasible due to the non-scalability and inflexibility of current methods. This work presents CertViT, a two-step proximal-projection method to achieve certified robustness from pre-trained weights. The proximal step tries to lower the Lipschitz bound, and the projection step tries to maintain the clean accuracy of the pre-trained weights. We show that CertViT networks have better certified accuracy than state-of-the-art Lipschitz-trained networks. We apply CertViT to several variants of pre-trained vision transformers and show adversarial robustness using standard attacks. Code: \url{https://github.com/sagarverma/transformer-lipschitz} |
🔗 |
-
|
Transferable Adversarial Perturbations between Self-Supervised Speech Recognition Models
(
Oral
)
>
link
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition (ASR) system to output attacker-chosen text. To exploit ASR models in real-world, black-box settings, an adversary can leverage the \textit{transferability} property, i.e., that an adversarial sample produced for a proxy ASR can also fool a different remote ASR. Recent work has shown that transferability against large ASR models is extremely difficult. In this work, we show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are uniquely affected by transferability. We successfully demonstrate this phenomenon by evaluating state-of-the-art self-supervised ASR models like Wav2Vec2, HuBERT, Data2Vec and WavLM. We show that with relatively low-level additive noise achieving a 30 dB Signal-to-Noise Ratio, we can achieve target transferability with up to 80\% accuracy. We then use an ablation study to show that Self-Supervised Learning is a major cause of this phenomenon. Our results present a dual interest: they show that modern ASR architectures are uniquely vulnerable to adversarial security threats, and they help in understanding the specificities of SSL training paradigms. |
🔗 |
-
|
Tunable Dual-Objective GANs for Stable Training
(
Oral
)
>
link
In an effort to address the training instabilities of GANs, we introduce a class of dual-objective GANs with different value functions (objectives) for the generator (G) and discriminator (D). In particular, we model each objective using $\alpha$-loss, a tunable classification loss, to obtain $(\alpha_D,\alpha_G)$-GANs, parameterized by $(\alpha_D,\alpha_G)\in (0,\infty]^2$. For sufficiently large number of samples and capacities for G and D, we show that the resulting non-zero sum game simplifies to minimizing an $f$-divergence under appropriate conditions on $(\alpha_D,\alpha_G)$. We highlight the value of tuning $(\alpha_D,\alpha_G)$ in alleviating training instabilities for the synthetic 2D Gaussian mixture ring, the Celeb-A, and the LSUN Classroom datasets.
|
🔗 |
-
|
MLSMM: Machine Learning Security Maturity Model
(
Oral
)
>
link
Assessing the maturity of security practices during the development of Machine Learning (ML) based software components has not gotten as much attention as traditional software development. In this Blue Sky idea paper, we propose an initial Machine Learning Security Maturity Model (MLSMM) which organizes security practices along the ML development lifecycle and, for each, establishes three levels of maturity. We envision MLSMM as a step towards closer collaboration between industry and academia. |
🔗 |
-
|
Adversarial Training Should Be Cast as a Non-Zero-Sum Game
(
Oral
)
>
link
One prominent approach toward resolving the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially-chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not engendered sufficient levels of robustness, and suffer from pathological behavior like robust overfitting. To understand this shortcoming, we first show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation naturally yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains comparable levels of robustness to standard adversarial training algorithms, and does not suffer from robust overfitting. |
🔗 |
-
|
Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change
(
Oral
)
>
link
Adversarial robustness poses a critical challenge in the deployment of deep learning models for real-world applications. Traditional approaches to adversarial training and supervised detection rely on prior knowledge of attack types and access to labeled training data, which is often impractical. Existing unsupervised adversarial detection methods identify whether the target model works properly, but they suffer from poor accuracy owing to the use of the common cross-entropy training loss, which relies on unnecessary features and strengthens adversarial attacks. We propose new training losses that reduce reliance on such features, together with a corresponding detection method that requires no prior knowledge of adversarial attacks. The detection rate (true positive rate) against all given white-box attacks is above 93.9\% except for attacks without limits (DF($\infty$)), while the false positive rate is barely 2.5\%. The proposed method works well across all tested attack types, and its false positive rates are even better than those of methods specialized for certain attack types.
|
🔗 |
-
|
Stabilizing GNN for Fairness via Lipschitz Bounds
(
Oral
)
>
link
The Lipschitz bound, a technique from robust statistics, limits the maximum changes in output with respect to the input, considering associated irrelevant biased factors. It provides an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. However, there has been no previous research investigating the Lipschitz bounds for Graph Neural Networks (GNNs), especially in the context of non-Euclidean data with inherent biases. This poses a challenge for constraining GNN output perturbations induced by input biases and ensuring fairness during training. This paper addresses this gap by formulating a Lipschitz bound for GNNs operating on attributed graphs, and analyzing how the Lipschitz constant can constrain output perturbations induced by biases for fairness training. The effectiveness of the Lipschitz bound is experimentally validated in limiting model output biases. Additionally, from a training dynamics perspective, we demonstrate how the theoretical Lipschitz bound can effectively guide GNN training to balance accuracy and fairness. |
🔗 |
-
|
Equal Long-term Benefit Rate: Adapting Static Fairness Notions to Sequential Decision Making
(
Oral
)
>
link
Decisions made by machine learning models may have lasting impacts over time, making long-term fairness a crucial consideration. It has been shown that, when ignoring the long-term effect of decisions, naively imposing fairness criteria in static settings can actually exacerbate bias over time. To explicitly address biases in sequential decision-making, recent works formulate long-term fairness notions in the Markov Decision Process (MDP) framework. They define the long-term bias to be the sum of static bias over each time step. However, we demonstrate that naively summing up the step-wise bias can cause a false sense of fairness since it fails to consider the difference in importance of states during transition. In this work, we introduce a new long-term fairness notion called Equal Long-term Benefit Rate (ELBERT), which explicitly considers state importance and can preserve the semantics of static fairness principles in the sequential setting. Moreover, we show that the policy gradient of the Long-term Benefit Rate can be analytically reduced to a standard policy gradient. This makes standard policy optimization methods applicable for reducing the bias, leading to our proposed bias mitigation method ELBERT-PO. Experiments on three dynamical environments show that ELBERT-PO successfully reduces bias and maintains high utility. |
🔗 |
-
|
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
(
Oral
)
>
link
Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks of these models for malicious purposes. Dual use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security, and the capabilities of these instruction-following LLMs provide strong economic incentives for dual use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at a cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and that addressing these attacks may require new approaches to mitigations. |
🔗 |
-
|
Certifying Ensembles: A General Certification Theory with S-Lipschitzness
(
Oral
)
>
link
Improving and guaranteeing the robustness of deep learning models has been a topic of intense research. Ensembling, which combines several classifiers to provide a better model, has been shown to be beneficial for generalisation, uncertainty estimation, calibration, and mitigating the effects of concept drift. However, the impact of ensembling on certified robustness is less well understood. In this work, we generalise Lipschitz continuity by introducing S-Lipschitz classifiers, which we use to analyse the theoretical robustness of ensembles. Our results give precise conditions under which ensembles of robust classifiers are more robust than any constituent classifier, as well as conditions under which they are less robust. |
🔗 |
-
|
On the Limitations of Model Stealing with Uncertainty Quantification Models
(
Oral
)
>
link
Model stealing aims at inferring a victim model's functionality at a fraction of the original training cost. While the goal is clear, in practice the model's architecture, weight dimension, and original training data cannot be determined exactly, leading to mutual uncertainty during stealing. In this work, we explicitly tackle this uncertainty by generating multiple possible networks and combining their predictions to improve the quality of the stolen model. For this, we compare five popular uncertainty quantification models in a model stealing task. Surprisingly, our results indicate that the considered models only lead to marginal improvements in terms of label agreement (i.e., fidelity) to the stolen model. To find the cause of this, we inspect the diversity of the models' predictions by looking at the prediction variance as a function of training iterations. We realize that during training, the models tend to have similar predictions, indicating that the network diversity we wanted to leverage using uncertainty quantification models is not (high) enough for improvements on the model stealing task. |
🔗 |
-
|
The Challenge of Differentially Private Screening Rules
(
Oral
)
>
link
Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data science. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models. However, despite the increasing need for privacy-preserving models for data analysis, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we discover difficulties in the task of making a useful private screening rule due to the amount of noise added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself and not the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method is an open problem in the differential privacy literature.
|
🔗 |
-
|
PAC-Bayesian Adversarially Robust Generalization Bounds for Deep Neural Networks
(
Oral
)
>
link
Deep neural networks (DNNs) are vulnerable to adversarial attacks. It is found empirically that adversarially robust generalization is crucial in establishing defense algorithms against adversarial attacks. Therefore, it is interesting to study the theoretical guarantee of robust generalization. This paper focuses on PAC-Bayes analysis (Neyshabur et al., 2017). The main challenge lies in extending the key ingredient, which is a weight perturbation bound in standard settings, to the robust settings. Existing attempts heavily rely on additional strong assumptions, leading to loose bounds. In this paper, we address this issue and provide a spectrally-normalized robust generalization bound for DNNs. Our bound is at least as tight as the standard generalization bound, differing only by a factor of the perturbation strength $\epsilon$. In comparison to existing robust generalization bounds, our bound offers two significant advantages: 1) it does not depend on additional assumptions, and 2) it is considerably tighter. We present a framework that enables us to derive more general results. Specifically, we extend the main result to 1) adversarial robustness against general non-$\ell_p$ attacks, and 2) other neural network architectures, such as ResNet.
|
🔗 |
-
|
Sentiment Perception Adversarial Attacks on Neural Machine Translation Systems
(
Oral
)
>
link
With the advent of deep learning methods, Neural Machine Translation (NMT) systems have become increasingly powerful. However, deep learning based systems are susceptible to adversarial attacks, where imperceptible changes to the input can cause undesirable changes at the output of the system. To date there has been little work investigating adversarial attacks on sequence-to-sequence systems, such as NMT models. Previous work in NMT has examined attacks with the aim of introducing target phrases in the output sequence. In this work, adversarial attacks for NMT systems are explored from an output perception perspective. Thus the aim of an attack is to change the perception of the output sequence, without altering the perception of the input sequence. For example, an adversary may distort the sentiment of translated reviews to have an exaggerated positive sentiment. In practice it is challenging to run extensive human perception experiments, so a proxy deep-learning classifier applied to the NMT output is used to measure perception changes. Experiments demonstrate that the sentiment perception of NMT systems' output sequences can be changed significantly with small imperceptible changes to input sequences. |
🔗 |
-
|
(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
(
Oral
)
>
link
We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give \emph{estimates} that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. Our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100\% of the time. The bound is inspired by $\mathcal{H}\Delta\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous guarantees. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a ``disagreement loss'' which is theoretically justified and performs better in practice. Across a wide range of benchmarks, our method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines.
|
🔗 |
-
|
Feature Partition Aggregation: A Fast Certified Defense Against a Union of $\ell_0$ Attacks
(
Oral
)
>
link
Sparse or $\ell_0$ adversarial attacks arbitrarily perturb an unknown subset of the features. $\ell_0$ robustness analysis is particularly well-suited for heterogeneous (tabular) data where features have different types or scales. State-of-the-art $\ell_0$ certified defenses are based on randomized smoothing and apply to evasion attacks only. This paper proposes feature partition aggregation (FPA) - a certified defense against the union of $\ell_0$ evasion, backdoor, and poisoning attacks. FPA generates its stronger robustness guarantees via an ensemble whose submodels are trained on disjoint feature sets. Compared to state-of-the-art $\ell_0$ defenses, FPA is up to $3,000\times$ faster and provides median robustness guarantees up to $4\times$ larger, meaning FPA provides the additional dimensions of robustness essentially for free.
|
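A minimal sketch of the ensemble structure behind FPA: submodels are trained on disjoint feature subsets and predictions are aggregated by plurality vote, and the vote gap then yields a simple robustness count. The half-the-gap certificate below is a simplified illustration (assuming integer class labels), not the paper's exact guarantee.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fpa(X, y, n_submodels=5, seed=0):
    """Train submodels on disjoint feature partitions (sketch)."""
    rng = np.random.default_rng(seed)
    partition = np.array_split(rng.permutation(X.shape[1]), n_submodels)
    models = [LogisticRegression(max_iter=1000).fit(X[:, idx], y) for idx in partition]
    return models, partition

def fpa_predict(models, partition, x):
    """Plurality vote over submodels, plus a naive robustness gap (sketch)."""
    votes = np.array([m.predict(x[:, idx]) for m, idx in zip(models, partition)])  # (k, n)
    preds, gaps = [], []
    for col in votes.T:
        counts = np.bincount(col)
        order = np.sort(counts)
        preds.append(int(counts.argmax()))
        runner_up = order[-2] if len(order) > 1 else 0
        # each perturbed feature can flip at most one submodel's vote
        gaps.append(int(order[-1] - runner_up) // 2)
    return np.array(preds), np.array(gaps)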
🔗 |
-
|
Near Optimal Adversarial Attack on UCB Bandits
(
Oral
)
>
link
I study a stochastic multi-arm bandit problem where rewards are subject to adversarial corruption. I propose a novel attack strategy that manipulates a learner employing the upper-confidence-bound (UCB) algorithm into pulling some non-optimal target arm $T - o(T)$ times with a cumulative cost that scales as $\hat{O}(\sqrt{\log T})$, where $T$ is the number of rounds. I also prove the first lower bound on the cumulative attack cost. The lower bound matches the upper bound up to $O(\log \log T)$ factors, showing the proposed attack strategy to be near optimal.
|
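A rough sketch of the style of reward-corruption attack the abstract refers to, in the spirit of prior attacks on UCB (e.g., Jun et al., 2018): whenever the learner pulls a non-target arm, the observed reward is pushed down just enough that the arm's empirical mean stays below the target arm's mean by a margin. The margin schedule below is a placeholder assumption, not the paper's near-optimal strategy.

import numpy as np

def corrupt_reward(pulled_arm, reward, target_arm, emp_mean, emp_count, t):
    """Return a corrupted reward for the pulled arm (sketch of a UCB attack).

    emp_mean / emp_count: per-arm running statistics the attacker maintains,
    updated outside this function with the corrupted rewards it returns.
    """
    if pulled_arm == target_arm:
        return reward  # never corrupt the target arm
    n = emp_count[pulled_arm] + 1
    margin = np.sqrt(2 * np.log(t + 1) / n)      # placeholder confidence-style margin
    ceiling = emp_mean[target_arm] - 2 * margin  # keep this arm's mean below the target's
    # choose the corrupted reward so the arm's new empirical mean is at most `ceiling`
    return min(reward, n * ceiling - emp_count[pulled_arm] * emp_mean[pulled_arm])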
🔗 |
-
|
Learning Exponential Families from Truncated Samples
(
Oral
)
>
link
Missing data problems have many manifestations across many scientific fields. A fundamental type of missing data problem arises when samples are \textit{truncated}, i.e., samples that lie in a subset of the support are not observed. Statistical estimation from truncated samples is a classical problem in statistics which dates back to Galton, Pearson, and Fisher. A recent line of work provides the first efficient estimation algorithms for the parameters of a Gaussian distribution and for linear regression with Gaussian noise. In this paper we generalize these results to log-concave exponential families. We provide an estimation algorithm that shows that \textit{extrapolation} is possible for a much larger class of distributions while maintaining polynomial sample and time complexity. Our algorithm is based on Projected Stochastic Gradient Descent and is not only applicable in a more general setting but is also simpler and more efficient than recent algorithms. Our work also has interesting implications for learning general log-concave distributions and sampling given only access to truncated data. |
🔗 |
-
|
Identifying Adversarially Attackable and Robust Samples
(
Oral
)
>
link
Adversarial attacks insert small, imperceptible perturbations into input samples that cause large, undesired changes to the output of deep learning models. Despite extensive research on generating adversarial attacks and building defense systems, there has been limited research on understanding adversarial attacks from an input-data perspective. This work introduces the notion of sample attackability, where we aim to identify samples that are most susceptible to adversarial attacks (attackable samples) and, conversely, the least susceptible samples (robust samples). We propose a deep-learning-based detector to identify the adversarially attackable and robust samples in an unseen dataset for an unseen target model. Experiments on standard image classification datasets enable us to assess the portability of the deep attackability detector across a range of architectures. We find that the deep attackability detector performs better than simple model uncertainty-based measures for identifying the attackable/robust samples. This suggests that uncertainty is an inadequate proxy for measuring sample distance to a decision boundary. In addition to advancing our understanding of adversarial attacks, the ability to identify the adversarially attackable and robust samples has implications for improving the efficiency of sample-selection tasks. |
🔗 |
-
|
Toward Testing Deep Learning Library via Model Fuzzing
(
Oral
)
>
link
The irreversible tendency to empower industry with deep learning (DL) capabilities is raising new security challenges. A DL-based system will be vulnerable to serious attacks if the vulnerabilities of underlying DL frameworks (e.g., TensorFlow, PyTorch) are exploited. It is crucial to test DL frameworks to bridge the gap between security requirements and deployment urgency. A specifically designed model fuzzing method will be used in my research project to address this challenge. First, we generate diverse models to test library implementations in the training and prediction phases using optimized mutation strategies. Furthermore, we consider a seed performance score, including coverage, discovery time, and number of mutations, when selecting model seeds with higher priority. Our algorithm also selects the optimal mutation strategy based on heuristics to expand inconsistencies. Finally, to evaluate the effectiveness of our scheme, we implement our test framework and conduct experiments on PyTorch, TensorFlow, and Theano. The preliminary results demonstrate that this is a promising direction that is worth further research. |
🔗 |
-
|
Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey
(
Oral
)
>
link
Explainable artificial intelligence (XAI) methods are portrayed as a remedy for debugging and trusting statistical and deep learning models, as well as interpreting their predictions. However, recent advances in adversarial machine learning highlight the limitations and vulnerabilities of state-of-the-art explanations, putting their security and trustworthiness into question. The possibility of manipulating, fooling or fairwashing evidence of the model's reasoning has detrimental consequences when applied in high-stakes decision-making and knowledge discovery. This concise survey of over 50 papers summarizes research concerning adversarial attacks on explanations of machine learning models, as well as fairness metrics. We discuss how to defend against attacks and design robust interpretation methods. We contribute a list of existing insecurities in XAI and outline the emerging research directions in adversarial XAI (AdvXAI). |
🔗 |
-
|
Sharpness-Aware Minimization Alone can Improve Adversarial Robustness
(
Oral
)
>
link
Sharpness-Aware Minimization (SAM) is an effective method for improving generalization ability by regularizing loss sharpness. In this paper, we explore SAM in the context of adversarial robustness. We find that using only SAM can achieve superior adversarial robustness without sacrificing clean accuracy compared to standard training, which is an unexpected benefit. We also discuss the relation between SAM and adversarial training (AT), a popular method for improving the adversarial robustness of DNNs. In particular, we show that SAM and AT differ in terms of perturbation strength, leading to different accuracy and robustness trade-offs. We provide theoretical evidence for these claims in a simplified model. Finally, while AT suffers from decreased clean accuracy and computational overhead, we suggest that SAM can be regarded as a lightweight substitute for AT under certain requirements. Code is available at https://github.com/weizeming/SAM_AT. |
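A minimal PyTorch sketch of one SAM update (perturb the weights along the gradient direction within an L2 ball of radius rho, then descend using the gradient computed at the perturbed point); this is the generic SAM step, not the paper's exact training recipe, and rho is an illustrative default.

import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step (sketch)."""
    # 1) ascent step: perturb weights toward higher loss within an L2 ball of radius rho
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()
    # 2) descent step: gradient at the perturbed weights, applied to the original weights
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)  # restore the original weights before the optimizer update
    optimizer.step()
    optimizer.zero_grad()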
🔗 |
-
|
On feasibility of intent obfuscating attacks
(
Oral
)
>
link
Intent obfuscation is a common tactic in adversarial situations, enabling the attacker to both manipulate the target system and avoid culpability. Surprisingly, it has rarely been implemented in adversarial attacks on machine learning systems. We are the first to propose incorporating intent obfuscation in generating adversarial examples for object detectors: by perturbing another non-overlapping object to disrupt the target object, the attacker hides their intended target. We conduct a randomized experiment on 5 prominent detectors---YOLOv3, SSD, RetinaNet, Faster R-CNN, and Cascade R-CNN---using both targeted and untargeted attacks and achieve success on all models and attacks. We analyze the success factors characterizing intent obfuscating attacks, including target object confidence and perturb object sizes. We then demonstrate that the attacker can exploit these success factors to increase success rates for all models and attacks. Finally, we discuss known defenses and legal repercussions. |
🔗 |
-
|
Adversarial Training with Generated Data in High-Dimensional Regression: An Asymptotic Study
(
Oral
)
>
link
In recent years, studies such as \cite{carmon2019unlabeled,gowal2021improving,xing2022artificial} have demonstrated that incorporating additional real or generated data with pseudo-labels can enhance adversarial training through a two-stage training approach. In this paper, we perform a theoretical analysis of the asymptotic behavior of this method in high-dimensional linear regression. While a double-descent phenomenon can be observed in ridgeless training, with an appropriate $\mathcal{L}_2$ regularization, the two-stage adversarial training achieves a better performance. Finally, we derive a shortcut cross-validation formula specifically tailored for the two-stage training method.
|
🔗 |