Session
Adversarial Examples
Theoretically Principled Trade-off between Robustness and Accuracy
Hongyang Zhang · Yaodong Yu · Jiantao Jiao · Eric Xing · Laurent El Ghaoui · Michael Jordan
We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying the trade-off. In this work, we quantify the trade-off in terms of the gap between the risk for adversarial examples and the risk for non-adversarial examples. The challenge is to provide tight bounds on this quantity in terms of a surrogate loss. We give an optimal upper bound on this quantity in terms of classification-calibrated loss, which matches the lower bound in the worst case. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally on real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge, in which we won 1st place out of ~2,000 submissions, surpassing the runner-up approach by 11.41% in terms of mean L_2 perturbation distance.
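As a rough illustration of how such a trade-off can be instantiated, the sketch below shows a TRADES-style training objective in PyTorch: a cross-entropy term for accuracy plus a KL-divergence term, maximized over a small perturbation ball, for robustness. The step sizes, iteration count, and weighting beta are illustrative assumptions, not the authors' reference settings.

```python
# A minimal PyTorch sketch of a TRADES-style objective (illustrative only;
# hyperparameters and the inner-attack schedule are assumptions).
import torch
import torch.nn.functional as F

def trades_loss(model, x, y, eps=0.031, step_size=0.007, steps=10, beta=6.0):
    # Natural (accuracy) term uses the clean input.
    logits_nat = model(x)

    # Inner maximization: find x_adv that maximizes KL(p(x) || p(x_adv)).
    x_adv = x.detach() + 0.001 * torch.randn_like(x)
    p_nat = F.softmax(logits_nat.detach(), dim=1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x_adv), dim=1), p_nat,
                      reduction="batchmean")
        grad = torch.autograd.grad(kl, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)

    # Outer minimization: accuracy term + beta-weighted robustness term.
    robust_kl = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                         F.softmax(logits_nat, dim=1),
                         reduction="batchmean")
    return F.cross_entropy(logits_nat, y) + beta * robust_kl
```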
The Odds are Odd: A Statistical Test for Detecting Adversarial Examples
Kevin Roth · Yannic Kilcher · Thomas Hofmann
We investigate conditions under which test statistics exist that can reliably detect examples that have been adversarially manipulated in a white-box attack. These statistics can be easily computed and calibrated by randomly corrupting inputs. They exploit certain anomalies that adversarial attacks introduce, in particular when the attacks follow the paradigm of choosing perturbations optimally under p-norm constraints. Access to the log-odds is the only requirement for defending models. We justify our approach empirically, but also provide conditions under which detectability via the suggested test statistics is guaranteed to be effective. In our experiments, we show that it is even possible to correct test-time predictions for adversarial attacks with high accuracy.
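As a rough sketch of how such a statistic might be computed, the snippet below corrupts an input with Gaussian noise and measures how the log-odds between the predicted class and the alternatives shift. The noise scale, sample count, and the absence of the per-class calibration described by the authors are simplifying assumptions.

```python
# A minimal sketch of a noise-perturbed log-odds statistic in the spirit of
# the paper (noise scale and sample count are assumptions; the calibrated
# decision rule is omitted).
import torch

def logit_gaps(model, x, y):
    """Pairwise log-odds f_z(x) - f_y(x) relative to the predicted class y."""
    logits = model(x)
    return logits - logits.gather(1, y.view(-1, 1))

def noise_shift_statistic(model, x, sigma=0.05, n_samples=64):
    """Expected change of the log-odds under additive Gaussian noise."""
    with torch.no_grad():
        y = model(x).argmax(dim=1)
        base = logit_gaps(model, x, y)
        shift = torch.zeros_like(base)
        for _ in range(n_samples):
            x_noisy = (x + sigma * torch.randn_like(x)).clamp(0, 1)
            shift += logit_gaps(model, x_noisy, y) - base
        # Compare this statistic against thresholds calibrated on clean data.
        return shift / n_samples
```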
ME-Net: Towards Effective Adversarial Robustness with Matrix Estimation
Yuzhe Yang · Guo Zhang · Zhi Xu · Dina Katabi
Deep neural networks are vulnerable to adversarial attacks. The literature is rich with algorithms that can easily craft successful adversarial examples. In contrast, the performance of defense techniques still lags behind. This paper proposes ME-Net, a defense method that leverages matrix estimation (ME). In ME-Net, images are preprocessed in two steps: first, pixels are randomly dropped from the image; then, the image is reconstructed using ME. We show that this process destroys the adversarial structure of the noise while reinforcing the global structure of the original image. Since humans typically rely on such global structures when classifying images, the process makes the network more compatible with human perception. We conduct comprehensive experiments on prevailing benchmarks such as MNIST, CIFAR-10, SVHN, and Tiny-ImageNet. Comparing ME-Net with state-of-the-art defense mechanisms shows that ME-Net consistently outperforms prior techniques, improving robustness against both black-box and white-box attacks.
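The two-step preprocessing can be sketched as follows; here a truncated SVD stands in for the matrix-estimation solvers used in the paper, and the keep probability and rank are illustrative assumptions.

```python
# A minimal sketch of ME-Net-style preprocessing: random pixel dropping
# followed by a simple matrix-estimation step. Truncated SVD serves as one
# simple ME choice; keep_prob and rank are assumptions.
import numpy as np

def menet_preprocess(img, keep_prob=0.5, rank=8):
    """img: 2-D grayscale array in [0, 1]; returns the reconstructed image."""
    mask = (np.random.rand(*img.shape) < keep_prob).astype(img.dtype)
    observed = img * mask / keep_prob        # unbiased estimate of the full matrix
    u, s, vt = np.linalg.svd(observed, full_matrices=False)
    s[rank:] = 0.0                           # keep only the dominant global structure
    return np.clip(u @ np.diag(s) @ vt, 0.0, 1.0)
```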
Certified Adversarial Robustness via Randomized Smoothing
Jeremy Cohen · Elan Rosenfeld · Zico Kolter
Recent work has used randomization to create classifiers that are provably robust to adversarial perturbations with small L2 norm. However, existing guarantees for such classifiers are unnecessarily loose. In this work we provide the first tight analysis for these "randomized smoothing" classifiers. We then use the method to train an ImageNet classifier with e.g. a provable top-1 accuracy of 59% under adversarial perturbations with L2 norm less than 57/255. No other provable adversarial defense has been shown to be feasible on ImageNet. On the smaller-scale datasets where alternative approaches are viable, randomized smoothing outperforms all alternatives by a large margin. While our specific method can only certify robustness in L2 norm, the empirical success of the approach suggests that provable methods based on randomization are a promising direction for future research into adversarially robust classification.
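The prediction-and-certification procedure can be sketched as follows: classify many Gaussian-noised copies of the input, take a majority vote, and certify an L2 radius of sigma * Phi^{-1}(p_A). The sample count and the plug-in estimate of p_A (rather than the confidence bound used in the paper) are simplifying assumptions.

```python
# A minimal sketch of randomized smoothing: majority vote over noisy copies
# plus a certified L2 radius. Sample count and the plug-in estimate of p_A
# are simplifications of the paper's procedure.
import torch
from scipy.stats import norm

def smoothed_predict(model, x, sigma=0.25, n=1000, num_classes=10):
    """x: a single image with batch dimension 1."""
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)
            counts[model(noisy).argmax(dim=1)] += 1
    top = counts.argmax().item()
    p_a = min(counts[top].item() / n, 1 - 1e-6)   # plug-in estimate of p_A
    radius = sigma * norm.ppf(p_a) if p_a > 0.5 else 0.0
    return top, radius                            # certified for ||delta||_2 < radius
```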
Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition
Yao Qin · Nicholas Carlini · Garrison Cottrell · Ian Goodfellow · Colin Raffel
Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples on speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes progress on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining 100% targeted success rate on arbitrary full-sentence targets. Then, we make progress towards physical-world audio adversarial examples by constructing perturbations which remain effective even after applying highly-realistic simulated environmental distortions.
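A schematic of the two-term objective might look as follows; both `asr_loss` and `masking_threshold` are hypothetical placeholders standing in for an ASR model's loss and a full psychoacoustic model, and the weighting alpha is an assumption.

```python
# A schematic PyTorch sketch of the objective described above: an attack term
# that forces the target transcription plus a penalty on perturbation energy
# above a psychoacoustic masking threshold. `asr_loss` and `masking_threshold`
# are hypothetical placeholders.
import torch

def imperceptible_attack_loss(delta, audio, target, asr_loss, masking_threshold,
                              alpha=0.05):
    # Term 1: make the ASR system output the target transcription.
    attack = asr_loss(audio + delta, target)
    # Term 2: penalize perturbation power (per STFT bin) that exceeds the
    # frequency-masking threshold of the original audio, keeping the
    # perturbation inaudible to humans.
    spec = torch.stft(delta, n_fft=512, return_complex=True).abs() ** 2
    excess = torch.relu(spec - masking_threshold(audio))
    return attack + alpha * excess.mean()
```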
Parsimonious Black-Box Adversarial Attacks via Efficient Combinatorial Optimization
Seungyong Moon · Gaon An · Hyun Oh Song
Solving for adversarial examples with projected gradient descent has been demonstrated to be highly effective at fooling neural-network-based classifiers. In the black-box setting, however, the attacker has only query access to the network, and solving for a successful adversarial example becomes much more difficult. To this end, recent methods aim to estimate the true gradient signal from input queries, but at the cost of an excessive number of queries.
We propose an efficient discrete surrogate for the optimization problem that does not require estimating the gradient and is consequently free of the first-order update hyperparameters that would otherwise need tuning. Our experiments on CIFAR-10 and ImageNet show state-of-the-art black-box attack performance with a significant reduction in the number of required queries compared to a number of recently proposed methods.
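A minimal sketch of the discrete search idea, under illustrative assumptions about block size and scheduling, is shown below: the perturbation is restricted to vertices of the L_inf ball and block signs are flipped greedily whenever a query reports a higher attack loss.

```python
# A minimal sketch of a block-wise greedy search over vertices of the L_inf
# ball (block size, loss function, and pass count are assumptions).
import numpy as np

def greedy_blockwise_attack(query_loss, x, eps=8 / 255, block=8, passes=3):
    """query_loss(x_adv) -> scalar attack loss obtained from a model query.
    x: H x W x C image in [0, 1], with H and W divisible by `block`."""
    h, w, c = x.shape
    signs = np.ones((h // block, w // block))        # start at one vertex
    def perturb(s):
        full = np.kron(s, np.ones((block, block)))[..., None]
        return np.clip(x + eps * full, 0.0, 1.0)
    best = query_loss(perturb(signs))
    for _ in range(passes):
        for i in range(signs.shape[0]):
            for j in range(signs.shape[1]):
                signs[i, j] *= -1                    # tentatively flip this block
                loss = query_loss(perturb(signs))
                if loss > best:
                    best = loss                      # keep the flip
                else:
                    signs[i, j] *= -1                # revert
    return perturb(signs)
```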
Wasserstein Adversarial Examples via Projected Sinkhorn Iterations
Eric Wong · Frank R Schmidt · Zico Kolter
A rapidly growing area of work has studied the existence of adversarial examples, datapoints that have been perturbed to fool a classifier, but the vast majority of these works have focused primarily on threat models defined by L-p norm-bounded perturbations. In this paper, we propose a new threat model for adversarial attacks based on the Wasserstein distance. In the image classification setting, such distances measure the cost of moving pixel mass, which naturally covers "standard" image manipulations such as scaling, rotation, translation, and distortion (and can potentially be applied to other settings as well). To generate Wasserstein adversarial examples, we develop a procedure for projecting onto the Wasserstein ball, based upon a modified version of the Sinkhorn iteration. The resulting algorithm can successfully attack image classification models, bringing traditional CIFAR-10 models down to 3% accuracy within a Wasserstein ball with radius 0.1 (i.e., moving 10% of the image mass 1 pixel), and we demonstrate that PGD-based adversarial training can improve this adversarial accuracy to 76%. In total, this work opens up a new direction of study in adversarial robustness, more formally considering convex metrics that accurately capture the invariances that we typically believe should exist in classifiers.
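To illustrate what this threat model measures, the snippet below computes an entropy-regularized Wasserstein distance between two small images (treated as pixel-mass distributions) via plain Sinkhorn iterations; it is not the paper's projection operator, and the regularization strength and iteration count are assumptions.

```python
# An illustrative Sinkhorn computation of an entropy-regularized Wasserstein
# distance between two small images as pixel-mass distributions. This shows
# the metric behind the threat model, not the projected Sinkhorn iteration.
import numpy as np

def sinkhorn_distance(img_a, img_b, lam=5.0, iters=200):
    """img_a, img_b: small 2-D non-negative arrays of identical shape."""
    a = img_a.ravel() / img_a.sum()                  # source pixel mass
    b = img_b.ravel() / img_b.sum()                  # target pixel mass
    h, w = img_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    cost = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    K = np.exp(-lam * cost)                          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):                           # Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]               # approximate transport plan
    return (plan * cost).sum()                       # cost of moving pixel mass
```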
Transferable Clean-Label Poisoning Attacks on Deep Neural Nets
Chen Zhu · W. Ronny Huang · Hengduo Li · Gavin Taylor · Christoph Studer · Tom Goldstein
In this paper, we explore clean-label poisoning attacks on neural networks with access to neither the network's output nor its parameters. We deal with the case of transfer learning, where the network is initialized from a model pre-trained on a certain dataset and only its last layer is re-trained on the targeted dataset. The task is to make the re-trained model classify the target image into a target class. To achieve this goal, we generate multiple poison images from the target class by adding small perturbations to clean images. These poison images form a convex hull around the target image in feature space, which guarantees that the target image is misclassified when this condition is perfectly satisfied.
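A schematic of this objective might look as follows: poison perturbations and convex-combination coefficients are optimized jointly so that the target's features are (approximately) a convex combination of the poisons' features. The feature extractor, the perturbation budget, and the optimizer settings are assumptions, not the authors' full procedure.

```python
# A schematic PyTorch sketch of the convex-hull idea described above.
# `feat` is an assumed (frozen) feature extractor; eps is an assumed budget.
import torch
import torch.nn.functional as F

def convex_polytope_step(feat, target, poisons, bases, coeff_logits, opt,
                         eps=8 / 255):
    """One optimization step over poison perturbations and convex coefficients."""
    opt.zero_grad()
    c = F.softmax(coeff_logits, dim=0)               # convex combination weights
    f_t = feat(target).detach().flatten()
    f_p = torch.stack([feat(p).flatten() for p in poisons])
    loss = ((f_t - c @ f_p) ** 2).sum() / (f_t ** 2).sum()
    loss.backward()
    opt.step()
    with torch.no_grad():                            # keep poisons near clean bases
        for p, b in zip(poisons, bases):
            p.copy_(torch.min(torch.max(p, b - eps), b + eps).clamp(0, 1))
    return loss.item()
```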
NATTACK: Learning the Distributions of Adversarial Examples for an Improved Black-Box Attack on Deep Neural Networks
Yandong Li · Lijun Li · Liqiang Wang · Tong Zhang · Boqing Gong
Powerful adversarial attack methods are vital for understanding how to construct robust deep neural networks (DNNs) and for thoroughly testing defense techniques. In this paper, we propose a black-box adversarial attack algorithm that can defeat both vanilla DNNs and those defended by various recently developed techniques. Instead of searching for an "optimal" adversarial example for a benign input to a targeted DNN, our algorithm finds a probability density distribution over a small region centered around the input, such that a sample drawn from this distribution is likely an adversarial example, without the need to access the DNN's internal layers or weights. Our approach is universal in that it can successfully attack different neural networks with a single algorithm. It is also strong: tested against 2 vanilla DNNs and 13 defended ones, it outperforms state-of-the-art black-box and white-box attack methods in most test cases. Additionally, our results reveal that adversarial training remains one of the best defense techniques, and that adversarial examples are not as transferable across defended DNNs as they are across vanilla DNNs.
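A simplified NES-style sketch of the distributional idea is shown below: a Gaussian over perturbations is maintained, the gradient of the expected attack loss is estimated from queries, and the mean is updated. The latent parameterization and projection used in the paper are omitted, and the sample size, sigma, and learning rate are assumptions.

```python
# A simplified NES-style sketch of learning a distribution over perturbations
# from black-box queries (hyperparameters and parameterization are assumptions).
import numpy as np

def distributional_attack(query_loss, x, eps=8 / 255, sigma=0.1,
                          lr=0.02, samples=50, iters=100):
    """query_loss(x_adv) -> attack loss from a black-box query (higher = better)."""
    mu = np.zeros_like(x)
    for _ in range(iters):
        noise = np.random.randn(samples, *x.shape)
        losses = np.empty(samples)
        for i in range(samples):
            delta = np.clip(mu + sigma * noise[i], -eps, eps)
            losses[i] = query_loss(np.clip(x + delta, 0.0, 1.0))
        z = (losses - losses.mean()) / (losses.std() + 1e-8)   # standardize rewards
        grad = (z.reshape(-1, *([1] * x.ndim)) * noise).mean(axis=0) / sigma
        mu += lr * grad                                        # NES ascent step
    return np.clip(x + np.clip(mu, -eps, eps), 0.0, 1.0)
```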
Simple Black-box Adversarial Attacks
Chuan Guo · Jacob Gardner · Yurong You · Andrew Wilson · Kilian Weinberger
We propose an intriguingly simple method for constructing adversarial images in the black-box setting. In contrast to the white-box scenario, constructing black-box adversarial images imposes an additional constraint on the query budget, and efficient attacks remain an open problem to date. Under the mild assumption that continuous-valued confidence scores are available, our highly query-efficient algorithm utilizes the following simple iterative principle: we randomly sample a vector from a predefined orthonormal basis and either add it to or subtract it from the target image. Despite its simplicity, the proposed method can be used for both untargeted and targeted attacks, resulting in unprecedented query efficiency in both settings. We demonstrate the efficacy and efficiency of our algorithm on several real-world settings, including the Google Cloud Vision API. We argue that our proposed algorithm should serve as a strong baseline for future black-box attacks, in particular because it is extremely fast and its implementation requires less than 20 lines of PyTorch code.
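A minimal sketch of this iterative principle, using the standard pixel basis and an assumed step size and query budget, might look as follows.

```python
# A minimal sketch of the add-or-subtract principle over a fixed orthonormal
# (here: pixel) basis; step size and query budget are assumptions.
import torch

def simple_blackbox_attack(prob_fn, x, y, eps=0.2, budget=10000):
    """prob_fn(img) -> class-probability vector for a single image; y: true label."""
    x_adv = x.clone()
    dims = torch.randperm(x_adv.numel())[:budget]    # random order over pixel basis
    p_best = prob_fn(x_adv)[y]
    for d in dims:
        for sign in (eps, -eps):
            candidate = x_adv.clone()
            candidate.view(-1)[d] = (candidate.view(-1)[d] + sign).clamp(0, 1)
            p = prob_fn(candidate)[y]
            if p < p_best:                           # confidence dropped: keep it
                x_adv, p_best = candidate, p
                break
    return x_adv
```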