Adversarial training and its variants have become the de facto standard for combating adversarial attacks on machine learning models. In this paper, we seek insight into how an adversarially trained deep neural network (DNN) differs from its naturally trained counterpart, focusing on the role of different layers in the network. To this end, we develop a novel method, based on partial adversarial training, to measure and attribute adversarial effectiveness to each layer. We find that, while all layers in an adversarially trained network contribute to robustness, earlier layers play a more crucial role. These conclusions are corroborated by a method of tracking the impact of adversarial perturbations as they flow across the network layers, based on the statistics of "perturbation-to-signal ratios" across layers. While adversarial training results in black-box DNNs that can provide only empirical assurances of robustness, our findings imply that the search for architectural principles for building robustness into training and inference in an interpretable manner could start with the early layers of a DNN.
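The layerwise perturbation tracking described above can be illustrated with a minimal sketch: feed a clean input and its perturbed copy through the network, and at each layer record the ratio of the perturbation's norm to the clean activation's norm. The toy two-layer ReLU network, the weights, and the `layerwise_psr` helper below are illustrative assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of per-layer "perturbation-to-signal ratio" (PSR)
# tracking. Toy network and weights are illustrative assumptions only.
import math

def linear_relu(x, W):
    """One dense layer (weight matrix W as list of rows) followed by ReLU."""
    out = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    return [max(0.0, v) for v in out]

def l2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def layerwise_psr(x_clean, x_adv, weights):
    """Return ||a_adv - a_clean|| / ||a_clean|| at each layer's output."""
    psrs = []
    a_clean, a_adv = x_clean, x_adv
    for W in weights:
        a_clean = linear_relu(a_clean, W)
        a_adv = linear_relu(a_adv, W)
        diff = [u - v for u, v in zip(a_adv, a_clean)]
        psrs.append(l2(diff) / l2(a_clean))
    return psrs

# Toy example: two layers, a small perturbation added to the input.
weights = [[[1.0, 0.5], [0.2, 1.0]],   # layer 1
           [[0.7, 0.3], [0.4, 0.9]]]   # layer 2
x = [1.0, 2.0]
x_pert = [1.0 + 0.01, 2.0 - 0.01]
print(layerwise_psr(x, x_pert, weights))
```

Comparing how these ratios grow or shrink with depth, for naturally versus adversarially trained networks, is the kind of statistic the abstract refers to.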
Can Bakiskan (University of California, Santa Barbara)
Metehan Cekic (University of California, Santa Barbara)
Upamanyu Madhow (University of California, Santa Barbara)