Poster in Workshop: Next Generation of AI Safety
Can Go AIs be adversarially robust?
Tom Tseng · Euan McLean · Kellin Pelrine · Tony Wang · Adam Gleave
Keywords: [ Go ] [ Reinforcement Learning ] [ Alignment ] [ Robustness ]
Prior work found that superhuman Go AIs such as KataGo are vulnerable to opponents playing simple adversarial strategies. This shows that superhuman average-case capabilities may not translate to satisfactory worst-case robustness. However, Go AIs were never designed with security in mind, raising the question: can simple defenses make KataGo robust? In this paper, we test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that these defenses protect against previously discovered attacks, but we uncover several qualitatively distinct adversarial strategies that beat our defended agents. Our results suggest that achieving robustness is challenging, even in narrow domains such as Go. Our code is available at [anonymized].
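The second defense the abstract names, iterated adversarial training, alternates between training an attacker against the current victim and fine-tuning the victim against that attacker. Below is a minimal Python sketch of that loop, for illustration only; every name in it (`Agent`, `train_adversary`, `finetune_victim`) is a hypothetical placeholder, not the authors' actual training code or the KataGo API.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Stand-in for a Go-playing policy (e.g., a network checkpoint)."""
    name: str
    trained_against: list = field(default_factory=list)

def train_adversary(victim: Agent, iteration: int) -> Agent:
    # Placeholder: in practice, train an attacker policy with RL against a
    # frozen copy of the victim until it wins reliably.
    return Agent(name=f"adversary-{iteration}")

def finetune_victim(victim: Agent, adversary: Agent) -> Agent:
    # Placeholder: fine-tune the victim on games exhibiting the adversary's
    # exploit so the victim learns to defend against it.
    victim.trained_against.append(adversary.name)
    return victim

def iterated_adversarial_training(victim: Agent, rounds: int) -> Agent:
    """Alternate attack and defense: each round trains a fresh adversary
    against the current victim, then hardens the victim against it."""
    for i in range(rounds):
        adversary = train_adversary(victim, i)
        victim = finetune_victim(victim, adversary)
    return victim

if __name__ == "__main__":
    hardened = iterated_adversarial_training(Agent(name="victim-base"), rounds=3)
    print(hardened.trained_against)  # adversaries the victim was hardened against
```

The paper's finding is that even after such a loop, new adversaries can still be trained that defeat the hardened victim with qualitatively new strategies.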