Poster

Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation

Kosuke Nakanishi ⋅ Akihiro Kubo ⋅ Yuji Yasui ⋅ Shin Ishii

2025 Poster


Abstract

Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires alternating between two learning problems: training an adversary to minimize the agent's cumulative reward, and training the agent to counteract that adversary. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.

Lay Summary

Reinforcement learning (RL) is a machine learning approach where an agent learns to make decisions through trial and error. However, in real-world settings, the information the agent receives can be noisy or unreliable, for example because of sensor errors or unexpected changes in the environment. To make agents more robust in such situations, researchers have traditionally trained them alongside an adversary (or perturbation generator) that introduces challenging inputs. In this work, we show that agents can instead $\textit{imagine}$ such challenging scenarios during training, without explicitly training an adversary. Based on this idea, we propose a new method that enables agents to learn efficiently from previously collected data, without requiring additional interaction with the environment. Our approach is supported by theory, namely a symmetry in how the agent and the adversary are evaluated, and performs strongly in experiments. This makes RL more practical and reliable for real-world applications.
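
To make the idea of a challenging input concrete, the following is a minimal toy sketch and not the authors' method. It assumes a hypothetical linear policy and a hand-written critic, and it swaps in plain random search to pick, for each stored state, a bounded observation perturbation that lowers the critic's value of the resulting action. All function names and parameters (`policy`, `q_value`, `worst_case_perturbation`, `eps`, `n_candidates`) are illustrative assumptions. The point it shows is only that such perturbations can be evaluated on previously collected states without any new environment steps.

```python
import numpy as np


def policy(obs, W):
    """Hypothetical deterministic policy: action = tanh(obs @ W.T)."""
    return np.tanh(obs @ W.T)


def q_value(state, action, target_W):
    """Toy critic (an assumption): highest when the action equals
    tanh(state @ target_W.T), and lower the further it deviates."""
    best = np.tanh(state @ target_W.T)
    return -np.sum((action - best) ** 2, axis=-1)


def worst_case_perturbation(states, W, target_W, eps, n_candidates=64, seed=0):
    """Random search over an L-infinity ball of radius eps.

    For each stored state, the agent observes state + delta but the critic
    is evaluated at the true state. Returns the delta with the lowest
    critic value, that value, and the unperturbed value for comparison.
    """
    rng = np.random.default_rng(seed)
    batch, dim = states.shape
    # Candidate perturbations, shape (n_candidates, batch, dim).
    deltas = rng.uniform(-eps, eps, size=(n_candidates, batch, dim))
    deltas[0] = 0.0  # keep the clean observation as one candidate
    actions = policy(states[None] + deltas, W)      # (n, batch, act_dim)
    q = q_value(states[None], actions, target_W)    # (n, batch)
    idx = np.argmin(q, axis=0)                      # worst candidate per state
    rows = np.arange(batch)
    return deltas[idx, rows], q[idx, rows], q[0]


# Usage on a small batch of stored states (random data, for illustration only).
rng = np.random.default_rng(1)
states = rng.normal(size=(5, 3))
W = rng.normal(size=(2, 3))
delta, q_adv, q_clean = worst_case_perturbation(states, W, W, eps=0.1)
```

Because the zero perturbation is always among the candidates, the perturbed value is never higher than the clean one. Random search is used here only for brevity; a gradient-based search would be a natural alternative.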
