RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization
Kai Fukazawa ⋅ Kunal Mundada ⋅ Iman Soltani
Abstract
In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) is attractive only if policies achieve high returns without catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety through value- or model-based pessimism and restricted policy classes, at the cost of policy expressiveness, whereas diffusion/flow-based expressive generative policies have largely been used in risk-neutral settings. We introduce **Risk-Aware Multimodal Actor-Critic (RAMAC)**, a simple, modular, model-free framework that couples an expressive generative actor (e.g., diffusion/flow) with a distributional critic and optimizes a composite objective combining Conditional Value-at-Risk (CVaR) with behavioral cloning (BC), enabling risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further provide an objective-level analysis showing that controlling behavior divergence via BC suppresses OOD actions and stabilizes CVaR. Instantiating RAMAC with a diffusion actor, we illustrate these insights on a 2-D risky bandit and evaluate on Stochastic-D4RL, observing consistent gains in CVaR$_{0.1}$ while maintaining strong returns.
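To make the composite actor objective concrete, the following is a minimal sketch consistent with the description above; the trade-off weight $\lambda$, the critic's return distribution $Z_\phi$, and the exact form of the BC term are illustrative assumptions, not specifications from the paper:

$$\max_{\theta}\;\; \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta(\cdot\mid s)}\!\left[\mathrm{CVaR}_{\alpha}\!\big(Z_\phi(s,a)\big)\right] \;-\; \lambda\,\mathcal{L}_{\mathrm{BC}}(\theta),$$

where $\mathrm{CVaR}_{\alpha}(Z_\phi(s,a))$ is the expectation over the lower $\alpha$-tail of the return distribution estimated by the distributional critic, and $\mathcal{L}_{\mathrm{BC}}$ is the generative actor's behavioral-cloning loss (e.g., a diffusion denoising loss) on dataset actions, which bounds divergence from the behavior policy and thereby suppresses OOD actions.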