CausalXRL: Explainable Reinforcement Learning through Causal Graph Reasoning
Abstract
Reinforcement learning is a powerful paradigm for training autonomous agents and has achieved impressive performance in complex environments. However, this success often comes at the cost of interpretability, diminishing trust and complicating efforts to debug and improve agent behavior. To address these challenges, we introduce CausalXRL, a novel framework for explainable reinforcement learning (XRL). A key feature of CausalXRL is its use of causal graph reasoning, which provides transparent, structured, multi-level explanations of agent decision-making. We validate CausalXRL through detailed case studies and a two-part evaluation: (1) a quantitative analysis of agent performance and explanation fidelity in benchmark RL environments, and (2) a qualitative expert study assessing interpretability in a real-time strategy (RTS) game. Results show that CausalXRL enhances human understanding and diagnostic insight in multi-agent scenarios without compromising task performance. By enabling human operators to interrogate RL agents through causal models, CausalXRL makes agent behavior transparent and auditable, thereby supporting alignment.
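To make the core idea concrete, the following is a minimal illustrative sketch, not CausalXRL's actual implementation, of how a causal graph over state features and a chosen action could be queried for a decision-level explanation. The node names, edge strengths, and the explain helper are assumptions introduced here purely for illustration.

# Hypothetical sketch (not the paper's method): rank the causal parents of an
# agent's chosen action by assumed edge strength to produce a simple explanation.
import networkx as nx

# Directed graph over state features, the chosen action, and an outcome.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("enemy_distance", "action:retreat", 0.8),   # assumed causal strengths
    ("low_health",     "action:retreat", 0.6),
    ("ammo_remaining", "action:retreat", 0.1),
    ("action:retreat", "survival",       0.7),
])

def explain(graph, action, top_k=2):
    """Return the top-k direct causal parents of an action, ranked by weight."""
    parents = [(u, graph[u][action]["weight"]) for u in graph.predecessors(action)]
    return sorted(parents, key=lambda p: -p[1])[:top_k]

print(explain(G, "action:retreat"))
# e.g. [('enemy_distance', 0.8), ('low_health', 0.6)]

In such a sketch, a "multi-level" explanation could be obtained by following paths beyond the direct parents (e.g., from features through the action to downstream outcomes), though the concrete mechanism in CausalXRL is described in the body of the paper.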