Reward-Preserving Counterfactual State Editing for Offline Reinforcement Learning
Abstract
Transformer sequence models such as Decision Transformer can learn strong offline policies from logged trajectories, but they are vulnerable to causal confusion: reliance on spurious correlations that predict reward in the data yet do not reflect the true causal mechanisms of the environment. We propose CSET (Counterfactual State Editing Transformer), which improves robustness in strictly offline reinforcement learning without learning environment transition dynamics. CSET first fits a causal reward model as a conditional variational autoencoder to infer a posterior over reward disturbances for each transition. Conditioning on the factual action and a sampled disturbance, a counterfactual state generator proposes a minimally edited state whose predicted reward matches the factual reward; a normalized move-band constraint and an acceptance gate enforce state plausibility and reward consistency. We then augment trajectories by replacing only the observation token with the edited state while keeping the next observation factual, so the policy is never trained on synthetic successor transitions. On the model side, CSET uses a causally structured hybrid transformer: separate convolutional encoders process the return-to-go, state, and action streams to capture local temporal structure, and a final attention block is softly supervised so that action prediction focuses on its direct causal parents, the state and return-to-go. Experiments on D4RL locomotion and AntMaze tasks and on offline recommendation benchmarks show consistent gains over transformer baselines and substantially improved robustness to injected spurious distractors.
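
To make the editing loop concrete, the sketch below illustrates the move-band constraint and acceptance gate described in the abstract. It is a minimal NumPy illustration under stated assumptions, not the paper's implementation: sample_disturbance, propose_edit, and predicted_reward are hypothetical stubs standing in for CSET's learned CVAE posterior, counterfactual state generator, and causal reward model, and the thresholds (move_band, reward_tol) are illustrative values.

import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical stand-ins for CSET's learned components (not the paper's code) ---

def sample_disturbance(state, action, reward):
    """Stub for the CVAE posterior over per-transition reward disturbances."""
    return rng.normal(0.0, 0.1)

def propose_edit(state, action, disturbance):
    """Stub counterfactual generator: a small disturbance-conditioned state perturbation."""
    return state + 0.05 * disturbance * rng.standard_normal(state.shape)

def predicted_reward(state, action):
    """Stub causal reward model; the real one would be a trained network."""
    return 0.01 * float(state.sum())

def edit_observation(state, action, reward,
                     move_band=0.1, reward_tol=0.05, max_tries=8):
    """Propose a minimally edited state; accept only if it stays within a
    normalized move band of the factual state and its predicted reward
    matches the factual reward (the acceptance gate)."""
    scale = np.maximum(np.abs(state), 1e-6)  # per-dimension normalizer
    for _ in range(max_tries):
        z = sample_disturbance(state, action, reward)
        cf_state = propose_edit(state, action, z)
        # Move-band constraint: normalized edit magnitude must stay small.
        if np.max(np.abs(cf_state - state) / scale) > move_band:
            continue
        # Reward-consistency gate: edited state must predict the factual reward.
        if abs(predicted_reward(cf_state, action) - reward) > reward_tol:
            continue
        return cf_state  # accepted counterfactual
    return state  # no acceptable edit found; keep the factual state

# Augmentation replaces only the observation token; the successor observation
# stays factual, so no synthetic transitions enter training.
s, a, r, s_next = rng.standard_normal(4), rng.standard_normal(2), 0.3, rng.standard_normal(4)
augmented_transition = (edit_observation(s, a, r), a, r, s_next)
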