Reflect-then-Correct: Rebalancing Task Optimization for Generalizable Meta-Reinforcement Learning via Distributional Value Error Reduction
Abstract
Meta-Reinforcement Learning (Meta-RL) faces significant challenges in non-parametric settings, where vastly different return scales across diverse tasks cause severe gradient interference. Existing categorical solutions attempt to normalize these scales but often fail due to rigid discretization and quantization errors. To address this, we propose Reflect-then-Correct (RTC), a framework that models meta-values using Sinkhorn divergence. By treating value distributions as sets of adaptive, floating particles, RTC achieves a geometry-aware alignment of distinct meta-task structures. However, while Sinkhorn updates harmonize gradients, they introduce statistical bias through sample-based estimation. RTC overcomes this by "reflecting" on the temporal accumulation of Bellman inconsistencies through a recursive error model and "correcting" the optimization via adaptive importance weights that prioritize transitions critical for value accuracy. We provide theoretical guarantees for this reweighting strategy and demonstrate that RTC outperforms existing baselines on the challenging Meta-World ML-10 and ML-45 benchmarks.
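For context, the sketch below illustrates how a Sinkhorn divergence between two particle-based value distributions can be computed. It is a minimal, generic illustration under stated assumptions (uniform particle weights, a squared-distance ground cost, and a fixed entropic regularizer `eps`), not RTC's actual estimator; all function and variable names are hypothetical.

```python
import numpy as np

def entropic_ot_cost(x, y, eps=0.1, n_iters=200):
    """Entropic OT cost between two 1-D particle sets with uniform weights.

    Generic illustration of Sinkhorn iterations, not the paper's estimator.
    """
    # Squared-distance ground cost between value particles.
    C = (x[:, None] - y[None, :]) ** 2
    K = np.exp(-C / eps)                       # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))          # uniform source weights
    b = np.full(len(y), 1.0 / len(y))          # uniform target weights
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                   # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]            # entropic transport plan
    return float(np.sum(P * C))                # transport cost under P

def sinkhorn_divergence(x, y, eps=0.1):
    # Debiased form: OT_eps(x, y) - 0.5 * OT_eps(x, x) - 0.5 * OT_eps(y, y).
    return (entropic_ot_cost(x, y, eps)
            - 0.5 * entropic_ot_cost(x, x, eps)
            - 0.5 * entropic_ot_cost(y, y, eps))

# Example: predicted vs. target value particles for a single transition.
pred = np.array([0.1, 0.4, 0.9, 1.3])
targ = np.array([0.2, 0.5, 1.0, 1.2])
print(sinkhorn_divergence(pred, targ))
```

Because the particles are free to move rather than fixed to categorical bins, this loss adapts to the return scale of each task, which is the property the abstract refers to as geometry-aware alignment.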