Prioritized Model Experience Replay
Abstract
Model-based reinforcement learning (MBRL) improves sample efficiency by leveraging learned dynamics models, but often suffers from unstable training due to a train–query mismatch in dynamics model learning: the model is trained on data collected by historical policies while being queried under the continually updated current policy. This mismatch can cause policy-relevant local model error to remain large even as global prediction error decreases, leading to oscillatory policy updates. We present a finite-horizon performance analysis that decomposes the policy performance gap into global model error, policy-induced distribution shift, and historical policy mixture effects, showing that minimizing global error alone is insufficient for stable optimization. Motivated by this analysis, we propose Prioritized Model Experience Replay (PMER), a lightweight replay mechanism that prioritizes high-error transitions during dynamics model training. PMER implicitly emphasizes policy-relevant regions without explicit policy distance estimation and integrates seamlessly into Dyna-style MBRL frameworks. Experiments on MuJoCo benchmarks demonstrate improved stability, faster convergence, and higher sample efficiency.
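To make the prioritization idea concrete, the following Python sketch shows one plausible way a replay buffer could weight transitions by dynamics-model prediction error; it is an illustrative assumption, not the paper's implementation, and the names PrioritizedModelBuffer, alpha, and eps are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code): sample transitions for
# dynamics model training with probability proportional to their most
# recent model prediction error.
import numpy as np

class PrioritizedModelBuffer:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity                      # max stored transitions
        self.alpha = alpha                            # priority exponent
        self.eps = eps                                # keeps priorities positive
        self.data = []                                # (s, a, s_next) tuples
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current max priority so they are sampled
        # at least once before their model error has been measured.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.priorities[: len(self.data)] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, model_errors):
        # model_errors: per-transition prediction error of the dynamics
        # model, e.g. ||f_theta(s, a) - s_next||; larger error means the
        # transition is replayed more often during model training.
        self.priorities[idx] = np.abs(model_errors) + self.eps
```

In this sketch, transitions where the learned dynamics model is least accurate are resampled more frequently, which is one way to emphasize policy-relevant regions without estimating any explicit policy distance.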