Poster

What Reward Structure Enables Efficient Sparse-Reward RL? A Proof-of-Concept with Policy-Aware Matrix Completion

Ibne Farabi Shihab ⋅ SANJEDA AKTER ⋅ Anuj Sharma

Abstract

Work on sparse-reward reinforcement learning typically focuses on exploration, but we ask: can structural assumptions about the reward function itself accelerate learning? We introduce Policy-Aware Matrix Completion (PAMC), which exploits low-rank structure in reward matrices while correcting for policy-induced sampling bias. PAMC combines three components: a low-rank plus sparse reward model, inverse propensity weighting to handle Missing-Not-At-Random (MNAR) data, and confidence-gated abstention that falls back to intrinsic exploration when the completed reward is uncertain. We provide finite-sample theory showing that completion error scales as $O(\sigma\sqrt{r(|\mathcal{S}|+|\mathcal{A}|)/\text{ESS}})$, where ESS is the effective sample size under policy overlap $\kappa$. In a sample-efficiency comparison at 10M steps, PAMC achieves a return of 4100$\pm$250 vs. 200$\pm$50 for DrQ-v2 on Montezuma's Revenge, a 78\% vs. 65\% success rate on MetaWorld-50, and a 15\% improvement over CQL on D4RL datasets. The method incurs 8\% computational overhead while providing calibrated confidence intervals (95\% empirical coverage). When its structural assumptions are violated, PAMC degrades gracefully through increased abstention rather than failing catastrophically. Our approach demonstrates that exploiting reward structure can complement traditional exploration methods in sparse-reward domains.
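To make the inverse-propensity idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tabular reward matrix of shape (states, actions), known observation propensities, and a weighted alternating-least-squares solver used here as a stand-in; the function names `kish_ess` and `ipw_lowrank_complete` are hypothetical. The ESS shown is the Kish effective sample size, one common definition that may differ from the one used in the paper. The sparse component and the confidence-gated abstention are omitted.

```python
import numpy as np


def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2).

    Assumption: one common ESS definition, not necessarily the paper's.
    """
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()


def ipw_lowrank_complete(R_obs, mask, propensity, rank, n_iters=50, ridge=1e-3):
    """Low-rank completion with inverse-propensity weights (illustrative only).

    R_obs      : (S, A) rewards, with any finite value where unobserved
    mask       : (S, A) boolean, True where the reward was observed
    propensity : (S, A) probability that each entry was observed
    Solves a ridge-regularized weighted least-squares problem by
    alternating over the row factors U and the column factors V.
    """
    n_s, n_a = R_obs.shape
    # Rarely visited entries get larger weight; unobserved entries get zero.
    W = mask.astype(float) / np.clip(propensity, 1e-3, None)
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_s, rank))
    V = rng.normal(scale=0.1, size=(n_a, rank))
    reg = ridge * np.eye(rank)
    for _ in range(n_iters):
        for i in range(n_s):  # update the factor of state i
            Vw = V * W[i][:, None]
            U[i] = np.linalg.solve(Vw.T @ V + reg, Vw.T @ R_obs[i])
        for j in range(n_a):  # update the factor of action j
            Uw = U * W[:, j][:, None]
            V[j] = np.linalg.solve(Uw.T @ U + reg, Uw.T @ R_obs[:, j])
    return U @ V.T


# Toy usage: a synthetic rank-2 reward matrix observed non-uniformly.
rng = np.random.default_rng(1)
R_true = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
prop = rng.uniform(0.5, 0.95, size=R_true.shape)
seen = rng.random(R_true.shape) < prop
R_hat = ipw_lowrank_complete(np.where(seen, R_true, 0.0), seen, prop, rank=2)
ess = kish_ess((1.0 / prop)[seen])
```

In this toy setting the ESS is smaller than the raw count of observed entries whenever the weights are uneven, which is the quantity that appears in the denominator of the error bound above.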