Maximum Likelihood Reinforcement Learning
Fahim Tajwar ⋅ Guanning Zeng ⋅ Yueer Zhou ⋅ Yuda Song ⋅ Daman Arora ⋅ Yiding Jiang ⋅ Jeff Schneider ⋅ Russ Salakhutdinov ⋅ Haiwen Feng ⋅ Andrea Zanette
Abstract
Maximum likelihood is fundamental to supervised learning, but it cannot be directly applied to correctness-based problems with non-differentiable sampling. In these settings, reinforcement learning (RL) is typically used to maximize expected reward. We show that for binary correctness tasks, expected-reward RL is a first-order approximation of the maximum likelihood objective, yielding a vanishing learning signal on low-success inputs. We introduce **Maximum Likelihood Reinforcement Learning (MaxRL)**, a compute-indexed family of sampling-based objectives derived from a pass@k expansion of the likelihood, which interpolates between standard RL and exact maximum likelihood as compute increases. MaxRL admits a simple unbiased policy-gradient estimator, and the objective it optimizes improves with additional compute. Across multiple domains, MaxRL consistently outperforms standard RL and GRPO, achieving higher pass@1 and substantially improved pass@k.
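A minimal sketch of how such an expansion can arise, assuming i.i.d. samples with per-input success probability $p$ (the notation here is ours, reconstructed from the claims above; the paper's exact parameterization may differ):

```latex
% Sketch (assumed notation): let p be the per-sample success probability on a
% given input, with samples drawn i.i.d., so pass@k = 1 - (1-p)^k.
% The log-likelihood of success expands as a series over pass@k terms:
\[
  \log p
    \;=\; \log\bigl(1 - (1-p)\bigr)
    \;=\; -\sum_{k=1}^{\infty} \frac{(1-p)^{k}}{k}
    \;=\; -\sum_{k=1}^{\infty} \frac{1 - \mathrm{pass@}k}{k}.
\]
% Truncating after k = 1 gives log p ≈ p - 1, and maximizing p - 1 is the
% same as maximizing the expected binary reward p, i.e. standard RL recovers
% the first-order term. Retaining more terms (more samples per input, hence
% more compute) moves the objective toward exact maximum likelihood,
% matching the compute-indexed interpolation described above.
```

This also makes the vanishing-signal claim concrete: since $\nabla_\theta \log p = \nabla_\theta p / p$, exact maximum likelihood upweights low-success inputs by a factor of $1/p$, which is precisely the amplification the first-order (expected-reward) objective discards.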