Poster in Workshop: 2nd AI for Math Workshop @ ICML 2025

Beyond Accuracy: A Policy Gradient Reweighting Approach for Pass@K Maximization in LLMs

Sadegh Mahdavi ⋅ Muchen Li ⋅ Kaiwen Liu ⋅ Renjie Liao ⋅ Christos Thrampoulidis


Abstract

Recent work has demonstrated that reinforcement learning (RL) can substantially improve large language models (LLMs) for mathematical reasoning. However, most RL fine-tuning strategies optimize for single-sample accuracy (Pass@1), even though many practical applications rely on multi-sample inference (Pass@K). In this paper, we derive a principled RL objective that directly maximizes the expected Pass@K metric. Our approach formulates Pass@K maximization as a policy gradient objective in which harder examples (i.e., those with a lower probability of success) are emphasized more during training. We connect our objective to Focal Loss from supervised learning and demonstrate its effectiveness across both Rejection-Fine-Tuning and GRPO algorithms. Experiments on mathematical benchmarks and synthetic arithmetic benchmarks show improvements in Pass@K over standard RL baselines. Our method provides a simple yet effective way to better align RL fine-tuning with the practical usage of LLMs.
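As a rough illustration of why a Pass@K objective emphasizes harder examples, assume the K samples for a problem are drawn independently and each is correct with probability p. Then Pass@K = 1 - (1 - p)^K, and its derivative with respect to p is K(1 - p)^(K-1), a factor that grows as p shrinks. The sketch below computes both quantities; the function names are hypothetical, and this is a simplified reading of the abstract rather than the exact estimator derived in the paper.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    assuming each sample succeeds with probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_weight(p: float, k: int) -> float:
    """Derivative of pass_at_k with respect to p.

    Under the independence assumption, this is the factor that would
    rescale the ordinary Pass@1 gradient for a given problem.
    """
    return k * (1.0 - p) ** (k - 1)


if __name__ == "__main__":
    k = 4
    for p in (0.1, 0.5, 0.9):
        # Lower p (harder problem) yields a larger weight.
        print(f"p={p}: pass@{k}={pass_at_k(p, k):.4f}, "
              f"weight={pass_at_k_weight(p, k):.4f}")
```

With K = 1 the weight is 1 for every problem, which recovers the usual Pass@1 objective. The down-weighting of already-easy problems by a power of (1 - p) is also the shape of the modulating factor in Focal Loss, which is consistent with the connection the abstract mentions.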
