Poster in Workshop: 2nd AI for Math Workshop @ ICML 2025

CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for Post-Training Language Models

Zhanke Zhou ⋅ Xiangyu Lu ⋅ Chentao Cao ⋅ Brando Miranda ⋅ Tongliang Liu ⋅ Bo Han ⋅ Sanmi Koyejo


Abstract

Large language models (LLMs) increasingly rely on reinforcement learning (RL) post-training to improve step-by-step reasoning. In this setting, Group Relative Policy Optimization (GRPO) has emerged as a prevailing approach that avoids the need for fully supervised traces. However, GRPO can struggle with high-difficulty tasks, overfit to easy problems, and be sensitive to reward design. To diagnose these weaknesses, we introduce a general analysis framework that maps training trajectories onto an advantage-confidence plane, revealing three critical phenomena: (1) advantage contraction: reward-normalized advantages collapse as accuracy improves; (2) confidence saturation: policies become overconfident even on incorrect outputs; and (3) hierarchical convergence: easy problems are quickly mastered while harder ones lag. Based on these insights, we propose CoDaPO (Confidence- and Difficulty-Adaptive Policy Optimization), an RL algorithm that uses a correctness-based reward and reweights advantages according to confidence and difficulty. Experiments on several benchmarks demonstrate that CoDaPO achieves higher reasoning accuracy and better generalization than existing RL approaches.
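
As a rough illustration of the advantage contraction phenomenon, the sketch below assumes the commonly described GRPO formulation, in which each response's reward is standardized within its sampled group, together with a binary correctness reward. The abstract does not give the authors' exact definitions, so the function names, the group size of 8, and the `eps` constant are hypothetical choices made for this example only. This is not the CoDaPO algorithm itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Assumed GRPO-style advantage: (r_i - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mean_abs_advantage(num_correct, group_size=8):
    """Average advantage magnitude for a group with binary (0/1) rewards."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    advantages = group_relative_advantages(rewards)
    return sum(abs(a) for a in advantages) / group_size


if __name__ == "__main__":
    # More correct responses per group stands in for higher accuracy.
    for num_correct in (4, 6, 7, 8):
        print(num_correct, round(mean_abs_advantage(num_correct), 3))
```

Under these assumptions, the average advantage magnitude shrinks as more responses in a group are correct, and it is exactly zero once every response is correct, so such a group contributes no learning signal.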
