Verifying Meta-Awareness via Predictive Rewards in Reasoning Models
Yoonjeon Kim ⋅ Doohyuk Jang ⋅ Eunho Yang
Abstract
Recent research on reasoning models explores the meta-awareness of language models: their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models rely solely on answer-based verification, we show that adding meta-awareness objectives yields significant performance gains over models trained without such meta-knowledge. **MAPR** introduces a self-generated task of predicting rollout statistics (specifically length, pass rate, and concepts used), allowing the predictions to be verified against the actual statistics. By leveraging this self-predictive capability, the model can further regulate its reasoning behavior by i) filtering out trivial or unsolvable prompts, ii) curtailing lengthy generations that tend to be incorrect, and iii) generating hints relevant to the problem. The results are encouraging: **MAPR** delivers significant improvements in both accuracy and training efficiency across reasoning benchmarks. Specifically, our method accelerates GRPO training by more than 1.28$\times$ to reach the same performance, and achieves an 83.18\% accuracy gain on AIME25 and a 13.04\% average gain over six mathematics benchmarks.
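To make the self-prediction task concrete, the following is a minimal sketch of how a predictive reward could score a model's forecast of its own rollout statistics against the observed ones. The function name, scoring rule, and tolerance are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def meta_prediction_reward(predicted_pass_rate, predicted_length,
                           rollout_correct, rollout_lengths,
                           length_tol=0.25):
    """Hypothetical scoring rule: reward the model's self-prediction by
    how well it matches the actual rollout statistics.

    predicted_pass_rate: model's forecast of its pass rate in [0, 1]
    predicted_length:    model's forecast of its mean response length
    rollout_correct:     list of 0/1 outcomes from sampled rollouts
    rollout_lengths:     token lengths of those rollouts
    """
    actual_pass_rate = sum(rollout_correct) / len(rollout_correct)
    actual_length = statistics.mean(rollout_lengths)
    # Pass-rate agreement: 1 minus absolute error, clipped at zero.
    pass_score = max(0.0, 1.0 - abs(predicted_pass_rate - actual_pass_rate))
    # Length agreement: a relative error within tolerance counts as correct.
    rel_err = abs(predicted_length - actual_length) / max(actual_length, 1.0)
    length_score = 1.0 if rel_err <= length_tol else 0.0
    # Equal-weighted combination of the two agreement terms.
    return 0.5 * (pass_score + length_score)
```

A perfect forecast (e.g. predicting a 0.75 pass rate when 3 of 4 rollouts succeed, with the mean length matched) yields the maximum reward of 1.0; such a signal could then gate the filtering and early-stopping behaviors described above.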