Poster in Workshop: RLxF: RL from World Feedback
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

When The Verifier Is The Only Trustworthy Feedback Source: A Self-Teacher RLVR Pilot, A Confounded Logprob Extension, And Four Corrective Probes

Ethan Y Wang ⋅ Aayan Alwani

Abstract

Self-teacher reinforcement learning with verifiable rewards (RLVR) is a recurring escape hatch from the post-Yue limit on what RL post-training adds to base language models. The proposal class rests on an empirical premise: the base model is a stronger discriminator of valid reasoning chains than it is a generator of them (the generative AI paradox of West et al.). We pre-registered a behavioral test of this asymmetry on Qwen2.5-1.5B base, using a 51-problem hard subset of AIME 2018-2023 (pass@1024 of zero), a 4-condition design, and a locked decision rule. The pre-registered test fails on every sub-criterion: the locked pivot fired and a planned 7B headline run was canceled. We then extended the failure into a 17-measurement post-hoc claim across 1.5B-72B base, Instruct, and R1-distilled checkpoints, using a logprob protocol that scored the probability of a fixed boxed answer string. The same near-zero gap appeared everywhere, and we framed this as a discrimination ceiling general to the model class. A reviewer subsequently pointed out that the corruption protocol preserved the visible final answer, so by construction the test could not detect chain-content sensitivity. We report four corrective probes: answer-relabeled corruption, an explicit YES-NO verdict test, a prompt-format ablation, and a per-token surprise probe. The model shows chain-content sensitivity at every confirmed scale; Instruct-tuning multiplies verdict discrimination five-fold; and the per-token signal is locally present at the corruption site on every problem. The post-hoc universal-ceiling reading is therefore partially refuted. The pre-registered behavioral test still fails on every sub-criterion, so methods that exploit the asymmetry on the same hard set still fail their gate, but methods that depend only on the per-token or verdict-style discrimination signal are not refuted.
The paper closes with implications for the design of self-teacher RLVR pipelines and for the verifier-vs-base feedback architecture in any RL-with-feedback method that bottlenecks through the base's internal discriminator.
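
To make the per-token surprise probe concrete, here is a minimal sketch of how a local surprisal spike at a corruption site could be located. It is not the authors' protocol: the function names, the assumption that the clean and corrupted chains are token-aligned (same length), and the toy log-probabilities are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def surprisal_bits(token_logprobs: Sequence[float]) -> List[float]:
    """Convert natural-log token probabilities into surprisal in bits."""
    return [-lp / math.log(2) for lp in token_logprobs]


def surprisal_delta(clean: Sequence[float], corrupted: Sequence[float]) -> List[float]:
    """Per-position surprisal increase of the corrupted chain over the clean one.

    Assumes (hypothetically) that both chains are token-aligned, i.e. the
    corruption replaces tokens in place without changing the length.
    """
    if len(clean) != len(corrupted):
        raise ValueError("chains must be token-aligned for a per-position comparison")
    clean_bits = surprisal_bits(clean)
    corrupted_bits = surprisal_bits(corrupted)
    return [c - k for k, c in zip(clean_bits, corrupted_bits)]


def peak_position(delta: Sequence[float]) -> int:
    """Index of the largest surprisal increase (first index on ties)."""
    return max(range(len(delta)), key=delta.__getitem__)


# Toy, made-up log-probabilities: position 2 is the hypothetical corruption site.
clean_logprobs = [-0.1, -0.2, -0.1, -0.3]
corrupted_logprobs = [-0.1, -0.2, -2.5, -0.4]

delta = surprisal_delta(clean_logprobs, corrupted_logprobs)
site = peak_position(delta)
```

A probe of this shape reads the model's reaction at each token of the chain, so it can register a corrupted step even when the visible final answer is unchanged, which a score on the final boxed answer alone cannot do.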