Position: RL Should Be Used to Adjust Foundation Models, NOT Abused
Abstract
This position paper argues that reinforcement learning (RL) should be used to adjust foundation models after pretraining and cold-start supervision, not abused as a default recipe for capability creation or early-stage training. We view RL as a high-cost, high-leverage post-training operator that reallocates probability mass toward behaviors a model can already express, but that rarely creates new reasoning capacities from scratch in a compute-efficient, stable, and controllable way. This distinction matters now because “RL-zero” narratives risk normalizing expensive and brittle RL-first pipelines as the primary path to reasoning, even though practice increasingly shows that cold-start supervision is a prerequisite for reliable RL and that RL is most effective as targeted refinement. Across modalities and domains, we highlight a recurring regularity: supervision establishes usable reasoning structure, while RL mainly sharpens correctness, consistency, and constraint satisfaction, especially under hard constraints or distribution shift. We further argue for reward minimalism: simple, verifiable rewards often suffice and reduce proxy-driven failure modes relative to over-engineered reward models. Finally, we discuss how self-supervised RL can support self-evolution when grounded in verifiable signals and structured interaction environments. Together, these arguments motivate treating RL as a disciplined adjustment stage with explicit entry criteria and compute-accountable evaluation.