RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
Zichun Yu ⋅ Chenyan Xiong
Abstract
High-quality data is a cornerstone of large language model (LLM) pretraining, yet its growth has not kept pace with the needs of frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one *quality* reward and three *faithfulness* rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiments, we train a rephraser as small as 1B parameters to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M, 1.4B, and 2.8B models demonstrate that RePro delivers 3.7\%-14.5\% relative accuracy gains over the organic-only baseline on 22 downstream tasks, doubling the performance gains achieved by the state-of-the-art web recycling method that prompts a 70B rephraser. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3$\times$. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively recycling organic data for pretraining. Our anonymized code is available at https://anonymous.4open.science/r/RePro.
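The abstract describes an RL objective that combines one quality reward with three faithfulness rewards. A minimal sketch of such a composite reward is shown below; the function name, the equal weighting, and the linear combination are illustrative assumptions, not the paper's actual formulation.

```python
def combined_reward(quality, faithfulness, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine one quality score with three faithfulness scores into a
    single scalar reward for the rephraser's RL update.

    NOTE: the linear combination and equal default weights are
    hypothetical; the paper's specific reward design may differ.
    """
    assert len(faithfulness) == 3, "RePro uses three faithfulness rewards"
    terms = [quality, *faithfulness]
    # Weighted sum: reward = w_q * r_q + sum_i w_i * r_faith_i
    return sum(w * r for w, r in zip(weights, terms))
```

With equal weights, a rephrasing scoring 0.8 on quality and (0.5, 0.6, 0.7) on the three faithfulness rewards would receive a total reward of 2.6 under this sketch.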