Training AI Co-Scientists Using Rubric Rewards
Abstract
AI co-scientists are emerging as useful tools for human researchers, and a crucial capability for them is proposing a research plan for a given research goal. In this work, we study how to train language models that generate better research plans by leveraging the vast corpus of existing research papers. To collect diverse training data, we automatically extract research goals and goal-specific grading rubrics from papers across domains. We then train models for research plan generation via reinforcement learning, with a frozen copy of the initial policy acting as the grader that uses the rubrics to evaluate plans generated by the training policy. To validate this approach, we conduct a human study on machine learning research goals spanning 225 expert hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over those of the initial model for 70% of goals, and over Grok-4-Thinking for 59.6% of goals. To assess generality, we also extend our approach to goals from medical papers and recent arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and substantial cross-domain generalization, proving effective even in settings like medical research where execution feedback is infeasible. Overall, we demonstrate the potential of a scalable training recipe as a step towards improving general AI co-scientists.
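
To make the rubric-reward mechanism described above concrete, the following is a minimal sketch of how a rubric-based reward might be computed. This is a hypothetical illustration, not the paper's implementation: names such as RubricCriterion and rubric_reward are assumptions, and the frozen grader (a copy of the initial policy) is abstracted as a callable mapping a grading prompt to a score in [0, 1].

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricCriterion:
    """One goal-specific grading criterion extracted from a paper (hypothetical)."""
    description: str   # e.g. "The plan specifies a concrete evaluation metric."
    weight: float = 1.0


def rubric_reward(
    research_goal: str,
    plan: str,
    rubric: List[RubricCriterion],
    grade: Callable[[str], float],  # frozen grader: grading prompt -> score in [0, 1]
) -> float:
    """Score a generated research plan against each rubric criterion and
    aggregate the per-criterion scores into a single scalar RL reward."""
    total, weight_sum = 0.0, 0.0
    for criterion in rubric:
        prompt = (
            f"Research goal:\n{research_goal}\n\n"
            f"Proposed plan:\n{plan}\n\n"
            f"Criterion: {criterion.description}\n"
            "On a scale from 0 to 1, how well does the plan satisfy this criterion?"
        )
        total += criterion.weight * grade(prompt)
        weight_sum += criterion.weight
    return total / weight_sum if weight_sum else 0.0
```

The resulting scalar can then serve as the reward in a standard policy-gradient training loop; keeping the grader frozen (as the abstract describes) ensures the reward signal stays stationary while the training policy improves.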