PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Abstract
Given the rapid recent progress of LLM agents such as Claude Code and Codex CLI on software engineering, an important next question is whether they can automate AI research itself. In this paper, we study post-training, the critical step that turns base LLMs into useful assistants. We introduce PostTrainBench, a benchmark measuring how well LLM agents can perform post-training autonomously under a bounded compute budget (10 hours on one H100 GPU). We task frontier agents (e.g., Claude Code with Opus 4.5) with optimizing the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide the agents with any predefined strategies and instead give them full autonomy to find the necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 21.5% for the best agent vs. 51.1% for the official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also document concerning behaviors, including reward hacking (such as training on test data or downloading pre-existing instruction-tuned models) and unauthorized use of API keys for synthetic data generation. Overall, we expect PostTrainBench to serve as an important benchmark for tracking both the capabilities and the risks of AI R&D automation.