daVinci-Dev: Agent-native Mid-training for Software Engineering
Abstract
Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering, a paradigm in which models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, agentic mid-training, that is, mid-training (MT) on large-scale data that mirrors authentic agentic workflows, remains critically underexplored because of its substantial resource requirements, even though it offers a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and the training methodology for effective agent development at scale. Central to our approach is agent-native data: supervision comprising two complementary types of trajectories. Contextually-native trajectories preserve the complete information flow an agent experiences, offering broad coverage and diversity; environmentally-native trajectories are collected from executable repositories, where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model’s agentic capabilities on SWE-Bench Verified. With an aligned base model and agentic scaffold, our recipe outperforms the previous open software engineering mid-training recipe, Kimi-Dev, under two post-training settings while using less than half the mid-training tokens (73.1B).
In addition to this relative advantage, our best-performing 32B and 72B models achieve resolution rates of 56.1% and 58.5%, respectively, which are state-of-the-art among open training recipes using agentic scaffolds at their respective model sizes, despite starting from non-coder Qwen2.5-Base models. Beyond these agentic capabilities, we also observe performance gains on general code generation and scientific benchmarks. We plan to open-source a significant portion of our datasets, recipes, and model checkpoints (resources that represent substantial computational investment and are typically unavailable to the broader community) to facilitate further research in this underexplored paradigm.