DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
Abstract
Recent work increasingly synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized training tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts the synthesis order: it first executes diverse real-world tools and then reverse-derives tasks strictly entailed by the resulting traces, providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, synthesizing 48k trajectories over 374 tools across five domains that cover 46,398 unique toolsets and 39,810 unique tool-call graphs. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves average performance by +22 points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68%. Under a fixed budget, controlled scaling shows that diversity scaling consistently outperforms quantity scaling, even when using 4× less data.
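The inverted synthesis order can be sketched as follows. This is a minimal illustrative mock-up, not the paper's implementation: the function names, the toy tool pool, and the stubbed execution and derivation logic are all assumptions for exposition. The key structural point it shows is that a toolset is sampled and executed first, and the task is only derived afterwards from the recorded trace.

```python
import random

def execute_toolset(toolset):
    """Run each tool and record a trace of calls and results (stubbed here)."""
    trace = []
    for tool in toolset:
        result = f"result-of-{tool}"      # stand-in for a real tool execution
        trace.append({"tool": tool, "result": result})
    return trace

def reverse_derive_task(trace):
    """Derive a task description strictly entailed by the trace (stubbed)."""
    steps = " then ".join(step["tool"] for step in trace)
    return f"Task requiring: {steps}"

def synthesize(tool_pool, toolset_size, rng):
    # Two diversity axes: which tools the pool covers, and which subset
    # each task draws (per-task toolset variety).
    toolset = rng.sample(tool_pool, toolset_size)
    trace = execute_toolset(toolset)       # execute FIRST ...
    task = reverse_derive_task(trace)      # ... derive the task SECOND
    return {"task": task, "toolset": toolset, "trace": trace}

rng = random.Random(0)
pool = ["search", "calculator", "calendar", "weather", "translate"]
example = synthesize(pool, toolset_size=3, rng=rng)
```

Because the task is generated from an already-executed trace, every synthesized task is executable and verifiable by construction, which is the grounding property the abstract refers to.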