D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training
Abstract
Training data plays a central role in large language model (LLM) optimization, motivating extensive research on data scheduling strategies. Most prior work focuses on data selection and implicitly assumes that, once the training subset is fixed, the order in which data are presented is interchangeable. However, this assumption is routinely violated in practice. Despite empirical evidence of order sensitivity, existing studies neither provide a principled explanation of the underlying optimization dynamics nor offer an efficient solution. In this work, we first answer the fundamental question of why training order matters in LLM optimization. We then demonstrate that commonly used empirical data-ordering heuristics are suboptimal from an optimization perspective. To resolve this, we propose D$^3$, a data scheduling framework grounded in gradient interactions between samples, in which training dependencies are modeled as a directed graph that explicitly constrains valid training orders. Our approach is theoretically motivated and yields consistent empirical improvements over existing data scheduling methods across multiple settings.
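The abstract describes the framework only at a high level. As a purely illustrative sketch (not the paper's actual algorithm), the following Python snippet shows one way a gradient-interaction dependency graph could be built and turned into a valid training order: per-sample gradients are compared by cosine similarity, an edge $i \to j$ is added under a hypothetical "train $i$ before $j$" rule, and a schedule consistent with the graph is obtained by topological sort. The threshold `tau`, the edge criterion, and all function names here are assumptions made for illustration.

```python
import numpy as np
from collections import deque

def build_dependency_graph(grads: np.ndarray, tau: float = 0.2) -> dict[int, set[int]]:
    """Build a directed dependency graph over samples from gradient interactions.

    grads: (n_samples, n_params) matrix of per-sample gradients.
    Hypothetical rule (an assumption, not the paper's definition): add edge
    i -> j ("train i before j") when the two gradients align (cosine
    similarity > tau) and sample i has the larger gradient norm (ties broken
    by index), which keeps the graph acyclic.
    """
    n = len(grads)
    norms = np.linalg.norm(grads, axis=1)
    unit = grads / np.clip(norms, 1e-12, None)[:, None]
    cos = unit @ unit.T  # pairwise cosine similarities between sample gradients
    edges: dict[int, set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i == j or cos[i, j] <= tau:
                continue
            if norms[i] > norms[j] or (norms[i] == norms[j] and i < j):
                edges[i].add(j)  # i must precede j in the schedule
    return edges

def topological_order(edges: dict[int, set[int]]) -> list[int]:
    """Kahn's algorithm: return a training order consistent with the graph."""
    indeg = {u: 0 for u in edges}
    for u in edges:
        for v in edges[u]:
            indeg[v] += 1
    queue = deque(sorted(u for u, d in indeg.items() if d == 0))
    order: list[int] = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(edges[u]):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

# Toy usage: four samples with 3-dimensional "gradients".
grads = np.array([[1.0, 0.2, 0.0],
                  [0.9, 0.3, 0.1],
                  [-0.5, 1.0, 0.0],
                  [0.1, 0.1, 1.0]])
schedule = topological_order(build_dependency_graph(grads))
print(schedule)  # prints [0, 2, 3, 1] for these toy gradients
```

Any topological order of the resulting graph respects the precedence constraints; the actual D$^3$ method presumably defines the edges differently and updates the graph dynamically during training, which this toy sketch does not attempt to reproduce.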