Position: Creating High-Fidelity Synthetic Training Data Should Employ Multi-level Optimization
Abstract
The reliance of machine learning (ML) models on large-scale, high-quality labeled training data poses significant challenges in specialized domains where such data is expensive and difficult to obtain. A promising solution is the automatic creation of synthetic training data. However, current approaches, including data generation, automated annotation, and domain adaptation, often fail to explicitly use downstream model performance to guide the creation and refinement of synthetic training data. This position paper argues that multi-level optimization (MLO) is essential for producing high-fidelity synthetic data by enabling joint optimization of data generation, annotation, adaptation, and selection, all informed by downstream model performance. We advocate for MLO as a unified framework to address three critical challenges: (1) improving data generation by aligning synthetic data with model needs, particularly targeting class-specific deficiencies and worst-case robustness; (2) enhancing automated annotation through sequential verification and the use of large language models for more accurate labeling; and (3) enabling example-specific adaptation and selection to maximize data utility while preventing over-adaptation. By facilitating end-to-end coordination across multiple learning stages, MLO offers a potential paradigm shift in synthetic data creation for data-scarce domains.
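To make the core idea concrete, the "joint optimization informed by downstream model performance" advocated above can be illustrated as a bilevel program; the notation here is an illustrative sketch (the symbols phi, w, and the loss functions are assumptions, not taken from the paper), and a full MLO instantiation would stack further levels for annotation and selection.

```latex
% Illustrative bilevel formulation of performance-guided synthetic data creation.
% phi: parameters of the synthetic data generator; w: downstream model weights.
% The outer level tunes the generator against validation performance of the
% model trained (inner level) on the synthetic data it produces.
\begin{align}
  \min_{\phi} \quad & \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(\phi)\bigr) \\
  \text{s.t.} \quad & w^{*}(\phi) = \argmin_{w} \;
      \mathcal{L}_{\mathrm{train}}\bigl(w;\, \mathcal{D}_{\mathrm{syn}}(\phi)\bigr)
\end{align}
```

In this hypothetical formulation, gradients of the validation loss flow back through the inner training solution to the data-creation parameters, which is precisely the end-to-end coordination the abstract argues for; additional levels (e.g., an annotation verifier or an example-selection policy) would be nested between the two shown here.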