Corpus Composition Matters: Mixing Synthetic and Real Data for Time-Series Foundation Model Pretraining
Abstract
Synthetic data is increasingly used to pretrain time-series foundation models, but it remains unclear how synthetic generators should be chosen and how synthetic data should be combined with real data. We study this as a corpus composition problem. We first audit 11 synthetic generators spanning 11.2B time points and use a lightweight feature-space diagnostic to select three representative generator regimes for downstream testing. We then train Chronos-T5-Mini and Moirai-Small from scratch on representative single-generator corpora, an equal mixture of all 11 generators, a real reference corpus, and real-synthetic mixtures at real/synthetic window ratios of 75/25, 50/50, and 25/75. Evaluated zero-shot on GIFT-Eval, a multi-domain forecasting benchmark, both architectures show a clear composition effect: representative single-generator corpora are brittle, whereas multi-generator mixtures are stronger synthetic-only baselines. Real-synthetic mixtures improve over synthetic-only training for both architectures. For Moirai-Small, they outperform both pure real and pure synthetic training; for Chronos-T5-Mini, they remain competitive with the real-only reference. Synthetic pretraining should therefore be treated as a corpus design problem, not as a question of whether synthetic data can replace real data.
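To make the notion of a real/synthetic window ratio concrete, the following is a minimal sketch of how a mixed corpus at a given ratio could be assembled. It is an illustration under stated assumptions, not the pipeline used in this work: the function name `mix_windows`, sampling with replacement, and the fixed seed are all hypothetical choices, and the windows are placeholder tuples rather than actual time series.

```python
import random


def mix_windows(real, synthetic, real_fraction, total, seed=0):
    """Draw `total` training windows, a `real_fraction` share from `real`.

    Hypothetical helper for illustration only. Windows are sampled with
    replacement, which is an assumption, not a documented design choice.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_real = round(total * real_fraction)
    n_synthetic = total - n_real
    mixed = []
    if n_real > 0:
        mixed += rng.choices(real, k=n_real)
    if n_synthetic > 0:
        mixed += rng.choices(synthetic, k=n_synthetic)
    # Shuffle so real and synthetic windows are interleaved in training order.
    rng.shuffle(mixed)
    return mixed


# Placeholder windows tagged by source; a real corpus would hold arrays.
real_windows = [("real", i) for i in range(50)]
synthetic_windows = [("synthetic", i) for i in range(50)]

for real_fraction in (0.75, 0.50, 0.25):
    corpus = mix_windows(real_windows, synthetic_windows, real_fraction, total=100)
    n_real = sum(1 for source, _ in corpus if source == "real")
    print(f"{real_fraction:.2f} real -> {n_real} real, {len(corpus) - n_real} synthetic")
```

Fixing the ratio at the window level, as in this sketch, keeps the share of training examples from each source constant regardless of how long the underlying series are.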