Position: Time-Series Foundation Models Require Explicit Domain-Level Benchmarks
Abstract
Time series foundation models (TSFMs) have demonstrated strong performance on established benchmarks such as GIFT-Eval, Monash, and TSFM-Bench. However, these benchmarks pool datasets from many domains with uneven representation, which can obscure performance within specific application areas such as healthcare, finance, nature, retail, and transport. The need for domain-specific evaluation arises from the inherent structural diversity of time series data: clinical records often feature irregular sampling and informative missingness; financial sequences are characterized by high noise and stochastic trajectories; and environmental data, such as energy and weather, are governed by deterministic physical laws and strong seasonal hierarchies. Motivated by this heterogeneity, we argue that TSFMs require explicit domain-specific benchmarks so that practitioners can reliably assess a model's utility within their own application area. Cross-domain differences in data-generating processes, sampling irregularity, and nonstationarity under concept drift fundamentally shape forecasting difficulty and failure modes, so strong performance on aggregated leaderboards may not translate into reliable deployment in a specific domain. To test this, we evaluate seven TSFMs across 72 datasets spanning six domains (healthcare, finance, energy, nature, transport, and retail) and find substantial cross-domain variability in performance. These findings confirm that global benchmark scores can be misleading and that domain-aware evaluation is essential for trustworthy TSFM selection.