Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting
Wanjin Feng ⋅ Yuan Yuan ⋅ Ding ⋅ Yong Li
Abstract
In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, standard evaluations rely on aggregate metrics (e.g., MSE) that conflate model capability with the intrinsic difficulty of the evaluated instances. To address this, we propose a diagnostic framework anchored in **Spectral Coherence Predictability (SCP)**, which provides an efficient $\mathcal{O}(N\log N)$ per-instance difficulty reference and yields a corresponding linear MSE lower bound. Complementing this, we introduce the **Linear Utilization Ratio (LUR)** to quantify how effectively models exploit linearly predictable structure across frequencies. Experiments on synthetic and real-world benchmarks show that SCP aligns strongly with the realized forecasting errors of diverse state-of-the-art forecasters. Through this lens, we uncover "predictability drift": task difficulty is not static but fluctuates substantially over time and across variables. Furthermore, stratified evaluation exposes complementary architectural strengths across distinct frequency bands and difficulty regimes. Overall, we advocate moving beyond leaderboard-style ranking toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior. Code and data are available at https://anonymous.4open.science/r/TS_Predictability-C8B7.
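To make the idea concrete, the sketch below illustrates an FFT-based, per-instance predictability score in the spirit of SCP: it weights the magnitude-squared coherence between a history window and its forecast horizon by the history's spectral power, giving a value in $[0, 1]$ at $\mathcal{O}(N\log N)$ cost. The function name `scp_like_score`, the `nperseg` segmentation, and the power-weighted averaging are illustrative assumptions; the paper's exact SCP definition and its linear MSE lower bound are not reproduced here.

```python
# Hypothetical sketch of an SCP-like per-instance predictability score.
# Assumption: predictability is proxied by power-weighted magnitude-squared
# coherence between the history window and the horizon it should predict.
import numpy as np
from scipy.signal import coherence, welch

def scp_like_score(history: np.ndarray, future: np.ndarray, nperseg: int = 64) -> float:
    """Return a score in [0, 1]; higher values suggest an easier instance."""
    n = min(len(history), len(future))
    x, y = history[-n:], future[:n]
    seg = min(nperseg, n)
    _, cxy = coherence(x, y, nperseg=seg)   # FFT-based, O(N log N)
    _, pxx = welch(x, nperseg=seg)          # spectral power of the history
    weights = pxx / pxx.sum()               # emphasize frequencies carrying energy
    return float(np.sum(weights * cxy))

# Usage: a noisy sinusoid should score markedly higher than white noise.
rng = np.random.default_rng(0)
t = np.arange(512)
sine = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
noise = rng.standard_normal(t.size)
print(scp_like_score(sine[:256], sine[256:]))    # high: dominant tone is coherent
print(scp_like_score(noise[:256], noise[256:]))  # low: little linear structure
```

A score like this could then stratify test instances by difficulty before aggregating errors, which is the kind of predictability-aware evaluation the abstract advocates.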