VFMF: Dense Forecasting by Generating Foundation Model Features
Abstract
Forecasting by generating RGB videos is computationally expensive, often physically implausible, and not directly actionable, since the output must first be translated into decision-making signals. Direct modality forecasting (e.g., predicting future segmentation) produces directly actionable outputs but fails to scale because it requires labels. Vision Foundation Model (VFM) features offer the best of both worlds: they contain actionable semantic and geometric information that can be easily decoded from the predicted features, while requiring no downstream-task labels for training. However, almost all existing VFM feature forecasting methods regress future features from a fixed number of input frames, with evaluation predominantly on short horizons matching the training setup. We first show that existing regression methods struggle to forecast from partial observations because they average over multiple plausible futures, failing to capture the uncertainty of the future given the past. Interestingly, naively replacing deterministic forecasting with generative flow matching does not match the sample quality of the regression model, despite being a mathematically appropriate formulation of the forecasting task. In this work, we explain why this is the case and show how to optimally generate foundation model features. Our key insight is that generative modeling of VFM features requires (auto)encoding into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used alternatives, such as uncompressed feature diffusion or PCA-based compression, both for forecasting and for other applications, such as image generation. Our results suggest that conditional generation of (compressed) VFM features offers a promising and scalable foundation for future scene forecasters.