From Frames to Stories (F2S): Toward Reliable, Controllable and Trustworthy Long-Horizon Video Generation
Abstract
Video generation has advanced rapidly for short clips, but minutes-long, multi-shot generation remains unreliable due to compounding errors, identity drift, and weak long-range coherence. Long-horizon video therefore provides a demanding testbed for long-context multimodal modeling, inference-time computation, and interactive generation, aligning closely with core ICML interests. We focus on three bottlenecks: (i) persistent state representation: what to store, and how to compress it, so that identities, scene dynamics, and narrative facts remain consistent; (ii) interactive control: steering future states with rich, compositional signals (shot plans, localized edits, multimodal constraints, actions) over long horizons; and (iii) trustworthy evaluation: minutes-scale protocols that measure consistency and control adherence in a reproducible, hard-to-game way. We highlight research directions where conceptual and methodological advances, rather than model scale alone, drive progress under realistic academic compute budgets. The program combines invited talks, contributed spotlights, posters/demos, a panel, and breakout groups on open problems with report-back.