Meta Flow Maps enable scalable reward alignment
Peter Potaptchik ⋅ Adhi Saravanan ⋅ Abbas Mammadov ⋅ Alvaro Prat ⋅ Michael Albergo ⋅ Yee-Whye Teh
Abstract
Controlling generative models, whether via inference-time steering or fine-tuning, is expensive: control relies on estimating the value function, which typically necessitates costly trajectory simulations. To eliminate this bottleneck, we introduce *Meta Flow Maps (MFMs)*, stochastic extensions of consistency models and flow maps. MFMs are trained to perform **one-step posterior sampling**, generating arbitrarily many i.i.d. draws of clean data $x_1$ from any noisy state $x_t$. Crucially, these samples are differentiable in the conditioning state $x_t$, unlocking efficient estimation of the value-function gradient. We leverage this capability to enable both **inference-time steering** without inner rollouts and unbiased, off-policy **fine-tuning** to general rewards. In fine-tuning and steering experiments on ImageNet, our single-particle steered-MFM sampler outperforms a Best-of-1000 baseline across multiple rewards at a fraction of the compute.
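To make the core mechanism concrete, here is a minimal JAX sketch of how differentiable one-step posterior samples can yield a value-function gradient without inner rollouts. Everything here is a hypothetical stand-in, not the paper's actual API: `mfm_sample` plays the role of a trained MFM (a toy network substitutes for the real model), the value is taken as $V(x_t) = \mathbb{E}[r(x_1) \mid x_t]$, and `reward` is an arbitrary toy reward.

```python
# Sketch (assumptions): `mfm_sample` and `reward` are hypothetical stand-ins.
# The key idea from the abstract: x_1 is a deterministic, differentiable
# function of the conditioning state x_t (reparameterized via fresh noise eps),
# so grad of a Monte Carlo value estimate w.r.t. x_t needs no trajectory rollouts.
import jax
import jax.numpy as jnp

def mfm_sample(params, x_t, t, eps):
    # Hypothetical MFM: maps a noisy state x_t and fresh noise eps to a clean
    # sample x_1 ~ p(x_1 | x_t) in one network evaluation. A tiny toy network
    # stands in for the trained model here.
    h = jnp.tanh(params["W1"] @ jnp.concatenate([x_t, eps]) + t * params["b"])
    return params["W2"] @ h

def value(params, x_t, t, key, reward, n_samples=16):
    # Monte Carlo estimate of V(x_t) = E[r(x_1) | x_t] from i.i.d. one-step
    # posterior samples. Because each sample is differentiable in x_t, the
    # gradient of this estimator is an unbiased value-gradient estimate.
    eps = jax.random.normal(key, (n_samples, x_t.shape[0]))
    x_1 = jax.vmap(lambda e: mfm_sample(params, x_t, t, e))(eps)
    return jnp.mean(jax.vmap(reward)(x_1))

# Gradient of the value w.r.t. the conditioning state x_t: the quantity a
# steered sampler would use at each step, obtained with no inner rollouts.
value_grad = jax.grad(value, argnums=1)

if __name__ == "__main__":
    d = 8
    k0, k1, k2 = jax.random.split(jax.random.PRNGKey(0), 3)
    params = {
        "W1": jax.random.normal(k0, (d, 2 * d)) / jnp.sqrt(2 * d),
        "b": jnp.ones(d),
        "W2": jnp.eye(d),
    }
    reward = lambda x: -jnp.sum(x**2)      # toy reward
    x_t = jax.random.normal(k1, (d,))
    g = value_grad(params, x_t, 0.5, k2, reward)
    x_t_steered = x_t + 0.1 * g            # one hypothetical guidance step
    print(g.shape, x_t_steered.shape)
```

A guidance scheme built on this estimator would drift $x_t$ along $\nabla_{x_t} V(x_t)$ during sampling; depending on the steering formulation, a soft value such as $\log \mathbb{E}[e^{r(x_1)} \mid x_t]$ could replace the plain expectation with the same reparameterization trick.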