Poster in Workshop: RLxF: RL from World Feedback
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

Learning Diffusion Planners from World Feedback: A No-Go Result on Bit-Exact Safety Rewards and an ODD-Adaptive Shared/Expert Decomposition

Yun Li ⋅ Ehsan Javanmardi ⋅ Yidu Zhang ⋅ Simon Thompson ⋅ Qunli Zhang ⋅ Zifan Zeng ⋅ Shiming Liu ⋅ Peng Wang ⋅ Zixuan Guo ⋅ Manabu Tsukada

Abstract

We fine-tune diffusion-based trajectory planners using closed-loop safety, efficiency, comfort, and real-time deployment outcomes as reward, learning from world feedback rather than human preferences. Across $8$ operational design domains (ODDs) in two independent stacks (Autoware and nuPlan, with different sensors and planner backbones), world-grounded fine-tuning yields closed-loop gains that stay within seed-to-seed noise but reproducibly exposes a Shared-Expert LoRA (SE-LoRA) structure: the reward gradient splits into a cross-ODD aligned subspace (geometric priors) and ODD-specific subspaces (denoising dynamics). Three independent signatures support the split: a Singapore$\leftrightarrow$Pittsburgh double dissociation under causal activation patching ($\rho = -0.96$), a $10\times$ gradient-cosine gap between components, and a cross-ODD PCA spectral gap that predicts a $1{:}2$ shared-to-expert rank ratio, confirmed by rank-allocation sweeps on both stacks. The same lens also produces a no-go result relevant to any world-grounded reward with hard safety thresholds: PCDR (Per-Closed-Loop Differentiable Reward), to our knowledge the first bit-exact differentiable mirror of nuPlan closed-loop scoring, exposes a forward-fidelity vs. gradient-reach trade-off for $0/1$ safety rewards, motivating a constrained-recovery design (smooth CLS proxy + bit-exact PCDR-derived margin penalty). We release PCDR, SE-LoRA, and anonymized real-vehicle deployment data (uncommon in the diffusion-planning literature) covering both simulation and a production ROS 2 stack. World-grounded rewards reveal structural decompositions that BC-only training hides and impose hard limits on $0/1$ safety signals; the spectral-gap recipe predicts LoRA ranks without a sweep, and PCDR provides a standalone audit of reward$\leftrightarrow$metric alignment.
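The forward-fidelity vs. gradient-reach trade-off for $0/1$ rewards can be illustrated with a toy example. The sketch below is not PCDR and does not use nuPlan scoring; it assumes a single scalar safety margin, a hard threshold at zero, and a sigmoid relaxation, and all function names and the `temperature` parameter are hypothetical. It only shows the general effect: a sharper relaxation matches the $0/1$ value more closely but passes almost no gradient to samples away from the threshold.

```python
import math


def hard_reward(margin: float) -> float:
    """Toy 0/1 safety reward: 1.0 if the margin is non-negative, else 0.0."""
    return 1.0 if margin >= 0.0 else 0.0


def smooth_reward(margin: float, temperature: float) -> float:
    """Sigmoid relaxation of the 0/1 reward (assumed form, for illustration)."""
    return 1.0 / (1.0 + math.exp(-margin / temperature))


def smooth_reward_grad(margin: float, temperature: float) -> float:
    """Analytic derivative of the sigmoid relaxation w.r.t. the margin."""
    s = smooth_reward(margin, temperature)
    return s * (1.0 - s) / temperature


def tradeoff(margin: float, temperature: float) -> tuple[float, float]:
    """Return (forward error vs. the hard reward, gradient magnitude)."""
    forward_error = abs(smooth_reward(margin, temperature) - hard_reward(margin))
    return forward_error, abs(smooth_reward_grad(margin, temperature))


if __name__ == "__main__":
    margin = 2.0  # a sample two units on the safe side of the threshold
    for temperature in (0.05, 0.5, 2.0):
        err, grad = tradeoff(margin, temperature)
        print(f"temperature={temperature}: forward_error={err:.3g}, grad={grad:.3g}")
```

In this toy setting the hard reward itself has zero derivative everywhere except at the threshold, so it matches the metric exactly but provides no learning signal; lowering `temperature` moves the relaxation toward that regime.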