HIER: Human-in-the-Loop Imagination–Execution Refinement for General Real-World Vision-Language-Action Models
Abstract
Supervised fine-tuning (SFT) is a dominant post-training strategy for vision-language-action (VLA) models, but its reliance on scarce expert demonstrations limits scalability and generalization. We propose HIER, a plug-and-play four-stage refinement framework that improves real-world VLA policies from minimal demonstrations by combining world-model imagination with human-in-the-loop correction. HIER first warm-starts a VLA policy and a pretrained world model on a few demonstrations, then splits the policy into a deployment branch and an exploration branch. The deployment branch interacts with the world model to generate imagined rollouts, which are used to fine-tune the exploration branch for autonomous execution with occasional human interventions. The resulting corrected rollouts are preference-filtered and used to fine-tune the deployment branch, which serves as the final policy for real-world inference. Across multiple real-world manipulation tasks on a Franka arm, HIER achieves nearly 100% success from only a few demonstrations and improves success rates by more than 50% relative to SFT, while on some tasks it attains shorter episode lengths than the human demonstrations, indicating greater execution efficiency. Ablations further show that imagination-driven diversification and human correction are each crucial to the gains in exploration and self-recovery.
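The four-stage loop described above can be sketched in pseudocode. This is a minimal toy illustration of the control flow only (warm-start, branch split, imagined rollouts, human-corrected execution, preference-filtered fine-tuning); every class and function name here is an illustrative assumption, not the authors' actual API, and the scalar "skill" dynamics stand in for real policy learning.

```python
import random

class Policy:
    """Toy policy: a single scalar 'skill' that fine-tuning improves (assumption)."""
    def __init__(self, skill):
        self.skill = skill
    def copy(self):
        return Policy(self.skill)

class WorldModel:
    """Toy world model: an imagined rollout is just a reward near the policy's skill."""
    def rollout(self, policy, rng):
        return {"reward": policy.skill + rng.uniform(-0.1, 0.1)}

def finetune(policy, rollouts):
    # Fine-tuning nudges the policy toward the best observed rollout reward.
    best = max(r["reward"] for r in rollouts)
    return Policy(0.5 * policy.skill + 0.5 * best)

def human_correct(rollout):
    # Occasional human intervention lifts a poor rollout to an acceptable level.
    rollout["reward"] = max(rollout["reward"], 0.6)
    return rollout

def preference_filter(rollouts, keep=0.5):
    # Keep only the preferred (higher-reward) fraction of corrected rollouts.
    ranked = sorted(rollouts, key=lambda r: r["reward"], reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

def hier(demo_skill, world_model, rounds=3, k=8, seed=0):
    rng = random.Random(seed)
    # Stage 1: warm-start from a few demonstrations, then split branches.
    deploy = Policy(demo_skill)
    explore = deploy.copy()
    for _ in range(rounds):
        # Stage 2: the deployment branch imagines rollouts in the world model.
        imagined = [world_model.rollout(deploy, rng) for _ in range(k)]
        # Stage 3: imagined rollouts fine-tune the exploration branch, which
        # then executes with occasional human corrections.
        explore = finetune(explore, imagined)
        corrected = [human_correct(world_model.rollout(explore, rng))
                     for _ in range(k)]
        # Stage 4: preference-filtered corrections fine-tune the deployment
        # branch, which is the final policy used for real-world inference.
        deploy = finetune(deploy, preference_filter(corrected))
    return deploy

refined = hier(demo_skill=0.3, world_model=WorldModel())
```

In this toy setup, the refined deployment policy ends with a higher skill than the warm-started one, mirroring the abstract's claim that imagination plus correction improves on the SFT starting point.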