VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models
Abstract
Joint Embedding Predictive Architectures (JEPA) offer a scalable paradigm for self-supervised learning by predicting latent representations rather than reconstructing high-entropy observations. However, existing formulations rely on deterministic regression objectives, which mask probabilistic semantics and limit their applicability to stochastic control. We introduce \emph{Variational JEPA (VJEPA)}, a probabilistic generalization that learns a predictive distribution over future latent states via a variational objective. We show that VJEPA unifies representation learning with Predictive State Representations (PSRs) and Bayesian filtering, establishing that sequential modeling does not require autoregressive observation likelihoods. Theoretically, we prove that VJEPA representations serve as sufficient information states for optimal control without pixel reconstruction, and we provide formal guarantees against representation collapse. We further propose \emph{Bayesian JEPA (BJEPA)}, which extends the VJEPA framework to factorize the predictive belief into a learned dynamics expert and a modular prior expert, enabling zero-shot task transfer and constraint satisfaction (e.g., goals, physics) via a Product of Experts. Empirically, VJEPA filters out high-variance nuisance distractors that cause representation collapse in generative baselines. By enabling principled uncertainty estimation (e.g., constructing credible intervals via sampling) while remaining likelihood-free with respect to observations, VJEPA provides a foundational framework for scalable, robust, uncertainty-aware planning in high-dimensional, noisy environments.
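As a minimal illustration of the two constructions named above (using placeholder notation, since the abstract fixes none): writing $z_t = E_\theta(x_t)$ for the online encoding of observation $x_t$ and $\bar{z}_{t+1}$ for a target-encoder embedding of the next observation, a variational predictive objective of this kind could take the form
\[
\max_{\theta,\phi}\; \mathbb{E}\!\left[\log q_\phi\!\left(\bar{z}_{t+1} \mid z_t, a_t\right)\right],
\]
which scores only latent targets and thus involves no observation likelihood. The BJEPA belief would then factorize as a Product of Experts,
\[
b\!\left(z_{t+1} \mid z_t, a_t\right) \;\propto\; q_\phi\!\left(z_{t+1} \mid z_t, a_t\right)\, p_\psi\!\left(z_{t+1}\right),
\]
with $q_\phi$ the learned dynamics expert and $p_\psi$ a modular prior expert encoding, e.g., goals or physical constraints. For Gaussian experts $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ this product is again Gaussian, with precision $\Sigma^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1}$ and mean $\mu = \Sigma\,(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2)$, so belief combination remains closed-form. The symbols $E_\theta$, $q_\phi$, $p_\psi$, and $a_t$ here are illustrative assumptions, not notation from the paper.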