Maximum-Likelihood Learning of Latent Dynamics Without Reconstruction
Abstract
We address the challenge of uncovering systematic, and potentially controllable, dynamical structure underlying complex high-dimensional time series data. Existing generative and autoregressive models have difficulty distinguishing systematic content from distractors, while contrastive methods struggle to learn accurate dynamics. To address these shortcomings, we develop the Recognition-Parametrized Gaussian State Space Model (RP-GSSM), a probabilistic framework that infers accurate latent dynamics without relying on a parametrized decoder. By eliminating explicit generative parameters, the model devotes its entire representational capacity to encoding dynamically relevant state; and, being fully probabilistic, it learns via maximum likelihood without auxiliary objectives or ad-hoc regularization. Combining the expressive power of a neural network encoder with exact inference under a jointly Gaussian prior allows the RP-GSSM to embed a broad class of intrinsically nonlinear dynamical systems. We show that the RP-GSSM recovers physically meaningful latent states from noisy video more faithfully than competing methods, more reliably identifies underlying controllable nonlinear dynamics, and remains substantially more robust to visual distractors.