Recent advances in algorithmic design and principled, theory-driven deep learning architectures have sparked a growing interest in control and dynamical systems theory. In a complementary direction, machine learning plays an important role in enhancing existing control theory algorithms in terms of performance and scalability. The boundaries between the two disciplines are blurring even further with the rise of modern reinforcement learning, a field at the crossroads of data-driven control theory and machine learning. This workshop aims to unravel the mutual relationship between learning, control, and dynamical systems and to shed light on recent parallel developments in different communities. Strengthening the connection between learning and control will open new possibilities for interdisciplinary research areas.
Fri 12:00 p.m. - 12:45 p.m.

On optimal control and machine learning
(Tutorial)
This talk tours the optimal control and machine learning methodologies behind recent breakthroughs in the field. These are crucial components for building agents capable of computationally modeling and interacting with our world via planning and reasoning, e.g. for robotics, aircraft, autonomous vehicles, games, economics, finance, and language, as well as agricultural, biomedical, chemical, industrial, and mechanical systems. We will start with 1) a lightweight introduction to optimal control, and then cover 2) machine learning for optimal control, which includes reinforcement learning and an overview of how the powerful abstractive and predictive capabilities of machine learning can drastically improve every part of a control system; and 3) optimal control for machine learning: surprisingly, in this opposite direction, some machine learning problems can be formulated as control problems and solved with optimal control methods, e.g. parts of diffusion models, optimal transport, and optimizing the parameters of models such as large language models with reinforcement learning.
Brandon Amos
Fri 12:45 p.m. - 1:30 p.m.

Two-for-one: diffusion models and force fields for coarse-grained molecular dynamics
(Presentation)
In this talk I will cover work from the Microsoft Research AI4Science team on the use of score-based generative modeling for coarse-graining (CG) molecular dynamics simulations. By training a diffusion model on protein structures from molecular dynamics simulations, we show that its score function approximates a force field that can directly be used to simulate CG molecular dynamics. While having a vastly simplified training setup compared to previous work, we demonstrate that our approach leads to improved performance across several small to medium-sized protein simulations, reproducing the CG equilibrium distribution, and preserving dynamics of all-atom simulations such as protein folding events.
Rianne Van den Berg
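The link between a score function and a force field can be made explicit under one assumption that the abstract does not state: that the CG configurations $x$ follow a Boltzmann distribution with some potential $U$ at temperature $T$. The symbols $U$, $k_B$, $T$ and $F$ are introduced here only for illustration. Under that assumption, the score and the force differ by a constant factor:

```latex
p(x) \propto \exp\!\left(-\frac{U(x)}{k_B T}\right)
\quad\Longrightarrow\quad
\nabla_x \log p(x) \;=\; -\frac{\nabla_x U(x)}{k_B T} \;=\; \frac{F(x)}{k_B T}.
```

A score learned at a small noise level would then play the role of an approximate $F(x)/(k_B T)$; how the noise level is chosen in the presented work is not specified in the abstract.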
Fri 1:30 p.m. - 1:45 p.m.

Transport, VI, and Diffusions
(Presentation)
This paper explores the connections between optimal transport and variational inference, with a focus on forward and reverse time stochastic differential equations and Girsanov transformations. We present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of a novel score-based annealed flow technique and a regularised iterative proportional fitting (IPF)-type objective, departing from the sequential nature of standard IPF. Through a series of generative modelling examples and a double-well-based rare event task, we showcase the potential of the proposed methods.
Francisco Vargas · Nikolas Nüsken
Fri 1:45 p.m. - 2:30 p.m.

Imposing and learning structure in OT displacements through cost engineering
(Presentation)
In this talk I will highlight the flexibility provided by the Gangbo-McCann theorem, which provides a generic way to tie Kantorovich dual potential solutions to optimal maps for the Monge problem. We show in particular how setting the ground cost to the squared-Euclidean distance plus a regularizer induces displacements whose structure is well suited to that regularizer (e.g. sparse if that regularizer is the L1 norm). In more recent work, we propose an approach to learn the parameters of that regularizer.
Marco Cuturi
Fri 2:30 p.m. - 3:15 p.m.

Designing High-Dimensional Closed-Loop Optimal Control Using Deep Neural Networks
(Presentation)
Designing closed-loop optimal control for high-dimensional nonlinear systems remains a persistent challenge. Traditional methods, such as solving the Hamilton-Jacobi-Bellman equation, suffer from the curse of dimensionality. Recent studies introduced a promising supervised learning approach, akin to imitation learning, that uses deep neural networks to learn from open-loop optimal control solutions. In this talk, we'll explore this method, highlighting a limitation in its basic form: the distribution mismatch phenomenon, induced by controlled dynamics. To overcome this, we present an improved approach, the initial value problem enhanced sampling method. This method not only provides a theoretical edge over the basic version in the linear-quadratic regulator but also showcases substantial numerical improvement on various high-dimensional nonlinear problems, including the optimal reaching problem of a 7-DoF manipulator. Notably, our method also surpasses the Dataset Aggregation (DAGGER) algorithm, widely adopted in imitation learning, with significant theoretical and practical advantages.
Jiequn Han
Fri 4:45 p.m. - 5:30 p.m.

Safe Learning in Control
(Presentation)
In many applications of autonomy in robotics, guarantees that constraints are satisfied throughout the learning process are paramount. We present a controller synthesis technique based on the computation of reachable sets, using optimal control and game theory. Then, we present methods for combining reachability with learning-based methods, to enable performance improvement while maintaining safety, and to move towards safe robot control with learned models of the dynamics and the environment. We will discuss different interaction models with other agents. Finally, we will illustrate these safe learning methods on robotic platforms at Berkeley, discussing applications in automated airspace management and air taxi operations.
Claire Tomlin
Fri 5:30 p.m. - 5:45 p.m.

Bridging RL Theory and Practice with the Effective Horizon
(Presentation)
Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e., when it is optimal to act greedily with respect to the random policy's Q-function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pretrained exploration policy.
Cassidy Laidlaw · Stuart Russell · Anca Dragan
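The property described in this abstract (acting greedily with respect to the random policy's Q-function is optimal) can be checked directly on a small tabular MDP. The following is a minimal sketch under simplifying assumptions: a finite horizon, a known transition tensor `P` and reward matrix `R`, and hypothetical function names. It is not the BRIDGE code and does not compute the full effective horizon.

```python
import numpy as np

def q_values(P, R, horizon, policy):
    """Finite-horizon Q-values by backward induction.

    P: transition tensor of shape (S, A, S); R: rewards of shape (S, A).
    policy: "optimal" (max over next actions) or "random" (uniform average).
    Returns an array of shape (horizon, S, A).
    """
    num_states, num_actions = R.shape
    Q = np.zeros((horizon, num_states, num_actions))
    v_next = np.zeros(num_states)  # value after the final step is zero
    for t in reversed(range(horizon)):
        Q[t] = R + P @ v_next
        if policy == "optimal":
            v_next = Q[t].max(axis=1)
        else:
            v_next = Q[t].mean(axis=1)
    return Q

def greedy_on_random_is_optimal(P, R, horizon, tol=1e-9):
    """True if every action that is greedy for Q^random is also optimal."""
    q_rand = q_values(P, R, horizon, "random")
    q_opt = q_values(P, R, horizon, "optimal")
    greedy = q_rand >= q_rand.max(axis=2, keepdims=True) - tol
    optimal = q_opt >= q_opt.max(axis=2, keepdims=True) - tol
    return bool(np.all(optimal[greedy]))

# Toy 3-state, 2-action MDP (illustrative only); state 2 is absorbing.
P = np.zeros((3, 2, 3))
P[0, 0, 2] = 1.0   # state 0, action 0: small reward, then stop
P[0, 1, 1] = 1.0   # state 0, action 1: no reward now, move to state 1
P[1, :, 2] = 1.0
P[2, :, 2] = 1.0
# In R_hard a random policy wastes the payoff in state 1, which misleads greedy.
R_hard = np.array([[0.4, 0.0], [1.0, -1.0], [0.0, 0.0]])
R_easy = np.array([[0.4, 0.0], [1.0, 0.9], [0.0, 0.0]])

print(greedy_on_random_is_optimal(P, R_hard, horizon=2))
print(greedy_on_random_is_optimal(P, R_easy, horizon=2))
```

In the first reward table a random rollout from state 1 averages to zero, so the greedy action in state 0 is the myopic one; in the second, random rollouts already reveal the better branch.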
Fri 6:00 p.m. - 6:45 p.m.

Reinforcement Learning and Multi-Agent Reinforcement Learning
(Presentation)
Reinforcement learning (RL) has emerged as a powerful paradigm for enabling intelligent agents to solve sequential decision-making problems under uncertainty. It has witnessed remarkable successes in various domains, ranging from game-playing agents to autonomous systems. However, as real-world challenges become increasingly intricate and interconnected, there is a need to go beyond the single-agent framework. Multi-agent reinforcement learning (MARL) is an extension of RL that enables multiple agents to learn and interact, introducing a new dimension of complexity and sophistication. This talk delves into RL and MARL, exploring the foundational principles, recent advancements, and promising applications of these techniques. We begin by introducing the core concepts of RL. Building upon this foundation, we shift our focus to MARL, where multiple agents learn simultaneously, either cooperating or competing with each other. Then, we examine the challenges posed by MARL, including coordination, communication, and the exploration-exploitation dilemma.
Giorgia Ramponi
Fri 6:45 p.m. - 7:00 p.m.

Modeling Accurate Long Rollouts with Temporal Neural PDE Solvers
(Presentation)
Time-dependent partial differential equations (PDEs) are ubiquitous in science and engineering. Recently, mostly due to the high computational cost of traditional solution techniques, deep neural network based surrogates have gained increased interest. The practical utility of such neural PDE solvers relies on their ability to provide accurate, stable predictions over long time horizons, which is a notoriously hard problem. In this work, we present a large-scale analysis of common temporal rollout strategies, identifying the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the primary pitfall limiting stable, accurate rollout performance. Based on these insights, we draw inspiration from recent advances in diffusion models to introduce PDE-Refiner, a novel model class that enables more accurate modeling of all frequency components via a multistep refinement process. We validate PDE-Refiner on challenging benchmarks of complex fluid dynamics, demonstrating stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. Finally, PDE-Refiner's connection to diffusion models enables an accurate and efficient assessment of the model's predictive uncertainty, allowing us to estimate when the surrogate becomes inaccurate.
Phillip Lippe · Bastiaan Veeling · Paris Perdikaris · Richard E Turner · Johannes Brandstetter


Analyzing the Sample Complexity of Model-Free Opponent Shaping
(Poster)
In mixed-incentive multi-agent environments, methods developed for zero-sum games often yield collectively suboptimal results. Addressing this, opponent shaping (OS) strategies aim to actively guide the learning processes of other agents, empirically leading to enhanced individual and group performances. Early OS methods use higher-order derivatives to shape the learning of co-players, making them unsuitable for anticipating multiple learning steps ahead. Follow-up work, Model-free Opponent Shaping (MFOS), addresses the shortcomings of earlier OS methods by reframing the OS problem as a meta-game. In the meta-game, the meta-step corresponds to an episode of the "inner" game. The OS meta-state corresponds to the inner policies, while the meta-policy outputs an inner policy at each meta-step. Leveraging model-free optimization techniques, MFOS learns meta-policies that demonstrate long-horizon opponent shaping, e.g., by discovering a novel extortion strategy in the Iterated Prisoner's Dilemma (IPD). In contrast to early OS methods, there is little theoretical understanding of the MFOS framework. In this work, we derive the sample complexity bounds for the MFOS agents theoretically and empirically. To quantify the sample complexity, we adapt the $R_{max}$ algorithm, most prominently used to derive sample bounds for MDPs, as the meta-learner in the MFOS framework and derive an exponential sample complexity. Our theoretical results are empirically supported in the Matching Pennies environment.
Kitty Fung · Qizhen Zhang · Christopher Lu · Timon Willi · Jakob Foerster


A Best Arm Identification Approach for Stochastic Rising Bandits
(Poster)
Stochastic Rising Bandits (SRBs) model sequential decision-making problems in which the expected rewards of the available options increase every time they are selected. This setting captures a wide range of scenarios in which the available options are learning entities whose performance improves (in expectation) over time. While previous works addressed the regret minimization problem, this paper focuses on the fixed-budget Best Arm Identification (BAI) problem for SRBs. In this scenario, given a fixed budget of rounds, we are asked to provide a recommendation about the best option at the end of the identification process. We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE, which resorts to a UCB-like approach, and R-SR, which employs a successive reject procedure. Then, we prove that, with a sufficiently large budget, they provide guarantees on the probability of properly identifying the optimal option at the end of the learning process. Furthermore, we derive a lower bound on the error probability, matched by our R-SR (up to logarithmic factors), and illustrate how the need for a sufficiently large budget is unavoidable in the SRB setting. Finally, we numerically validate the proposed algorithms in both synthetic and real-world environments and compare them with the currently available BAI strategies.
Alessandro Montenegro · Marco Mussi · Francesco Trovò · Marcello Restelli · Alberto Maria Metelli
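For readers unfamiliar with successive-reject procedures, the sketch below shows the classical successive rejects scheme for fixed-budget best arm identification with stationary arms. It is not the rising-bandit variant proposed in this paper, whose estimators are not described in the abstract; the function name, the `pull` callback and the toy arm means are illustrative assumptions. The sketch assumes at least two arms and a budget larger than the number of arms.

```python
import math
import random

def successive_rejects(pull, num_arms, budget):
    """Classical successive rejects for fixed-budget best arm identification.

    pull(arm) returns one reward sample for the given arm index.
    Returns the index of the single arm left after num_arms - 1 rejections.
    """
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    sums = [0.0] * num_arms
    counts = [0] * num_arms
    prev_n = 0
    for phase in range(1, num_arms):
        # Cumulative number of pulls each surviving arm should have after this phase.
        n_k = math.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - phase)))
        for arm in active:
            for _ in range(n_k - prev_n):
                sums[arm] += pull(arm)
                counts[arm] += 1
        prev_n = n_k
        # Reject the active arm with the lowest empirical mean.
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
    return active[0]

# Illustrative usage with noisy stationary arms (means are made up).
rng = random.Random(0)
means = [0.1, 0.5, 0.9, 0.3]
recommended = successive_rejects(lambda a: rng.gauss(means[a], 0.1), len(means), budget=200)
print("recommended arm:", recommended)
```

The phase lengths grow as arms are eliminated, so the budget is concentrated on the arms that are hardest to tell apart.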


Tendiffpure: Tensorizing Diffusion Models for Purification
(Poster)
Diffusion models are effective purification methods in which noise or adversarial perturbations are removed using generative approaches before pre-existing classifiers perform classification. However, the efficiency of diffusion models is still a concern, and existing solutions are based on knowledge distillation, which can jeopardize generation quality because of the small number of generation steps. Hence, we propose Tendiffpure, a tensorized diffusion model that compresses diffusion models for purification. Unlike knowledge distillation methods, we directly compress the U-Nets used as backbones of diffusion models using tensor-train decomposition, which reduces the number of parameters and captures more spatial information in multi-dimensional data such as images. The space complexity is reduced from $\mathit{O}(N^2)$ to $\mathit{O}(NR^2)$ with $R\leq 4$. Experimental results show that Tendiffpure can more efficiently generate high-quality purified results and outperform the baseline purification methods on the CIFAR-10, Fashion-MNIST and MNIST datasets for two types of noise and one adversarial attack.
Zhou Derun · Mingyuan Bai · Qibin Zhao


Continuous Vector Quantile Regression
(Poster)
Vector quantile regression (VQR) estimates the conditional vector quantile function (CVQF), a fundamental quantity which fully represents the conditional distribution of $\mathbf{Y}|\mathbf{X}$. VQR is formulated as an optimal transport (OT) problem between a uniform $\mathbf{U}\sim\mu$ and the target $(\mathbf{X},\mathbf{Y})\sim\nu$, the solution of which is a unique transport map, co-monotonic with $\mathbf{U}$. Recently NL-VQR has been proposed to support the estimation of nonlinear CVQFs, together with fast solvers which enabled the use of this tool in practical applications. Despite its utility, the scalability and estimation quality of NL-VQR are limited due to a discretization of the OT problem onto a grid of quantile levels. We propose a novel continuous formulation and parametrization of VQR using partial input-convex neural networks (PICNNs). Our approach allows for accurate, scalable, differentiable and invertible estimation of nonlinear CVQFs. We further demonstrate, theoretically and experimentally, how continuous CVQFs can be used for general statistical inference tasks: estimation of likelihoods, CDFs, confidence sets, coverage, sampling, and more. This work is an important step towards unlocking the full potential of VQR.
Sanketh Vedula · Irene Tallini · Aviv A. Rosenberg · Marco Pegoraro · Emanuele Rodola · Yaniv Romano · Alexander Bronstein


Informed POMDP: Leveraging Additional Information in Model-Based RL
(Poster)
In this work, we generalize the problem of learning through interaction in a POMDP by accounting for additional information that may be available at training time. First, we introduce the informed POMDP, a new learning paradigm offering a clear distinction between the training information and the execution observation. Next, we propose an objective for learning sufficient statistics from the history for the optimal control that leverages this information. We then show that this informed objective consists of learning an environment model from which we can sample latent trajectories. Finally, we show for the Dreamer algorithm that the convergence speed of the policies is sometimes greatly improved on several environments by using this informed environment model. These results and the simplicity of the proposed adaptation advocate for a systematic consideration of such additional information when learning in a POMDP using model-based RL.
Gaspard Lambrechts · Adrien Bolland · Damien Ernst


Embedding Surfaces by Optimizing Neural Networks with Prescribed Riemannian Metric and Beyond
(Poster)
From a machine learning perspective, the problem of solving partial differential equations (PDEs) can be formulated as a least-squares minimization problem, where neural networks are used to parametrize PDE solutions. Ideally, a global minimizer of the square loss corresponds to a solution of the PDE. In this paper we start with a special type of nonlinear PDE arising from differential geometry, the isometric embedding equation, which relates to many long-standing open questions in geometry and analysis. We show that the gradient descent method can identify a global minimizer of the least-squares loss function with two-layer neural networks under the assumption of over-parametrization. As a consequence, this solves the surface embedding locally with a prescribed Riemannian metric. We also extend the convergence analysis for gradient descent to higher-order linear PDEs under the over-parametrization assumption.
Yi Feng · Sizhe Li · Ioannis Panageas · Xiao Wang


Taylor TD-learning
(Poster)
Many reinforcement learning approaches rely on temporal-difference (TD) learning to learn a critic. However, TD-learning updates can be high variance due to their sole reliance on Monte Carlo estimates of the updates. Here, we introduce a model-based RL framework, Taylor TD, which reduces this variance. Taylor TD uses a first-order Taylor series expansion of TD updates. This expansion allows us to analytically integrate over stochasticity in the action choice, and some stochasticity in the state distribution for the initial state and action of each TD update. We include theoretical and empirical evidence of Taylor TD updates being lower variance than (standard) TD updates. Additionally, we show that Taylor TD has the same stable learning guarantees as (standard) TD-learning under linear function approximation. Next, we combine Taylor TD with the TD3 algorithm (Fujimoto et al., 2018) into TaTD3. We show TaTD3 performs as well as, if not better than, several state-of-the-art model-free and model-based baseline algorithms on a set of standard benchmark tasks. Finally, we include further analysis of the settings in which Taylor TD may be most beneficial to performance relative to standard TD-learning.
Michele Garibbo · Maxime Robeyns · Laurence Aitchison


Toward Understanding Latent Model Learning in MuZero: A Case Study in Linear Quadratic Gaussian Control
(Poster)
We study the problem of representation learning for control from partial and potentially high-dimensional observations. We approach this problem via direct latent model learning, where one directly learns a dynamical model in some latent state space by predicting costs. In particular, we establish finite-sample guarantees of finding a near-optimal representation function and a near-optimal controller using the directly learned latent model for infinite-horizon time-invariant Linear Quadratic Gaussian (LQG) control. A part of our approach to latent model learning closely resembles MuZero, a recent breakthrough in empirical reinforcement learning, in that it learns latent dynamics implicitly by predicting cumulative costs. A key technical contribution of this work is to prove persistency of excitation for a new stochastic process that arises from our analysis of quadratic regression in our approach.
Yi Tian · Kaiqing Zhang · Russ Tedrake · Suvrit Sra


Balancing exploration and exploitation in Partially Observed Linear Contextual Bandits via Thompson Sampling
(Poster)
Contextual bandits constitute a popular framework for studying the exploration-exploitation trade-off under finitely many options with side information. In the majority of the existing works, contexts are assumed to be perfectly observed, while in practice it is more reasonable to assume that they are observed partially. In this work, we study reinforcement learning algorithms for contextual bandits with partial observations. First, we consider different structures for partial observability and their corresponding optimal policies. Subsequently, we present and analyze reinforcement learning algorithms for partially observed contextual bandits with noisy linear observation structures. For these algorithms, which utilize Thompson sampling, we establish estimation accuracy and regret bounds under different structural assumptions.
Hongju Park · Mohamad Kazem Shirani Faradonbeh


Leveraging Factored Action Spaces for Off-Policy Evaluation
(Poster)
In high-stakes decision-making domains such as healthcare and self-driving cars, off-policy evaluation (OPE) can help practitioners understand the performance of a new policy before deployment by using observational data. However, when dealing with problems involving large and combinatorial action spaces, existing OPE estimators often suffer from substantial bias and/or variance. In this work, we investigate the role of factored action spaces in improving OPE. Specifically, we propose and study a new family of decomposed importance sampling (IS) estimators that leverage the inherent factorisation structure of actions. We theoretically prove that our proposed estimator achieves lower variance and remains unbiased, subject to certain assumptions regarding the underlying problem structure. Empirically, we demonstrate that our estimator outperforms standard IS in terms of mean squared error and conduct sensitivity analyses probing the validity of various assumptions. Future work should investigate how to design or derive the factorisation for practical problems so as to maximally adhere to the theoretical assumptions.
Aaman Rebello · Shengpu Tang · Jenna Wiens · Sonali Parbhoo


Diffusion Model-Augmented Behavioral Cloning
(Poster)
Imitation learning addresses the challenge of learning by observing an expert's demonstrations without access to reward signals from the environment. Most existing imitation learning methods that do not require interacting with the environment either model the expert distribution as the conditional probability p(a|s) (e.g., behavioral cloning, BC) or the joint probability p(s, a) (e.g., implicit behavioral cloning). Despite its simplicity, modeling the conditional probability with BC usually struggles with generalization. While modeling the joint probability can lead to improved generalization performance, the inference procedure can be time-consuming and it often suffers from manifold overfitting. This work proposes an imitation learning framework that benefits from modeling both the conditional and joint probability of the expert distribution. Our proposed diffusion model-augmented behavioral cloning (DBC) employs a diffusion model trained to model expert behaviors and learns a policy to optimize both the BC loss (conditional) and our proposed diffusion model loss (joint). DBC outperforms baselines in various continuous control tasks in navigation, robot arm manipulation, dexterous manipulation, and locomotion. We design additional experiments to verify the limitations of modeling either the conditional probability or the joint probability of the expert distribution, as well as to compare different generative models.
Hsiang-Chun Wang · Shang-Fu Chen · Ming-Hao Hsu · Chun-Mao Lai · Shao-Hua Sun


Unbalanced Diffusion Schrödinger Bridge
(Poster)
Schrödinger bridges (SBs) provide an elegant framework for modeling the temporal evolution of populations in physical, chemical, or biological systems. Such natural processes are commonly subject to changes in population size over time due to the emergence of new species or birth and death events. However, existing neural parameterizations of SBs such as diffusion Schrödinger bridges (DSBs) are restricted to settings in which the endpoints of the stochastic process are both probability measures and assume conservation of mass constraints. To address this limitation, we introduce unbalanced DSBs which model the temporal evolution of marginals with arbitrary finite mass. This is achieved by deriving the time reversal of stochastic differential equations (SDEs) with killing and birth terms. We present two novel algorithmic schemes that comprise a scalable objective function for training unbalanced DSBs and provide a theoretical analysis alongside challenging applications on predicting heterogeneous molecular single-cell responses to various cancer drugs and simulating the emergence and spread of new viral variants.
Matteo Pariset · Ya-Ping Hsieh · Charlotte Bunne · Andreas Krause · Valentin De Bortoli


Aligned Diffusion Schrödinger Bridges
(Poster)
Diffusion Schrödinger Bridges (DSBs) have recently emerged as a powerful framework for recovering stochastic dynamics via their marginal observations at different time points. Despite numerous successful applications, existing algorithms for solving DSBs have so far failed to utilize the structure of aligned data, which naturally arises in many biological phenomena. In this paper, we propose a novel algorithmic framework that, for the first time, solves DSBs while respecting the data alignment. Our approach hinges on a combination of two decades-old ideas: the classical Schrödinger bridge theory and Doob's $h$-transform. Compared to prior methods, our approach leads to a simpler training procedure with lower variance, which we further augment with principled regularization schemes. This ultimately leads to sizeable improvements across experiments on synthetic and real data, including the tasks of predicting conformational changes in proteins and temporal evolution of cellular differentiation processes.
Vignesh Ram Somnath · Matteo Pariset · Ya-Ping Hsieh · Maria Rodriguez Martinez · Andreas Krause · Charlotte Bunne


Dynamic Feature-based Newsvendor
(Poster)
In this paper, we investigate the dynamic feature-based newsvendor problem within a multi-period inventory control setting featuring backlogged demands. Combining the significance of feature information with a multi-stage decision-making framework, we propose a general dynamic contextual newsvendor model. For this general model, we propose the Contextual Value Iteration (CVI) algorithm and obtain its convergence rate to the optimal solution as well as a sample complexity result. Our experimental results also demonstrate that CVI is more efficient than value iteration for the vanilla Markovian Decision Process (MDP).
Zexing Xu · Ziyi Chen · Xin Chen


Equivalence Class Learning for GENERIC Systems
(Poster)
In recent years, applications of neural networks to the modeling of physical phenomena have attracted much attention. This study proposes a method for learning systems that are described by the GENERIC formalism, which is a combination of analytical mechanics and non-equilibrium thermodynamics. GENERIC systems admit the energy conservation law and the law of increasing entropy under certain conditions. However, designing neural network models that satisfy these conditions is difficult. In this study, we introduce a relaxation model of the GENERIC form, thereby introducing an equivalence class into the set of models. Because the equivalence class of the target model includes a model that can be learned by neural networks, the learned model satisfies the energy conservation law and the law of increasing entropy with high accuracy with respect to the true energy and the true entropy.
Baige Xu · Yuhan Chen · Takashi Matsubara · Takaharu Yaguchi


Variational Principle and Variational Integrators for Neural Symplectic Forms
(Poster)
In this study, we investigate the variational principle for neural symplectic forms, thereby designing the variational integrators for this model. In recent years, neural network models for physical phenomena have been attracting much attention. In particular, the neural symplectic form is a method that can model general Hamiltonian systems, which are not necessarily in canonical form. In this paper, we make the following two contributions regarding this model. Firstly, we show that this model is derived from a variational principle and hence admits the Noether theorem. Secondly, when the trained models are used for simulations, they must be discretized using numerical integrators; however, unless carefully designed, numerical integrators destroy physical laws.
Yuhan Chen · Baige Xu · Takashi Matsubara · Takaharu Yaguchi


Action and Trajectory Planning for Urban Autonomous Driving with Hierarchical Reinforcement Learning
(Poster)
Reinforcement Learning (RL) has made promising progress in planning and decision-making for Autonomous Vehicles (AVs) in simple driving scenarios. However, existing RL algorithms for AVs fail to learn critical driving skills in complex urban scenarios. First, urban driving scenarios require AVs to handle multiple driving tasks, which conventional RL algorithms cannot do. Second, the presence of other vehicles in urban scenarios results in a dynamically changing environment, which challenges RL algorithms to plan the action and trajectory of the AV. In this work, we propose an action and trajectory planner based on Hierarchical Reinforcement Learning (atHRL), which models the agent behavior in a hierarchical model by using the mid-level perception of the lidar and bird's-eye view. The proposed atHRL method learns to make decisions about the agent's future trajectory and computes target waypoints under continuous settings based on a hierarchical DDPG algorithm. The waypoints planned by the atHRL model are then sent to a low-level controller to generate the steering and throttle commands required for the vehicle maneuver. We empirically verify the efficacy of atHRL through extensive experiments in complex urban driving scenarios that comprise multiple tasks with the presence of other vehicles in the CARLA simulator. The experimental results suggest a significant performance improvement compared to the state-of-the-art RL methods.
Xinyang Lu · Xiaofeng Fan · Tianying Wang


Accelerated Policy Gradient: On the Nesterov Momentum for Reinforcement Learning
(
Poster
)
link »
Policy gradient methods have recently been shown to enjoy global convergence at a $\Theta(1/t)$ rate in the non-regularized tabular softmax setting. Accordingly, one important research question is whether this convergence rate can be further improved, with only first-order updates. In this paper, we answer the above question from the perspective of momentum by adapting the celebrated Nesterov's accelerated gradient (NAG) method to reinforcement learning (RL), termed *Accelerated Policy Gradient* (APG). To demonstrate the potential of APG in achieving faster global convergence, we start from the bandit setting and formally show that with the true gradient, APG with softmax policy parametrization converges to an optimal policy at a $\tilde{O}(1/t^2)$ rate. To the best of our knowledge, this is the first characterization of the global convergence rate of NAG in the context of RL. Notably, our analysis relies on one interesting finding: Regardless of the initialization, APG could end up reaching a locally-concave regime, where APG could benefit significantly from the momentum, within finite iterations. By means of numerical validation, we confirm that APG exhibits $\tilde{O}(1/t^2)$ rate in the bandit setting and still preserves the $\tilde{O}(1/t^2)$ rate in various Markov decision process instances, showing that APG could significantly improve the convergence behavior over the standard policy gradient.

Yen-Ju Chen · Nai-Chieh Huang · Ping-Chun Hsieh 🔗 
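To make the bandit setting in the abstract above concrete, the following is a minimal sketch of exact-gradient softmax policy gradient with and without Nesterov-style momentum. It is an illustration only, not the authors' APG algorithm: the momentum schedule `(t - 1) / (t + 2)`, the step size, and the function names (`run_pg`, `run_nag`) are assumptions chosen for this example.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over arm logits.
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient(theta, r):
    # Exact gradient of the expected reward pi(theta)^T r w.r.t. the logits.
    pi = softmax(theta)
    return pi * (r - pi @ r)

def run_pg(r, steps, lr):
    # Plain gradient ascent on the logits, starting from the uniform policy.
    theta = np.zeros_like(r)
    for _ in range(steps):
        theta = theta + lr * policy_gradient(theta, r)
    return softmax(theta) @ r

def run_nag(r, steps, lr):
    # Nesterov-style update: take the gradient at a look-ahead point.
    # The (t - 1) / (t + 2) schedule is an assumed, textbook choice.
    theta = np.zeros_like(r)
    prev = theta.copy()
    for t in range(1, steps + 1):
        look = theta + (t - 1) / (t + 2) * (theta - prev)
        prev = theta
        theta = look + lr * policy_gradient(look, r)
    return softmax(theta) @ r

rewards = np.array([1.0, 0.8, 0.2])
value_pg = run_pg(rewards, steps=200, lr=0.4)
value_nag = run_nag(rewards, steps=200, lr=0.4)
```

Both runs start from the uniform policy, so their final expected rewards can be compared directly with the mean reward (the starting value) and the maximum reward (the optimum).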


Exponential weight averaging as damped harmonic motion
(
Poster
)
link »
The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of stochastic quantities in deep learning optimization. Recently, EMA has seen considerable use in generative models, where it is computed with respect to the model weights, and significantly improves the stability of the inference model during and after training. While the practice of weight averaging at the end of training is well-studied and known to improve estimates of local optima, the benefits of EMA over the course of training are less understood. In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring. We then leverage this physical analogy to analyze the effectiveness of EMA, and propose an improved training algorithm, which we call \methodname{}. Finally, we demonstrate theoretically and empirically several advantages enjoyed by \methodname{} over standard EMA. 
Jonathan Patsenker · Henry Li · Yuval Kluger 🔗 
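For readers unfamiliar with weight EMA, the standard update can be sketched as below. Written in its "relaxation" form, the EMA weights move a fraction `1 - decay` of the way toward the current model weights at every step, which is the intuition behind the spring picture in the abstract. This shows only the standard EMA recursion, not the improved algorithm proposed in the paper; the function name `ema_update` is illustrative.

```python
import numpy as np

def ema_update(ema, weights, decay):
    # Equivalent to decay * ema + (1 - decay) * weights, written as a
    # pull of the EMA weights toward the current model weights.
    return ema + (1.0 - decay) * (weights - ema)

# Track a fixed target: the EMA relaxes toward it geometrically.
ema = np.zeros(2)
target = np.array([1.0, -2.0])
for _ in range(50):
    ema = ema_update(ema, target, decay=0.9)
```

With a fixed target, the remaining gap shrinks by the factor `decay` at every step, so a decay closer to 1 gives a smoother but slower-moving average.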


Algorithms for Optimal Adaptation of Diffusion Models to Reward Functions
(
Poster
)
link »
We develop algorithms for adapting pretrained diffusion models to optimize reward functions while retaining fidelity to the pretrained model. We propose a general framework for this adaptation that trades off fidelity to a pretrained diffusion model against achieving high reward. Our algorithms take advantage of the continuous nature of diffusion processes to pose reward-based learning either as a trajectory optimization or continuous state reinforcement learning problem. We demonstrate the efficacy of our approach across several application domains, including the generation of time series of household power consumption and images satisfying specific constraints like the absence of memorized images or corruptions. 
Krishnamurthy Dvijotham · Shayegan Omidshafiei · Kimin Lee · Katie Collins · Deepak Ramachandran · Adrian Weller · Mohammad Ghavamzadeh · Milad Nasresfahani · Ying Fan · Jeremiah Liu 🔗 


On learning history-based policies for controlling Markov decision processes
(
Poster
)
link »
Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memoryless counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memoryless features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks. 
Gandharv Patil · Aditya Mahajan · Doina Precup 🔗 


Visual Dexterity: In-hand Dexterous Manipulation from Depth
(
Poster
)
link »
In-hand object reorientation is necessary for performing many dexterous manipulation tasks, such as tool use in unstructured environments that remain beyond the reach of current robots. Prior works built reorientation systems that assume one or many of the following specific circumstances: reorienting only specific objects with simple shapes, limited range of reorientation, slow or quasi-static manipulation, etc. We overcome these limitations and present a general object reorientation controller that is trained in simulation and evaluated in the real world. Our system uses readings from a single commodity depth camera to dynamically reorient complex objects by any amount in real time. The controller generalizes to new objects not used during training. It even demonstrates some capability of reorienting objects in the air held by a downward-facing hand that must counteract gravity during reorientation. 
Tao Chen · Megha Tippur · Siyang Wu · Vikash Kumar · Edward Adelson · Pulkit Agrawal 🔗 


Learning from Sparse Offline Datasets via Conservative Density Estimation
(
Poster
)
link »
Offline reinforcement learning (RL) offers a promising direction for learning policies from pre-collected datasets without requiring further interactions with the environment. However, existing methods struggle to handle out-of-distribution (OOD) extrapolation errors, especially in sparse reward or scarce data settings. In this paper, we propose a novel training algorithm called Conservative Density Estimation (CDE), which addresses this challenge by explicitly imposing constraints on the state-action occupancy stationary distribution. CDE overcomes the limitations of existing approaches, such as the stationary distribution correction method, by addressing the support mismatch issue in marginal importance sampling. Our method achieves state-of-the-art performance on the D4RL benchmark. Notably, CDE consistently outperforms baselines in challenging tasks with sparse rewards or insufficient data, demonstrating the advantages of our approach in addressing the extrapolation error problem in offline RL. 
Zhepeng Cen · Zuxin Liu · Zitong Wang · Yihang Yao · Henry Lam · Ding Zhao 🔗 


Undo Maps: A Tool for Adapting Policies to Perceptual Distortions
(
Poster
)
link »
People adapt to changes in their visual field all the time, like when their vision is occluded while driving. Agents trained with RL struggle to do the same. Here, we address how to transfer knowledge acquired in one domain to another when the domains differ in their state representation. For example, a policy may have been trained in an environment where states were represented as colored images, but we would now like to deploy this agent in a domain where images appear black-and-white. We propose \textsc{Tail} (task-agnostic imitation learning), a framework which learns to undo these kinds of changes between domains in order to achieve transfer. This enables an agent, regardless of the task it was trained for, to adapt to perceptual distortions by first mapping the states in the new domain, such as grayscale images, back to the original domain where they appear in color, and then by acting with the same policy. Our procedure depends on an optimal transport formulation between trajectories in the two domains, shows promise in simple experimental settings, and resembles algorithms from imitation learning. 
Abhi Gupta · Ted Moskovitz · David Alvarez-Melis · Aldo Pacchiano 🔗 


When is Agnostic Reinforcement Learning Statistically Tractable?
(
Poster
)
link »
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $\epsilon$-suboptimal policy with respect to $\Pi$? Towards that end, we introduce a new complexity measure, called the spanning capacity, that depends solely on the set $\Pi$ and is independent of the MDP dynamics. With a generative model, we show that the spanning capacity characterizes PAC learnability for every policy class $\Pi$. However, for online RL, the situation is more subtle. We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional sunflower structure which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as recent developments for reachable-state identification and policy evaluation in reward-free exploration.

Gene Li · Zeyu Jia · Alexander Rakhlin · Ayush Sekhari · Nati Srebro 🔗 


Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport
(
Poster
)
link »
Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized \textit{conditional flow matching} (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, OT-CFM is the first method to compute dynamic OT in a simulation-free way. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single-cell dynamics, unsupervised image translation, and Schrödinger bridge inference. 
Alexander Tong · Nikolay Malkin · Guillaume Huguet · Yanlei Zhang · Jarrid Rector-Brooks · Kilian Fatras · Guy Wolf · Yoshua Bengio 🔗 
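The "simulation-free regression objective" mentioned in the abstract above can be illustrated with a small sketch. The abstract does not specify the conditional path, so the code below assumes a commonly used choice: a straight-line interpolation between a source sample `x0` and a target sample `x1`, with optional Gaussian noise of scale `sigma`, and the constant velocity `x1 - x0` as the regression target. All names here (`cfm_training_pair`, `cfm_loss`) are hypothetical, and the minibatch optimal transport pairing of OT-CFM is not shown.

```python
import numpy as np

def cfm_training_pair(x0, x1, sigma=0.0, seed=0):
    # Assumed conditional path: x_t ~ N((1 - t) * x0 + t * x1, sigma^2 I).
    rng = np.random.default_rng(seed)
    t = rng.uniform(size=(x0.shape[0], 1))
    mean = (1.0 - t) * x0 + t * x1
    xt = mean + sigma * rng.normal(size=x0.shape)
    # Regression target: the velocity of the straight line from x0 to x1.
    target = x1 - x0
    return t, xt, target

def cfm_loss(v_pred, target):
    # Mean squared error between predicted and target velocities.
    return np.mean(np.sum((v_pred - target) ** 2, axis=1))

rng = np.random.default_rng(1)
x0 = rng.normal(size=(8, 2))          # source samples (need not be Gaussian)
x1 = rng.normal(size=(8, 2)) + 3.0    # target samples
t, xt, target = cfm_training_pair(x0, x1)
```

No ODE is integrated at any point: a training step only needs a batch of sample pairs, a random time, and one evaluation of the velocity model at `(t, xt)`.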


In-Context Decision-Making from Supervised Pretraining
(
Poster
)
link »
Large transformer models trained on diverse datasets have shown a remarkable ability to learn in-context, achieving high few-shot performance on tasks they were not explicitly trained to solve. In this paper, we study the in-context learning capabilities of transformers in decision-making problems, i.e., bandits and Markov decision processes. To do so, we introduce and study a supervised pretraining method where the transformer predicts an optimal action given a query state and an in-context dataset of interactions, across a diverse set of tasks. This procedure, while simple, produces an in-context algorithm with several surprising capabilities. We observe that the pretrained transformer can be used to solve a range of decision-making problems, exhibiting both exploration online and conservatism offline, despite not being explicitly trained to do so. It also generalizes beyond the pretraining distribution to new tasks and automatically adapts its decision-making strategies to unknown structure. Theoretically, we show the pretrained transformer can be viewed as an implementation of posterior sampling. We further leverage this connection to provide guarantees on its regret, and prove that it can learn a decision-making algorithm stronger than a source algorithm used to generate its pretraining data. These results suggest a promising yet simple path towards instilling strong in-context decision-making abilities in transformers. 
Jonathan Lee · Annie Xie · Aldo Pacchiano · Yash Chandak · Chelsea Finn · Ofir Nachum · Emma Brunskill 🔗 


Statistics estimation in neural network training: a recursive identification approach
(
Poster
)
link »
A common practice in minibatch neural network training is to estimate global statistics using exponential moving averages (EMA). However, such methods can be sensitive to the EMA decay parameter, which is typically set by hand. In this paper, we introduce Adaptive Linear State Estimation (ALiSE), an online method for adapting the parameters of a linear estimation model such as an EMA. Our work establishes a connection between parameter estimation methods in deep learning, including ALiSE, and recursive identification techniques in control theory. We apply ALiSE to a range of deep learning scenarios and show that it can learn sensible schedules for the EMA decay parameter. Compared to the naive EMA baseline, ALiSE leads to matching or accelerated convergence during training. 
Ruth Crasto · Xuchan Bao · Roger Grosse 🔗 


Learning to Optimize with Recurrent Hierarchical Transformers
(
Poster
)
link »
Learning to optimize (L2O) has received a lot of attention recently because of its potential to leverage data to outperform hand-designed optimization algorithms such as Adam. Typically, these learned optimizers are meta-learned on optimization tasks to achieve rapid convergence. However, they can suffer from high meta-training costs and memory overhead. Recent attempts have been made to reduce the computational costs of these learned optimizers by introducing a hierarchy that enables them to perform most of the heavy computation at the tensor (layer) level rather than the parameter level. This not only leads to sublinear memory cost with respect to number of parameters, but also allows for a higher representation capacity for efficient learned optimization. To this end, we propose an efficient transformer-based learned optimizer which facilitates communication among tensors with self-attention and keeps track of optimization history with recurrence. We show that our optimizer learns to optimize better than strong learned optimizer baselines at a comparable memory overhead, thereby suggesting encouraging scaling trends. 
Abhinav Moudgil · Boris Knyazev · Guillaume Lajoie · Eugene Belilovsky 🔗 


Fixed-Budget Hypothesis Best Arm Identification: On the Information Loss in Experimental Design
(
Poster
)
link »
Experimental design plays a crucial role in evidence-based science with multiple treatment arms, such as online advertisements or medical treatments. This study addresses the task of identifying the best treatment arm, which has the highest expected outcome among multiple treatment arms. We investigate the influence of available information regarding the distributions of treatment arms in experiments. In our experimental setup, we first designate a hypothetical ``best'' treatment arm and then conduct an experiment to verify whether this hypothetically best treatment arm is indeed the ``true'' best treatment arm. Our null hypothesis posits that the hypothetical best treatment is not the actual best, and our objective is to minimize the likelihood of recommending other treatment arms when the null hypothesis is false; in other words, when the true best treatment arm is the same as the hypothetical best treatment. We demonstrate that the optimal experimental design significantly depends on knowledge about distributional information, examined through an information-theoretic approach. Specifically, we discuss worst-case scenarios, characterized by a loss of distributional information, as circumstances when gaps between the expected outcomes of the best and suboptimal treatment arms converge to zero. After discussing asymptotic optimality, we propose an experimental design informed by the available information. 
Masahiro Kato · Masaaki Imaizumi · Takuya Ishihara · Toru Kitagawa 🔗 


Unbalanced Optimal Transport meets Sliced-Wasserstein
(
Poster
)
link »
Optimal transport (OT) has emerged as a powerful framework to compare probability measures, a fundamental task in many statistical and machine learning problems. Substantial advances have been made over the last decade in designing OT variants which are either computationally and statistically more efficient, or more robust to the measures/datasets being compared. Among them, sliced OT distances have been extensively used to mitigate optimal transport's cubic algorithmic complexity and curse of dimensionality. In parallel, unbalanced OT was designed to allow comparisons of more general positive measures, while being more robust to outliers. In this paper, we propose to combine these two concepts, namely slicing and unbalanced OT, to develop a general framework for efficiently comparing positive measures. We propose two new loss functions based on the idea of slicing unbalanced OT, and study their induced topology and statistical properties. We then develop a fast Frank-Wolfe-type algorithm to compute these losses, and show that our methodology is modular as it encompasses and extends prior related work. We finally conduct an empirical analysis of our loss functions and methodology on both synthetic and real datasets, to illustrate their relevance and applicability. 
Thibault Sejourne · Clément Bonet · Kilian Fatras · Kimia Nadjahi · Nicolas Courty 🔗 
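As background for the slicing idea in the abstract above, here is a minimal Monte Carlo sketch of the classical (balanced) sliced Wasserstein-2 distance between two point clouds of equal size. It is not the unbalanced losses proposed in the paper: it only illustrates why slicing is cheap, since each one-dimensional OT problem reduces to a sort. The function name, the number of projections, and the equal-size, uniform-weight assumption are choices made for this example.

```python
import numpy as np

def sliced_wasserstein2(x, y, n_proj=128, seed=0):
    # x, y: arrays of shape (n, d) with the same n and uniform weights.
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random directions on the unit sphere.
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project, then sort each 1D projection: sorted samples are optimally
    # matched in one dimension, so no OT solver is needed.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))

rng = np.random.default_rng(42)
cloud_a = rng.normal(size=(50, 3))
cloud_b = cloud_a + 1.0   # the same cloud, translated
dist_same = sliced_wasserstein2(cloud_a, cloud_a)
dist_shift = sliced_wasserstein2(cloud_a, cloud_b)
```

Each projection costs one sort, so the total work is on the order of `n_proj * n * log(n)` rather than the cubic cost of solving a full OT problem on `n` points.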


Improved sampling via learned diffusions
(
Poster
)
link »
Recently, a series of papers proposed deep learning-based approaches to sample from unnormalized target densities using controlled diffusion processes. In this work, we identify these approaches as special cases of the Schrödinger bridge problem, seeking the most likely stochastic evolution between a given prior distribution and the specified target, and propose the perspective from measures on path space as a unifying framework. The optimal controls of such entropy-constrained optimal transport problems can then be described by systems of partial differential equations and corresponding backward stochastic differential equations. Building on these optimality conditions and exploiting the path measure perspective, we obtain variational formulations of the respective approaches and recover the objectives which can be approached via gradient descent. Our formulations allow us to introduce losses different from the typically employed reverse Kullback-Leibler divergence that is known to suffer from mode collapse. In particular, we propose the so-called log-variance loss, which exhibits favorable numerical properties and leads to significantly improved performance across all considered approaches. 
Julius Berner · Lorenz Richter · Guan-Horng Liu 🔗 


Stability of Multi-Agent Learning: Convergence in Network Games with Many Players
(
Poster
)
link »
The behaviour of multi-agent learning in many-player games has been shown to display complex dynamics outside of restrictive examples such as network zero-sum games. In addition, it has been shown that convergent behaviour is less likely to occur as the number of players increases. To make progress in resolving this problem, we study Q-Learning dynamics and determine a sufficient condition for the dynamics to converge to a unique equilibrium in any network game. We find that this condition depends on the nature of pairwise interactions and on the network structure, but is explicitly independent of the total number of agents in the game. We evaluate this result on a number of representative network games and show that, under suitable network conditions, stable learning dynamics can be achieved with an arbitrary number of agents. 
Aamal Hussain · Dan Leonte · Francesco Belardinelli · Georgios Piliouras 🔗 


Limited Information Opponent Modeling
(
Poster
)
link »
The goal of opponent modeling is to model the opponent policy to maximize the reward of the main agent. Most prior works fail to effectively handle scenarios where opponent information is limited. To this end, we propose a Limited Information Opponent Modeling (LIOM) approach that extracts opponent policy representations across episodes using only self-observations. LIOM introduces a novel policy-based data augmentation method that extracts opponent policy representations offline via contrastive learning and incorporates them as additional inputs for training a general response policy. During online testing, LIOM dynamically responds to opponent policies by extracting opponent policy representations from recent historical trajectory data and combining them with the general policy. Moreover, LIOM ensures a lower bound on expected rewards through a balance between conservatism and exploitation. Experimental results demonstrate that LIOM is able to accurately extract opponent policy representations even when the opponent's information is limited, and has a certain degree of generalization ability for unknown policies, outperforming existing opponent modeling algorithms. 
Yongliang Lv · Yan Zheng · Jianye Hao 🔗 


Game Theoretic Neural ODE Optimizer
(
Poster
)
link »
In this work, we present a novel Game Theoretic Neural Ordinary Differential Equation (Neural ODE) optimizer based on the minimax Differential Dynamic Programming paradigm. As neural networks and neural ODEs tend to be vulnerable to attacks, and their predictions are fragile in the presence of adversarial examples, we aim to design a robust game theoretic optimizer based on principles of Min-Max Optimal Control. By formulating Neural ODE optimization as a Min-Max Optimal Control Problem, our proposed algorithm aims to enhance the robustness of neural networks against adversarial attacks by finding policies that perform well under worst-case scenarios. Leveraging recent advances in the interpretation of Neural ODE training through an Optimal Control Problem perspective, we extend recent second order optimization techniques to a game theoretic setting and adapt them to our proposed method. This allows our optimizer to efficiently handle the increased complexity stemming from the computation of double the amount of learnable parameters. The resulting optimizer, Game Theoretic Second-Order Neural Optimizer (GTSONO), enables more effective exploration of the control policy space, leading to improved robustness against adversarial attacks. Experimental evaluations on benchmark datasets demonstrate the superiority of GTSONO compared to existing state-of-the-art optimizers in terms of both performance and efficiency against state-of-the-art adversarial defense methods. 
Panagiotis Theodoropoulos · Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou 🔗 


A neural RDE approach for continuous-time non-Markovian stochastic control problems
(
Poster
)
link »
We propose a novel framework for solving continuous-time non-Markovian stochastic optimal problems by means of neural rough differential equations (Neural RDEs) introduced in Morrill et al. (2021). Non-Markovianity naturally arises in control problems due to the time delay effects in the system coefficients or the driving noises, which leads to optimal control strategies depending explicitly on the historical trajectories of the system state. By modelling the control process as the solution of a Neural RDE driven by the state process, we show that the control-state joint dynamics are governed by an uncontrolled, augmented Neural RDE, allowing for fast Monte-Carlo estimation of the value function via trajectory simulation and memory-efficient backpropagation. We provide theoretical underpinnings for the proposed algorithmic framework by demonstrating that Neural RDEs serve as universal approximators for functions of random rough paths. Exhaustive numerical experiments on non-Markovian stochastic control problems are presented, which reveal that the proposed framework is time-resolution-invariant and achieves higher accuracy and better stability in irregular sampling compared to existing RNN-based approaches. 
Melker Höglund · Emilio Ferrucci · Camilo Hernández · Aitor Muguruza Gonzalez · Cristopher Salvi · Leandro Sánchez-Betancourt · Yufei Zhang 🔗 


On First-Order Meta-Reinforcement Learning with Moreau Envelopes
(
Poster
)
link »
Meta-Reinforcement Learning (MRL) is a promising framework for training agents that can quickly adapt to new environments and tasks. In this work, we study the MRL problem under the policy gradient formulation, where we propose a novel algorithm that uses Moreau envelope surrogate regularizers to jointly learn a meta-policy that is adjustable to the environment of each individual task. Our algorithm, called Moreau Envelope Meta-Reinforcement Learning (MEMRL), learns a meta-policy that can adapt to a distribution of tasks by efficiently updating the policy parameters using a combination of gradient-based optimization and Moreau Envelope regularization. Moreau Envelopes provide a smooth approximation of the policy optimization problem, which enables us to apply standard optimization techniques and converge to an appropriate stationary point. We provide a detailed analysis of the MEMRL algorithm, where we show a sublinear convergence rate to a first-order stationary point for non-convex policy gradient optimization. We finally show the effectiveness of MEMRL on a multi-task 2D-navigation problem. 
Mohammad Taha Toghani · Sebastian Perez-Salazar · Cesar Uribe 🔗 
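For reference, the textbook definition of the Moreau envelope that the abstract above builds on can be written as follows. The symbols $f$, $\lambda$, $x$, and $y$ are generic notation introduced here and are not taken from the paper; the gradient identity is the standard one for convex $f$, and how the paper applies the envelope to the non-convex policy gradient objective is not shown.

$$
M_{\lambda} f(x) \;=\; \min_{y} \Big\{ f(y) + \frac{1}{2\lambda}\,\|y - x\|^{2} \Big\},
\qquad
\nabla M_{\lambda} f(x) \;=\; \frac{1}{\lambda}\,\big(x - \operatorname{prox}_{\lambda f}(x)\big),
$$

where $\operatorname{prox}_{\lambda f}(x)$ denotes the minimizer $y$ in the first expression. A smaller $\lambda$ keeps the envelope closer to $f$, while a larger $\lambda$ gives a smoother surrogate.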


Vector Quantile Regression on Manifolds
(
Poster
)
link »
Quantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on a Euclidean domain. Although the notion of quantiles was recently extended to multivariate distributions, QR for multivariate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate measurements), tori (dihedral angles in proteins), or Lie groups (attitude in navigation). By leveraging optimal transport theory and the notion of $c$-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets. We demonstrate the approach's efficacy and provide insights regarding the meaning of non-Euclidean quantiles through preliminary synthetic data experiments.

Marco Pegoraro · Sanketh Vedula · Aviv A. Rosenberg · Irene Tallini · Emanuele Rodola · Alexander Bronstein 🔗 
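As a point of comparison for the abstract above, classical univariate quantile regression rests on the pinball (check) loss, whose minimizer over a constant is the sample quantile. The sketch below illustrates only that scalar, Euclidean baseline, not the manifold-valued vector quantiles defined in the paper; the function name and the toy data are chosen for this example.

```python
import numpy as np

def pinball_loss(y, q, tau):
    # Check loss at level tau: under-predictions are weighted by tau,
    # over-predictions by (1 - tau).
    diff = y - q
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

# Toy data: the integers 0..100. Search for the constant q that
# minimizes the pinball loss at level tau = 0.25.
y = np.arange(101, dtype=float)
candidates = y.copy()
losses = np.array([pinball_loss(y, q, 0.25) for q in candidates])
best_q = candidates[np.argmin(losses)]
```

Here `best_q` coincides with the 25th percentile of the data, which is why minimizing this loss over a parametric function of the features yields conditional quantile estimates.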


Learning with Learning Awareness using Meta-Values
(
Poster
)
link »
Gradient-based learning in multi-agent systems is difficult because the gradient derives from a first-order model which does not account for the interaction between agents' learning processes. LOLA (Foerster et al., 2018) accounts for this by differentiating through one step of optimization. We extend the ideas of LOLA and develop a fully-general value-based approach to optimization. At the core is a function we call the meta-value, which at each point in joint-policy space gives for each agent a discounted sum of its objective over future optimization steps. We argue that the gradient of the meta-value gives a more reliable improvement direction than the gradient of the original objective, because the meta-value derives from empirical observations of the effects of optimization. We show how the meta-value can be approximated by training a neural network to minimize TD error along optimization trajectories in which agents follow the gradient of the meta-value. We analyze the behavior of our method on the Logistic Game (Letcher, 2018) and on the Iterated Prisoner's Dilemma. 
Tim Cooijmans · Milad Aghajohari · Aaron Courville 🔗 


Kernel Mirror Prox and RKHS Gradient Flow for Mixed Functional Nash Equilibrium
(
Poster
)
link »
The theoretical analysis of machine learning algorithms, such as deep generative modeling, motivates multiple recent works on the Mixed Nash Equilibrium (MNE) problem. Different from MNE, this paper formulates the Mixed Functional Nash Equilibrium (MFNE), which replaces one of the measure optimization problems with optimization over a class of dual functions, e.g., the reproducing kernel Hilbert space (RKHS) in the case of Mixed Kernel Nash Equilibrium (MKNE). We show that our MFNE and MKNE framework form the backbones that govern several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. To model the infinite-dimensional continuous-limit optimization dynamics, we propose the Interacting Wasserstein-Kernel Gradient Flow, which includes the RKHS flow that is much less common than the Wasserstein gradient flow but enjoys a much simpler convexity structure. Time-discretizing this gradient flow, we propose a primal-dual kernel mirror prox algorithm, which alternates between a dual step in the RKHS, and a primal step in the space of probability measures. We then provide the first unified convergence analysis of our algorithm for this class of MKNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case. As a case study, we apply our analysis to DRO, providing the first primal-dual convergence analysis for DRO with probability-metric constraints.

Pavel Dvurechenskii · Jia-Jie Zhu 🔗 


Simulation-Free Schrödinger Bridges via Score and Flow Matching
(
Poster
)
link »
We present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics given unpaired source and target samples drawn from arbitrary distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic generative modeling as a Schrödinger bridge (SB) problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]$^2$M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, [SF]$^2$M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data.

Alexander Tong · Nikolay Malkin · Kilian Fatras · Lazar Atanackovic · Yanlei Zhang · Guillaume Huguet · Guy Wolf · Yoshua Bengio 🔗 


Latent Space Editing in Transformer-Based Flow Matching
(
Poster
)
link »
This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used U-Net for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but its latent structure and editing ability are as of yet unknown. We therefore adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, all while being highly effective at editing images while preserving the essence of the original content. We will provide our source code and include it in the appendix.

Tao Hu · David Zhang · Meng Tang · Pascal Mettes · Deli Zhao · Cees Snoek 🔗 


Structured State Space Models for In-Context Reinforcement Learning
(
Poster
)
link »
Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful in many reinforcement learning settings. We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks. We show that our modified architecture runs asymptotically faster than Transformers and performs better than LSTM models on a simple memory-based task. Then, by leveraging the model’s ability to handle long-range sequences, we achieve strong performance on a challenging meta-learning task in which the agent is given a randomly-sampled continuous control environment, combined with a randomly-sampled linear projection of the environment's observations and actions. Furthermore, we show the resulting model can adapt to out-of-distribution held-out tasks. Overall, the results presented in this paper suggest that the S4 models are a strong contender for the default architecture used for in-context reinforcement learning.
Christopher Lu · Yannick Schroecker · Albert Gu · Emilio Parisotto · Jakob Foerster · Satinder Singh · Feryal Behbahani 🔗 


Maximum State Entropy Exploration using Predecessor and Successor Representations
(
Poster
)
link »
Animals have a developed ability to explore that aids them in important tasks such as locating food, exploring for shelter, and finding misplaced items. These exploration skills necessarily track where they have been so that they can plan for finding items with relative efficiency. Contemporary exploration algorithms often learn a less efficient exploration strategy because they either condition only on the current state or simply rely on making random open-loop exploratory moves. In this work, we propose $\eta\psi$-Learning, a method to learn efficient exploratory policies by conditioning on past episodic experience to make the next exploratory move. Specifically, $\eta\psi$-Learning learns an exploration policy that maximizes the entropy of the state visitation distribution of a single trajectory. Furthermore, we demonstrate how variants of the predecessor representation and successor representations can be combined to predict the state visitation entropy. Our experiments demonstrate the efficacy of the proposed algorithm to strategically explore the environment and maximize the state coverage with limited samples.

Arnav Kumar Jain · Lucas Lehnert · Irina Rish · Glen Berseth 🔗 
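
The objective named in the abstract above, the entropy of the state visitation distribution of a single trajectory, can be illustrated with a short sketch. This is not the authors' $\eta\psi$-Learning algorithm; it only computes the quantity being maximized, assuming discrete hashable states and an empirical (count-based) visitation distribution. The function name and the toy trajectories are hypothetical.

```python
import math
from collections import Counter


def visitation_entropy(trajectory):
    """Shannon entropy (in nats) of the empirical state visitation distribution."""
    counts = Counter(trajectory)
    total = len(trajectory)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical trajectories over four states: one keeps revisiting two states,
# the other visits each state once and therefore covers more of the space.
revisiting = ["s0", "s1", "s0", "s1"]
covering = ["s0", "s1", "s2", "s3"]

print(visitation_entropy(revisiting))
print(visitation_entropy(covering))
```

A trajectory that spreads its visits evenly over more states scores higher, which is why conditioning on where the agent has already been matters for this objective.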


PAC-Bayesian Bounds for Learning LTI-ss systems with Input from Empirical Loss
(
Poster
)
link »
In this paper we derive a Probably Approximately Correct (PAC)-Bayesian error bound for linear time-invariant (LTI) stochastic dynamical systems with inputs. Such bounds are widespread in machine learning, and they are useful for characterizing the predictive power of models learned from finitely many data points. In particular, the bound derived in this paper relates future average prediction errors with the prediction error generated by the model on the data used for learning. In turn, this allows us to provide finite-sample error bounds for a wide class of learning/system identification algorithms. Furthermore, as LTI systems are a subclass of recurrent neural networks (RNNs), these error bounds could be a first step towards PAC-Bayesian bounds for RNNs.
Deividas Eringis · John Leth · Rafal Wisniewski · Zheng-Hua Tan · Mihaly Petreczky 🔗 


Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding
(
Poster
)
link »
A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice.
Alizée Pace · Hugo Yèche · Bernhard Schölkopf · Gunnar Ratsch · Guy Tennenholtz 🔗 


Preventing Reward Hacking with Occupancy Measure Regularization
(
Poster
)
link »
Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent’s state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment.
Cassidy Laidlaw · Shivam Singhal · Anca Dragan 🔗 
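
To make the contrast in the abstract above concrete, the following sketch estimates occupancy measures from sampled trajectories and compares them with a divergence. It is not ORPO: the undiscounted empirical estimate, the use of total variation distance, and the toy trajectories are all assumptions chosen for illustration.

```python
from collections import Counter


def occupancy_measure(trajectories):
    """Empirical fraction of time steps spent in each (state, action) pair."""
    counts = Counter(pair for traj in trajectories for pair in traj)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def action_marginal(occupancy):
    """Fraction of time steps on which each action is taken, ignoring the state."""
    marginal = Counter()
    for (_, action), mass in occupancy.items():
        marginal[action] += mass
    return dict(marginal)


def total_variation(p, q):
    """Total variation distance between two distributions stored as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical example: a single different first action leads to entirely
# different states afterwards, even though most actions are the same.
safe = [[("start", "right"), ("road", "right"), ("road", "right"), ("road", "right")]]
learned = [[("start", "left"), ("cliff", "right"), ("cliff", "right"), ("cliff", "right")]]

occ_safe = occupancy_measure(safe)
occ_learned = occupancy_measure(learned)

action_gap = total_variation(action_marginal(occ_safe), action_marginal(occ_learned))
occupancy_gap = total_variation(occ_safe, occ_learned)
print(action_gap, occupancy_gap)
```

In this toy case the action frequencies differ only slightly while the state-action occupancies do not overlap at all, which is the kind of disproportionate impact the abstract refers to.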


Regret Bounds for Risk-sensitive Reinforcement Learning with Lipschitz Dynamic Risk Measures
(
Poster
)
link »
We study finite episodic Markov decision processes incorporating dynamic risk measures to capture risk sensitivity. To this end, we present two model-based algorithms applied to \emph{Lipschitz} dynamic risk measures, a wide range of risk measures that subsumes spectral risk measure, optimized certainty equivalent, and distortion risk measures, among others. We establish both regret upper bounds and lower bounds. Notably, our upper bounds demonstrate optimal dependencies on the number of actions and episodes while reflecting the inherent trade-off between risk sensitivity and sample complexity. Additionally, we substantiate our theoretical results through numerical experiments.
Hao Liang · Zhi-Quan Luo 🔗 
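
As one familiar member of the families listed in the abstract above, conditional value at risk (CVaR) is a spectral risk measure. The sketch below computes a simple empirical, static CVaR of sampled returns; it does not implement the paper's dynamic risk measures or algorithms. The estimator (mean of the worst `ceil(alpha * n)` samples) is an assumption that is exact only when `alpha * n` is an integer, and the function name is hypothetical.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst ceil(alpha * n) returns (lower tail, higher is better)."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    k = math.ceil(alpha * len(returns))
    worst = sorted(returns)[:k]
    return sum(worst) / k


sample_returns = [4.0, 1.0, 3.0, 2.0]
print(empirical_cvar(sample_returns, 0.5))  # risk-averse: looks only at the worst half
print(empirical_cvar(sample_returns, 1.0))  # risk-neutral: the plain mean
```

Smaller values of `alpha` focus on worse outcomes, which is the risk-sensitivity knob that trades off against sample complexity in the abstract.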


AbODE: Ab initio antibody design using conjoined ODEs
(
Poster
)
link »
Antibodies are Y-shaped proteins that neutralize pathogens and constitute the core of our adaptive immune system. De novo generation of new antibodies that target specific antigens holds the key to accelerating vaccine discovery. However, this co-design of the amino acid sequence and the 3D structure subsumes, and accentuates, some central challenges from multiple tasks, including protein folding (sequence to structure), inverse folding (structure to sequence), and docking (binding). We strive to surmount these challenges with a new generative model AbODE that extends graph PDEs to accommodate both contextual information and external interactions. Unlike existing approaches, AbODE uses a single round of full-shot decoding, and elicits continuous differential attention that encapsulates, and evolves with, latent interactions within the antibody as well as those involving the antigen. We unravel fundamental connections between AbODE and temporal networks as well as graph-matching networks. The proposed model significantly outperforms existing methods on standard metrics across benchmarks.
Yogesh Verma · Markus Heinonen · Vikas K Garg 🔗 


Randomized methods for computing optimal transport without regularization and their convergence analysis
(
Poster
)
link »
The optimal transport (OT) problem can be reduced to a linear programming (LP) problem through discretization. In this paper, we introduce the random block coordinate descent (RBCD) methods to directly solve this LP problem. Our approach involves restricting the potentially large-scale optimization problem to small LP subproblems constructed via randomly chosen working sets. By using a random Gauss-Southwell-$q$ rule to select these working sets, we equip the vanilla version ($\bf \text{RBCD}_0$) with almost sure convergence and a linear convergence rate to solve general standard LP problems. To further improve the efficiency of the ($\bf \text{RBCD}_0$) method, we explore the special structure of constraints in the OT problems and propose several approaches for refining the random working set selection and accelerating the vanilla method. Our preliminary numerical experiments demonstrate that the accelerated random block coordinate descent ($\bf \text{ARBCD}$) method is comparable to Sinkhorn's algorithm when seeking solutions with relatively high accuracy, and offers the advantage of saving memory.

Yue Xie · Zhongjian Wang · Zhiwen Zhang 🔗 


Sublinear Regret in Adaptive Model Predictive Control
(
Poster
)
link »
We consider the problem of adaptive Model Predictive Control (MPC) for uncertain linear systems with additive disturbances and with state and input constraints. We present STT-MPC (Self-Tuning Tube-based Model Predictive Control), an online algorithm that combines the certainty-equivalence principle and polytopic tubes. Specifically, at any given step, STT-MPC infers the system dynamics using the Least Squares Estimator (LSE), and applies a controller obtained by solving an MPC problem using these estimates. Polytopic tubes are used so that, despite the uncertainties, state and input constraints are satisfied, and recursive feasibility and asymptotic stability hold. In this work, we analyze the regret of the algorithm when compared to an oracle algorithm initially aware of the system dynamics. We establish that the expected regret of STT-MPC does not exceed $O(T^{1/2 + \epsilon})$, where $\epsilon \in (0,1)$ is a design parameter tuning the persistent excitation component of the algorithm. Our result relies on a recently proposed exponential decay of sensitivity property and, to the best of our knowledge, is the first of its kind in this setting. We illustrate the performance of our algorithm using a simple numerical example.

Damianos Tranos · Alexandre Proutiere 🔗 
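
The estimation step mentioned in the abstract above (inferring the dynamics with a least squares estimator) can be sketched for a scalar system $x_{t+1} = a x_t + b u_t$. This is only the estimator, not the tube-based MPC; the scalar setting, the noise-free data, the function names, and the singularity threshold are assumptions made to keep the example short and deterministic.

```python
def simulate(a, b, x0, inputs):
    """Roll out the noise-free scalar system x[t+1] = a*x[t] + b*u[t]."""
    states = [x0]
    for u in inputs:
        states.append(a * states[-1] + b * u)
    return states


def least_squares_ab(states, inputs):
    """Fit x[t+1] ~ a*x[t] + b*u[t] by solving the 2x2 normal equations."""
    prev, nxt = states[:-1], states[1:]
    sxx = sum(x * x for x in prev)
    sxu = sum(x * u for x, u in zip(prev, inputs))
    suu = sum(u * u for u in inputs)
    sxy = sum(x * y for x, y in zip(prev, nxt))
    suy = sum(u * y for u, y in zip(inputs, nxt))
    det = sxx * suu - sxu * sxu
    if abs(det) < 1e-12:
        # Without sufficiently exciting data, (a, b) cannot be identified.
        raise ValueError("data is not sufficiently exciting")
    a_hat = (sxy * suu - suy * sxu) / det
    b_hat = (suy * sxx - sxy * sxu) / det
    return a_hat, b_hat


inputs = [1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.5, -0.5]  # hypothetical exciting inputs
states = simulate(0.9, 0.5, 1.0, inputs)
a_hat, b_hat = least_squares_ab(states, inputs)
print(a_hat, b_hat)
```

The explicit check on the determinant hints at why the abstract includes a persistent excitation component: the estimate is only well defined when the collected data vary enough.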


Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation
(
Poster
)
link »
We propose a new model, \emph{independent linear Markov game}, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with \emph{independent} linear function approximation, where each agent has its own function approximation for the state-action value functions that are {\it marginalized} by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with \emph{each agent's own function class complexity}, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds that scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle {\it non-stationarity} incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games, with applications in learning in congestion games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in \cite{daskalakis2022complexity}, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters.
Furthermore, we design the first provably efficient algorithm for learning Markov CE that breaks the curse of multiagents.

Qiwen Cui · Kaiqing Zhang · Simon Du 🔗 


Offline Goal-Conditioned RL with Latent States as Actions
(
Poster
)
link »
In the same way that unsupervised pre-training has become the bedrock for computer vision and NLP, goal-conditioned RL might provide a similar strategy for making use of vast quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL, ones that can learn directly from offline data, is challenging because it is hard to accurately estimate the exact state value of reaching faraway goals. Nonetheless, goal-reaching problems exhibit structure: reaching a distant goal entails visiting some closer states (or representations thereof) first. Importantly, it is easier to assess the effect of actions on getting to these closer states. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that predicts (a representation of) a waypoint, and a low-level policy that predicts the action for reaching this waypoint. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data.
Seohong Park · Dibya Ghosh · Benjamin Eysenbach · Sergey Levine 🔗 


Variational quantum dynamics of two-dimensional rotor models
(
Poster
)
link »
We present a simulation method for the dynamics of continuous-variable quantum many-body systems based on neural-network quantum states. The focus is put on dynamics of experimentally relevant two-dimensional quantum rotors. We simulate previously unreachable system sizes and simulation times using a neural-network trial wavefunction in a continuous basis and using modern sampling approaches based on Hamiltonian Monte Carlo. The method is demonstrated to be able to access quantities like the return probability and vorticity oscillations after a quantum quench in two-dimensional systems of up to 64 (8 $\times$ 8) coupled rotors. Our approach can be used for accurate non-equilibrium simulations of continuous systems at previously unexplored system sizes and evolution times, bridging the gap between simulation and experiment.

Matija Medvidović · Dries Sels 🔗 


Sample Complexity of Hierarchical Decompositions in Markov Decision Processes
(
Poster
)
link »
Hierarchical Reinforcement Learning (HRL) algorithms perform planning at multiple levels of abstraction. Algorithms that leverage states or temporal abstractions have empirically demonstrated a gain in sample efficiency. Yet, the basis of those efficiency gains is not fully understood and we still lack theoretically-grounded design rules to implement HRL algorithms. Here, we derive a lower bound on the sample complexity for the proposed class of goal-conditioned HRL algorithms (such as Dot-2-Dot \cite{beyret2019dot}) that inspires a novel Q-learning algorithm and highlights the relationship between the properties of the decomposition and the sample complexity. Specifically, the proposed lower bound on the sample complexity of such HRL algorithms allows us to quantify the benefits of hierarchical decomposition. These theoretical findings guide the formulation of a simple Q-learning-type algorithm that leverages goal hierarchical decomposition. We then empirically validate our lower bound by investigating the sample complexity of the proposed hierarchical algorithm on a spectrum of tasks. Our tasks were designed to allow us to dial up or down their complexity over multiple orders of magnitude. Our theoretical and algorithmic results provide a clear step towards understanding the foundational question of quantifying the efficiency gains induced by hierarchies in reinforcement learning.
Arnaud Robert · Ciara Pike-Burke · Aldo Faisal 🔗 


Boosting Off-policy RL with Policy Representation and Policy-extended Value Function Approximator
(
Poster
)
link »
Off-policy Reinforcement Learning (RL) is fundamental to realizing intelligent decision-making agents by trial and error. The most notorious issue of off-policy RL is known as the Deadly Triad, i.e., Bootstrapping, Function Approximation, and Off-policy Learning. Despite recent advances in bootstrapping algorithms with better bias control, improvements on the latter two factors are relatively less studied. In this paper, we propose a general off-policy RL algorithm based on policy representation and policy-extended value function approximator (PeVFA). Orthogonal to better bootstrapping, our improvement is twofold. On one hand, PeVFA's nature in fitting the value functions of multiple policies according to corresponding low-dimensional policy representation offers preferable function approximation with less interference and better generalization. On the other hand, PeVFA and policy representation allow us to perform off-policy learning in a more general and sufficient manner. Specifically, we perform additional value learning for proximal historical policies along the learning process. This drives the value generalization from learned policies and, in turn, leads to more efficient learning. We evaluate our algorithms on continuous control tasks and the empirical results demonstrate consistent improvements in terms of efficiency and stability.
Min Zhang · Jianye Hao · Hongyao Tang · Yan Zheng 🔗 


Guide Your Agent with Adaptive Multimodal Rewards
(
Poster
)
link »
Recent work has shown that incorporating pre-trained multimodal representations can enhance the ability of an instruction-following agent to generalize to unseen situations. Yet training such agents often requires a dataset consisting of diverse demonstrations, which may not be available for target domains and can incur a huge cost to collect. In this paper, we instead propose to utilize the knowledge captured within large vision-language models for improving the generalization capability of control agents. To this end, we present Multimodal Reward Decision Transformer (MRDT), a simple yet effective method that uses the visual-text alignment score as a reward. This reward, which adapts based on the progress towards achieving the text-specified goals, is used to train a return-conditioned policy that guides the agent towards the desired goals. We also introduce a fine-tuning scheme that adapts pre-trained multimodal models using in-domain data to improve the quality of rewards. Our experiments demonstrate that MRDT significantly improves generalization performance in test environments with unseen goals. Moreover, we introduce new metrics for evaluating the quality of multimodal rewards and show that generalization performance increases as the quality of rewards improves.
Changyeon Kim · Younggyo Seo · Hao Liu · Lisa Lee · Jinwoo Shin · Honglak Lee · Kimin Lee 🔗 


Neural Optimal Transport with Lagrangian Costs
(
Poster
)
link »
Computational efforts in optimal transport traditionally revolve around the squared-Euclidean cost. In this work, we choose to investigate the optimal transport problem between probability measures when the underlying metric space is non-Euclidean, or when the cost function is understood to satisfy a least action principle, also known as a Lagrangian cost. These two generalizations are useful when connecting observations from a physical system, where the transport dynamics are influenced by the geometry of the system, such as obstacles, and allow practitioners to incorporate a priori knowledge of the underlying system. Examples include barriers for transport, or enforcing a certain geometry, i.e., paths must be circular. We demonstrate the effectiveness of this formulation on existing synthetic examples in the literature, where we solve the optimal transport problems in the absence of regularization, which is novel in the literature. Our contributions are of computational interest, where we demonstrate the ability to efficiently compute geodesics and amortize spline-based paths.
Aram-Alexandre Pooladian · Carles Domingo i Enrich · Ricky T. Q. Chen · Brandon Amos 🔗 


Deep Equilibrium Based Neural Operators for Steady-State PDEs
(
Poster
)
link »
Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We seek to remedy this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. To achieve this, we first demonstrate that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point, resulting in $\mathcal{O}(1)$ training memory. Our experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters in predicting the solution to steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. Finally, we show FNO-DEQ is more robust when trained with datasets with more noisy observations than the FNO-based baselines, demonstrating the benefits of using appropriate inductive biases in architectural design for different neural network based PDE solvers. Further, we show a universal approximation result that demonstrates that FNO-DEQ can approximate the solution to any steady-state PDE that can be written as a fixed point equation.

Tanya Marwah · Ashwini Pokle · Zico Kolter · Zachary Lipton · Jianfeng Lu · Andrej Risteski 🔗 
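
The observation in the abstract above, that a steady-state PDE solution can be written as a fixed point of an operator, can be seen in a classical setting. The sketch below is not FNO-DEQ and involves no learning: it assumes the 1D Poisson problem $-u'' = f$ on $(0, 1)$ with zero boundary values, a finite-difference grid, and the (linear) Jacobi update as the operator $T$, then iterates $u \leftarrow T(u)$ until it stops changing. All names and tolerances are hypothetical.

```python
def jacobi_operator(u, f, h):
    """Apply T once: u_i <- (u_{i-1} + u_{i+1} + h^2 * f_i) / 2, zero boundaries."""
    padded = [0.0] + u + [0.0]
    return [
        0.5 * (padded[i - 1] + padded[i + 1] + h * h * f[i - 1])
        for i in range(1, len(u) + 1)
    ]


def solve_fixed_point(f, h, tol=1e-12, max_iter=100_000):
    """Iterate u <- T(u) from zero until successive iterates differ by less than tol."""
    u = [0.0] * len(f)
    for _ in range(max_iter):
        new_u = jacobi_operator(u, f, h)
        if max(abs(a - b) for a, b in zip(new_u, u)) < tol:
            return new_u
        u = new_u
    raise RuntimeError("fixed-point iteration did not converge")


n = 9                      # interior grid points
h = 1.0 / (n + 1)          # grid spacing
f = [2.0] * n              # with f = 2 the exact solution is u(x) = x * (1 - x)
u_star = solve_fixed_point(f, h)
exact = [(i * h) * (1.0 - i * h) for i in range(1, n + 1)]
print(max(abs(a - b) for a, b in zip(u_star, exact)))
```

A deep equilibrium model replaces the hand-written operator with a learned layer and the plain iteration with a root solver, but the steady state is characterized in the same way, as the point left unchanged by the operator.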


Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL
(
Poster
)
link »
Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight that leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find TSRL achieves great performance on small benchmark datasets with as few as 1\% of the original samples, which significantly outperforms the recent offline RL algorithms in terms of data efficiency and generalizability.
PENG CHENG · Xianyuan Zhan · Zhihao Wu · Wenjia Zhang · Youfang Lin · Shou cheng Song · Han Wang 🔗 


Nonlinear Wasserstein Distributionally Robust Optimal Control
(
Poster
)
link »
This paper presents a novel approach to addressing the distributionally robust nonlinear model predictive control (DRNMPC) problem. Current literature primarily focuses on the static Wasserstein distributionally robust optimal control problem with a prespecified ambiguity set of uncertain system states. Although a few studies have tackled the dynamic setting, a practical algorithm remains elusive. To bridge this gap, we introduce a DRNMPC scheme that dynamically controls the propagation of ambiguity, based on the constrained iterative linear quadratic regulator. Theoretical results are also provided to characterize the stochastic error reachable sets under ambiguity. We evaluate the effectiveness of our proposed iterative DRMPC algorithm by comparing the closed-loop performance of feedback and open-loop on a mass-spring system, and demonstrate in numerical experiments that our algorithm controls the propagated Wasserstein ambiguity.
Zhengang Zhong · Jia-Jie Zhu 🔗 


Trajectory Generation, Control, and Safety with Denoising Diffusion Probabilistic Models
(
Poster
)
link »
We present a framework for safety-critical optimal control of physical systems based on denoising diffusion probabilistic models (DDPMs). The technology of control barrier functions (CBFs), encoding desired safety constraints, is used in combination with DDPMs to plan actions by iteratively denoising trajectories through a CBF-based guided sampling procedure. At the same time, the generated trajectories are also guided to maximize a future cumulative reward representing a specific task to be optimally executed. The proposed scheme can be seen as an offline and model-based reinforcement learning algorithm resembling in its functionalities a model-predictive control optimization scheme with receding horizon in which the selected actions lead to optimal and safe trajectories.
Nicolò Botteghi · Federico Califano · University Twente · Christoph Brune 🔗 


Coupled Gradient Flows for Strategic Non-Local Distribution Shift
(
Poster
)
link »
We propose a novel framework for analyzing the dynamics of distribution shift in real-world systems that captures the feedback loop between learning algorithms and the distributions on which they are deployed. Prior work largely models feedback-induced distribution shift as adversarial or via an overly simplistic distribution-shift structure. In contrast, we propose a coupled partial differential equation model that captures fine-grained changes in the distribution over time by accounting for complex dynamics that arise due to strategic responses to algorithmic decision-making, non-local endogenous population interactions, and other exogenous sources of distribution shift. We consider two common settings in machine learning: cooperative settings with information asymmetries, and competitive settings where a learner faces strategic users. For both of these settings, when the algorithm retrains via gradient descent, we prove asymptotic convergence of the retraining procedure to a steady state, both in finite and in infinite dimensions, obtaining explicit rates in terms of the model parameters. To do so, we derive new results on the convergence of coupled PDEs that extend what is known on multi-species systems. Empirically, we show that our approach captures well-documented forms of distribution shifts like polarization and disparate impacts that simpler models cannot capture.
Lauren Conger · Franca Hoffmann · Eric Mazumdar · Lillian Ratliff 🔗 


Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations
(
Poster
)
link »
In real-world reinforcement learning (RL) systems, various forms of impaired observability can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds, of the form $\tilde{\mathcal{O}}(\sqrt{{\rm poly}(H) SAK})$, for RL in both the delayed and missing observation settings. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability.

Minshuo Chen · Yu Bai · H. Vincent Poor · Mengdi Wang 🔗 


LEAD: Min-Max Optimization from a Physical Perspective
(
Poster
)
link »
Adversarial formulations have rekindled interest in two-player min-max games. A central obstacle in the optimization of such games is the rotational dynamics that hinder their convergence. In this paper, we show that game optimization shares dynamic properties with particle systems subject to multiple forces, and one can leverage tools from physics to improve optimization dynamics. Inspired by the physical framework, we propose LEAD, an optimizer for min-max games. Next, using Lyapunov stability theory from dynamical systems as well as spectral analysis, we study LEAD’s convergence properties in continuous and discrete time settings for a class of quadratic min-max games to demonstrate linear convergence to the Nash equilibrium. Finally, we empirically evaluate our method on synthetic setups and CIFAR-10 image generation to demonstrate improvements in GAN training.
Reyhane Askari Hemmat · Amartya Mitra · Guillaume Lajoie · Ioannis Mitliagkas 🔗 
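
The rotational dynamics named in the abstract above can be reproduced on the textbook bilinear game $f(x, y) = xy$, where $x$ minimizes and $y$ maximizes. The sketch below runs plain simultaneous gradient descent-ascent; it is not the LEAD optimizer and only illustrates the obstacle. The step size, starting point, and function name are assumptions.

```python
import math


def simultaneous_gda(x, y, lr, steps):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    for _ in range(steps):
        grad_x, grad_y = y, x  # partial derivatives of f with respect to x and y
        x, y = x - lr * grad_x, y + lr * grad_y
    return x, y


x_final, y_final = simultaneous_gda(1.0, 0.0, lr=0.1, steps=100)
print(math.hypot(x_final, y_final))
```

Each update rotates the iterate around the equilibrium at the origin and scales its distance by $\sqrt{1 + \mathrm{lr}^2}$, so the iterates spiral outward instead of converging, which is the behaviour a game optimizer has to counteract.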


Stochastic Linear Bandits with Unknown Safety Constraints and Local Feedback
(
Poster
)
link »
In many real-world decision-making tasks, e.g. clinical trials, the agents must satisfy a diverse set of unknown safety constraints at all times while getting feedback only on the safety constraints relevant to the chosen action, e.g. the ones close to violation. In this work, we study stochastic linear bandits with such unknown safety constraints and local safety feedback. The agent's goal is to maximize the cumulative reward while satisfying \textit{multiple unknown affine or nonlinear} safety constraints. At each time step, the agent receives noisy feedback on a particular safety constraint \textit{only if} the chosen action belongs to the associated constraint set, i.e. local safety feedback. For this setting, we design upper confidence bound and Thompson Sampling-based algorithms. In the design of these algorithms, we carefully prescribe an additional exploration incentive that guarantees the selection of high-reward actions that are also safe and ensures sufficient exploration in the relevant constraint sets to recover the optimal safe action. We show that for $M$ distinct constraints, both of these algorithms attain $\tilde{\mathcal{O}}(\sqrt{MT})$ regret after $T$ time steps without any safety violations. We empirically study the performance of the proposed algorithms under various safety constraints and with a real-world credit dataset. We show that both algorithms safely explore and quickly recover the optimal safe actions.

Nithin Varma · Sahin Lale · Anima Anandkumar 🔗 


Distributional Distance Classifiers for Goal-Conditioned Reinforcement Learning
(
Poster
)
link »
What does it mean to find the shortest path in stochastic environments, where every strategy has a nonzero probability of failing? At the core of this question is a conflict between two seemingly natural notions of planning: maximizing the probability of reaching a goal state, and minimizing the expected number of steps to reach that goal state. Reinforcement learning (RL) methods based on minimizing the steps to a goal make an implicit assumption: that the goal is always reached, at least within some finite horizon. This assumption is violated in practical settings and can lead to very suboptimal strategies. In this paper, we bridge the gap between these two notions of planning by estimating the probability of reaching the goal at different horizons. This is not the same as estimating the distance to the goal; rather, probabilities convey uncertainty in ever reaching the goal at all. We then propose an algorithm for estimating these probabilities. The update rule resembles distributional RL but is used to solve (reward-free) goal-reaching tasks rather than (single) reward-maximization tasks. Taken together, we believe that our results provide a cogent framework for thinking about probabilities and distances in stochastic settings, along with a practical and effective algorithm for solving goal-reaching problems in many settings.
Ravi Tej Akella · Benjamin Eysenbach · Jeff Schneider · Ruslan Salakhutdinov 🔗 


Taylorformer: Probabilistic Modelling for Random Processes including Time Series
(
Poster
)
link »
We propose the Taylorformer for random processes such as time series. Its two key components are: 1) the LocalTaylor wrapper which adapts Taylor approximations (used in dynamical systems) for use in neural network-based probabilistic models, and 2) the MHA-X attention block which makes predictions in a way inspired by how Gaussian Processes' mean predictions are linear smoothings of contextual data. Taylorformer outperforms the state-of-the-art in terms of log-likelihood on 5/6 classic Neural Process tasks such as meta-learning 1D functions, and has at least a 14% MSE improvement on forecasting tasks, including electricity, oil temperatures and exchange rates. Taylorformer approximates a consistent stochastic process and provides uncertainty-aware predictions. Our code is provided in the supplementary material.
Omer Nivron · Raghul Parthipan · Damon Wischik 🔗 


Policy Gradient Algorithms Implicitly Optimize by Continuation
(
Poster
)
link »
Direct policy optimization in reinforcement learning is usually solved with policy-gradient algorithms, which optimize policy parameters via stochastic gradient ascent. This paper provides a new theoretical interpretation and justification of these algorithms. First, we formulate direct policy optimization in the optimization by continuation framework. The latter is a framework for optimizing nonconvex functions where a sequence of surrogate objective functions, called continuations, is locally optimized. Second, we show that optimizing affine Gaussian policies and performing entropy regularization can be interpreted as implicitly optimizing deterministic policies by continuation. Based on these theoretical results, we argue that exploration in policy-gradient algorithms consists in computing a continuation of the return of the policy at hand, and that the variance of policies should be history-dependent functions adapted to avoid local extrema rather than to maximize the return of the policy.
Adrien Bolland · Gilles Louppe · Damien Ernst 🔗 


Randomly Coupled Oscillators for Time Series Processing
(
Poster
)
link »
We investigate a physically inspired recurrent neural network derived from a continuous-time ODE modelling a network of coupled oscillators. Enthralled by the Reservoir Computing paradigm, we introduce the Randomly Coupled Oscillators (RCO) model, which leverages an untrained recurrent component with a smart random initialization. We analyse the architectural bias of RCO and its neural dynamics. We derive sufficient conditions for the model to have a unique asymptotically uniformly stable input-driven solution. We also derive necessary conditions for stability, which make it possible to push the system of oscillators slightly beyond the edge of stability. We empirically assess the effectiveness of RCO in terms of its stability and its long-term memory properties. We compare its performance against both fully trained and randomized recurrent models in a number of time series processing tasks. We find that RCO provides an excellent trade-off between robust long-term memory properties and the ability to predict the behavior of nonlinear, chaotic systems.
Andrea Ceni · Andrea Cossu · Jingyue Liu · Maximilian Stölzle · Cosimo Della Santina · Claudio Gallicchio · Davide Bacciu 🔗 


On a Connection between Differential Games, Optimal Control, and Energy-based Models for Multi-Agent Interactions
(
Poster
)
link »
Game theory offers an interpretable mathematical framework for modeling multi-agent interactions. However, its applicability in real-world robotics applications is hindered by several challenges, such as unknown agents' preferences and goals. To address these challenges, we establish a connection between differential games, optimal control, and energy-based models and demonstrate how existing approaches can be unified under our proposed Energy-based Potential Game formulation. Building upon this formulation, this work introduces a new end-to-end learning application that combines neural networks for game-parameter inference with a differentiable game-theoretic optimization layer, acting as an inductive bias. The experiments using simulated mobile robot-pedestrian interactions and real-world automated driving data provide empirical evidence that the game-theoretic layer improves the predictive performance of various neural network backbones.
Christopher Diehl · Tobias Klosek · Martin Krueger · Nils Murzyn · Torsten Bertram 🔗 


Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning
(
Poster
)
link »
We propose A-Crab (Actor-Critic Regularized by Average Bellman error), a new algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage. Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm, where the critic returns evaluations of the actor (policy) that are pessimistic relative to the offline data and have a small average (importance-weighted) Bellman error. Compared to existing methods, our algorithm simultaneously offers a number of advantages: (1) It achieves the optimal statistical rate of $1/\sqrt{N}$ (where $N$ is the size of the offline dataset) in converging to the best policy covered in the offline dataset, even when combined with general function approximators. (2) It relies on a weaker *average* notion of policy coverage (compared to the $\ell_\infty$ single-policy concentrability) that exploits the structure of policy visitations. (3) It outperforms the data-collection behavior policy over a wide range of specific hyperparameters.

Hanlin Zhu · Paria Rashidinejad · Jiantao Jiao 🔗 


Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Markov Chains
(
Poster
)
link »
Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g. EBMs). The idea is to fit the score of the distribution (i.e. $\nabla_x \log p(x)$), rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there's a clear algorithmic benefit, the statistical "cost" can be steep: recent work by Koehler et al. '23 showed that for distributions that have poor isoperimetric properties (a large Poincaré or log-Sobolev constant), score matching is substantially statistically less efficient than maximum likelihood. However, many natural realistic distributions, e.g. multimodal distributions as simple as a mixture of two Gaussians (even in one dimension), have a poor Poincaré constant. In this paper, we show a close connection between the mixing time of an arbitrary Markov process with generator $\mathcal{L}$ and a generalized score matching loss that tries to fit $\frac{\mathcal{O}p}{p}$. We instantiate this framework with several examples. In the special case of $\mathcal{O} = \nabla_x$, and $\mathcal{L}$ being the generator of Langevin diffusion, this generalizes and recovers the results from Koehler et al. '23. If $\mathcal{L}$ corresponds to a Markov process corresponding to a continuous version of simulated tempering, we show the corresponding generalized score matching loss is a Gaussian-convolution annealed score matching loss, akin to the one proposed in Song-Ermon '19. Moreover, we show that if the distribution being learned is a mixture of $K$ Gaussians in $d$ dimensions, the sample complexity of annealed score matching is polynomial in $d$ and $K$, obviating the Poincaré constant-based lower bounds of the basic score matching loss shown in Koehler et al. This is the first result characterizing the benefits of annealing for score matching, a crucial component in more sophisticated score-based approaches like Song-Ermon '19.

Yilong Qin · Andrej Risteski 🔗 
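
To make concrete why fitting the score avoids the constant of proportionality, here is a minimal sketch. The one-dimensional double-well density, the `scale` argument standing in for the unknown constant, and the finite-difference helper are all assumptions made for illustration; nothing here reproduces the generalized losses studied in the paper.

```python
import numpy as np

def log_unnormalized(x, scale=1.0):
    """log of scale * exp(-x**4 / 4 + x**2 / 2).

    `scale` plays the role of the unknown constant of proportionality.
    """
    return np.log(scale) - x**4 / 4 + x**2 / 2

def score(x, scale=1.0, h=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_unnormalized(x + h, scale) - log_unnormalized(x - h, scale)) / (2 * h)

xs = np.linspace(-2.0, 2.0, 9)
print(score(xs, scale=1.0))
print(score(xs, scale=7.0))  # the same values up to numerical error
```

Because the constant only adds a term that does not depend on x to the log-density, it disappears under differentiation; both calls approximate the analytic score -x^3 + x.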


IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control
(
Poster
)
link »
Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pretrain a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40.
Yingchen Xu · Rohan Chitnis · Bobak Hashemi · Lucas Lehnert · Urun Dogan · Zheqing Zhu · Olivier Delalleau 🔗 


Parallel Sampling of Diffusion Models
(
Poster
)
link »
Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPM-Solver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.
Andy Shih · Suneel Belkhale · Stefano Ermon · Dorsa Sadigh · Nima Anari 🔗 
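
The idea of guessing a whole trajectory and refining it with Picard iterations can be sketched on a toy problem. The scalar `drift` function below is a hypothetical stand-in for one denoising update, and the Euler discretization, step size, and function names are assumptions for illustration only; this is not the ParaDiGMS implementation.

```python
import numpy as np

def drift(x):
    # Hypothetical stand-in for the update direction of one denoising step.
    return -x + np.sin(x)

def sequential_euler(x0, h, T):
    """Reference: T Euler steps computed one after another."""
    xs = [x0]
    for _ in range(T):
        xs.append(xs[-1] + h * drift(xs[-1]))
    return np.array(xs)

def picard_euler(x0, h, T, iters):
    """Picard iteration on the whole trajectory.

    Each sweep evaluates the drift at all T time points at once (these
    evaluations are independent, so they could run in parallel) and then
    rebuilds the trajectory with a cumulative sum.
    """
    xs = np.full(T + 1, x0, dtype=float)  # initial guess: constant trajectory
    for _ in range(iters):
        increments = h * drift(xs[:-1])
        xs = np.concatenate(([x0], x0 + np.cumsum(increments)))
    return xs

reference = sequential_euler(1.0, 0.1, 20)
parallel = picard_euler(1.0, 0.1, 20, iters=20)
print(np.max(np.abs(reference - parallel)))
```

In this sketch each sweep makes at least one additional time step exact, so T sweeps always reproduce the sequential trajectory up to floating-point error. The speed-up claimed for this kind of method relies on reaching an acceptable tolerance in far fewer sweeps than T.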


Fast Approximation of the Generalized Sliced-Wasserstein Distance
(
Poster
)
link »
Generalized sliced-Wasserstein distance is a variant of sliced-Wasserstein distance that exploits the power of nonlinear projection through a given defining function to better capture the complex structures of probability distributions. Similar to the sliced-Wasserstein distance, generalized sliced-Wasserstein is defined as an expectation over random projections which can be approximated by the Monte Carlo method. However, the complexity of that approximation can be expensive in high-dimensional settings. To that end, we propose to form deterministic and fast approximations of the generalized sliced-Wasserstein distance by using the concentration of random projections when the defining functions are polynomial functions or neural-network-type functions. Our approximations hinge upon an important result that one-dimensional projections of a high-dimensional random vector are approximately Gaussian.
Dung Le · Huy Nguyen · Khai Nguyen · Nhat Ho 🔗 
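
For reference, the Monte Carlo approximation that the abstract above seeks to replace can be sketched as follows. This covers only the ordinary (linear-projection) sliced 2-Wasserstein distance between two equal-size samples, not the generalized variant or the paper's deterministic approximation; the function name and defaults are assumptions.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same number of samples n.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random directions drawn uniformly on the unit sphere.
    directions = rng.normal(size=(n_projections, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # In one dimension, optimal transport between equal-size samples
    # matches sorted values, so sort each projected sample.
    x_proj = np.sort(x @ directions.T, axis=0)
    y_proj = np.sort(y @ directions.T, axis=0)
    return np.sqrt(np.mean((x_proj - y_proj) ** 2))

rng = np.random.default_rng(1)
samples = rng.normal(size=(500, 10))
shifted = samples + np.array([3.0] + [0.0] * 9)
print(sliced_wasserstein(samples, samples), sliced_wasserstein(samples, shifted))
```

The cost grows with the number of projections, each requiring a projection and a sort, which is the expense that motivates a deterministic approximation in high dimensions.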


A Policy-Decoupled Method for High-Quality Data Augmentation in Offline Reinforcement Learning
(
Poster
)
link »
Offline reinforcement learning (ORL) has gained attention as a means of training reinforcement learning models using pre-collected static data. To address the issue of limited data and improve downstream ORL performance, recent work has attempted to expand the dataset's coverage through data augmentation. However, most of these methods are tied to a specific policy (policy-dependent), where the generated data can only guarantee to support the current downstream ORL policy, limiting its usage scope on other downstream policies. Moreover, the quality of synthetic data is often not well controlled, which limits the potential for further improving the downstream policy. To tackle these issues, we propose HIgh-quality POlicy-DEcoupled (HIPODE), a novel data augmentation method for ORL. On the one hand, HIPODE generates high-quality synthetic data by selecting states near the dataset distribution with potentially high value among candidate states using the negative sampling technique. On the other hand, HIPODE is policy-decoupled, and thus can be used as a common plug-in method for any downstream ORL process. We conduct experiments on the widely studied TD3-BC and CQL algorithms, and the results show that HIPODE outperforms the state-of-the-art policy-decoupled data augmentation method and most prevalent model-based ORL methods on D4RL benchmarks.
Shixi Lian · Yi Ma · Jinyi Liu · Jianye Hao · Yan Zheng · Zhaopeng Meng 🔗 


On the Generalization Capacities of Neural Controlled Differential Equations
(
Poster
)
link »
We consider a supervised learning setup in which the goal is to predict an outcome from a sample of irregularly sampled time series using Neural Controlled Differential Equations (Kidger, Morrill, et al. 2020). In our framework, the time series is a discretization of an unobserved continuous path, and the outcome depends on this path through a controlled differential equation with unknown vector field. Learning with discrete data thus induces a discretization bias, which we precisely quantify. Using theoretical results on the continuity of the flow of controlled differential equations, we show that the approximation bias is directly related to the approximation error of a Lipschitz function defining the generative model by a shallow neural network. By combining these results with recent work linking the Lipschitz constant of neural networks to their generalization capacities, we upper bound the generalization gap between the expected loss attained by the empirical risk minimizer and the expected loss of the true predictor.
Linus Bleistein · Agathe Guilloux 🔗 


Factor Learning Portfolio Optimization Informed by Continuous-Time Finance Models
(
Poster
)
link »
We study financial portfolio optimization in the presence of unknown and uncontrolled system variables referred to as stochastic factors. Existing work falls into two distinct categories: (i) reinforcement learning employs end-to-end policy learning with flexible factor representation, but does not precisely model the dynamics of asset prices or factors; (ii) continuous-time finance methods, in contrast, take advantage of explicitly modeled dynamics but pre-specify, rather than learn, factor representation. We propose FaLPO (factor learning portfolio optimization), a framework that interpolates between these two approaches. Specifically, FaLPO hinges on deep policy gradient to learn a performant investment policy that takes advantage of flexible representation for stochastic factors. Meanwhile, FaLPO also incorporates continuous-time finance models when modeling the dynamics. It uses the optimal policy functional form derived from such models and optimizes an objective that combines policy learning and model calibration. We prove the convergence of FaLPO and provide performance guarantees via a finite-sample bound. On both synthetic and real-world portfolio optimization tasks, we observe that FaLPO outperforms five leading methods. Finally, we show that FaLPO can be extended to other decision-making problems with stochastic factors.
Sinong Geng · Houssam Nassif · Zhaobin Kuang · A. Max Reppen · Ronnie Sircar 🔗 


Modular Hierarchical Reinforcement Learning for Robotics: Improving Scalability and Generalizability
(
Poster
)
link »
We present a novel software architecture for reinforcement learning applied to robotics that emphasizes modularity and reusability. Our method treats each agent as a plug-and-play ROS node that can be easily integrated into a larger HRL system, similar to using software libraries in programming. This modular approach improves the scalability and generalizability of pretrained reinforcement learning agents. We demonstrate the effectiveness of our method by solving the real-world task of stacking three objects with two different robots that were trained only in simulation. Our results show that the modular approach significantly reduces the training and setup time required compared to a vanilla reinforcement learning baseline. Overall, our work showcases the potential of using trained agents as modules to enable the development of more complex and adaptable robotics applications.
Mihai Anca · Mark Hansen · Matthew Studley 🔗 


Parameterized projected Bellman operator
(
Poster
)
link »
The Bellman operator is a cornerstone of reinforcement learning (RL), widely used from traditional value-based methods to modern actor-critic approaches. In problems with unknown models, the Bellman operator is estimated via transition samples that strongly determine its behavior, as uninformative samples can result in negligible updates or long detours before reaching the fixed point. In this paper, we introduce the novel idea of an operator that acts on the parameters of action-value function approximators. Our novel operator can obtain a sequence of action-value function parameters that progressively approaches the ones of the optimal action-value function. This means that we merge the traditional two-step procedure consisting of applying the Bellman operator and subsequently projecting onto the space of action-value functions. For this reason, we call our novel operator projected Bellman operator (PBO). We formulate an optimization problem to learn PBOs for generic sequential decision-making problems, and we analyze the PBO properties in two representative classes of RL problems. Furthermore, we study the use of PBO under the lens of the approximate value iteration framework, devising algorithmic implementations to learn PBOs in both offline and online settings resorting to neural network regression. Finally, we empirically show how PBO can overcome the limitations of classical methods, opening up new research directions as a novel paradigm in RL.
Théo Vincent · Alberto Maria Metelli · Jan Peters · Marcello Restelli · Carlo D'Eramo 🔗 
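
For readers less familiar with the classical object this abstract builds on, the following minimal sketch applies the Bellman optimality operator to a tabular action-value function with a known model. The array layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for rewards) and the function names are assumptions for illustration; the PBO itself acts on function-approximator parameters and is not implemented here.

```python
import numpy as np

def bellman_optimality_operator(q, P, R, gamma):
    """(TQ)(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')."""
    return R + gamma * P @ q.max(axis=1)

def value_iteration(P, R, gamma, iters=500):
    """Repeatedly apply the operator, starting from Q = 0."""
    q = np.zeros_like(R)
    for _ in range(iters):
        q = bellman_optimality_operator(q, P, R, gamma)
    return q

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S)
R = rng.uniform(size=(n_states, n_actions))
q_star = value_iteration(P, R, gamma)
print(np.abs(bellman_optimality_operator(q_star, P, R, gamma) - q_star).max())
```

The operator is a gamma-contraction in the max norm, which is why repeated application approaches its fixed point, the optimal action-value function.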


Improving Offline-to-Online Reinforcement Learning with Q-Ensembles
(
Poster
)
link »
Offline reinforcement learning (RL) is a learning paradigm where an agent learns from a fixed dataset of experience. However, learning solely from a static dataset can limit the performance due to the lack of exploration. To overcome this, offline-to-online RL combines offline pretraining with online fine-tuning, which enables the agent to further refine its policy by interacting with the environment in real time. Despite its benefits, existing offline-to-online RL methods suffer from performance degradation and slow improvement during the online phase. To tackle these challenges, we propose a novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, we seamlessly bridge offline pretraining and online fine-tuning without degrading performance. Moreover, to expedite online performance enhancement, we appropriately loosen the pessimism of Q-value estimation and incorporate ensemble-based exploration mechanisms into our framework. Experimental results demonstrate that E2O can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods.
Kai Zhao · Yi Ma · Jinyi Liu · Jianye Hao · Yan Zheng · Zhaopeng Meng 🔗 


Model-based Policy Optimization under Approximate Bayesian Inference
(
Poster
)
link »
Model-based reinforcement learning (MBRL) algorithms present an exceptional potential to enhance sample efficiency within the realm of online reinforcement learning (RL). Nevertheless, a substantial proportion of prevalent MBRL algorithms fail to adequately address the dichotomy of exploration and exploitation. Posterior sampling reinforcement learning (PSRL) emerges as an innovative strategy adept at balancing exploration and exploitation, although its theoretical assurances are contingent upon exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by the analysis, we propose an improved factorization for the posterior distribution of policies by removing the conditional independence between the policy and data given the model. By adopting such a posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm can surpass baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DeepMind control suite, OpenAI Gym and Metaworld benchmarks.
Chaoqi Wang · Yuxin Chen · Kevin Murphy 🔗 


Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
(
Poster
)
link »
Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations.
Tony Zhao · Vikash Kumar · Sergey Levine · Chelsea Finn 🔗 


On the Imitation of Non-Markovian Demonstrations: From Low-Level Stability to High-Level Planning
(
Poster
)
link »
We propose a theoretical framework for studying the imitation of stochastic, non-Markovian, potentially multi-modal expert demonstrations in nonlinear dynamical systems. Our framework invokes low-level controllers (either learned or implicit in position-command control) to stabilize imitation policies around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a stochastic continuity property of the learned policy we call "total variation continuity" (TVC), an imitator that accurately estimates actions on the demonstrator's state distribution closely matches the demonstrator's distribution over entire trajectories. We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations.
Adam Block · Daniel Pfrommer · Max Simchowitz 🔗 


On Convergence of Approximate Schrödinger Bridge with Bounded Cost
(
Poster
)
link »
The Schrödinger bridge has demonstrated promising applications in generative models. It is an entropy-regularized optimal-transport (EOT) approach that employs the iterative proportional fitting (IPF) algorithm to solve an alternating projection problem. However, due to the complexity of finding precise solutions for the projections, approximations are often required. In this work, we study the convergence of the IPF algorithm using approximated projections and a bounded cost function. Our results demonstrate an approximate linear convergence with bounded perturbations. While the outcome is not unexpected, the rapid linear convergence towards smooth trajectories suggests the potential to examine the efficiency of the Schrödinger bridge compared to diffusion models.
Wei Deng · Yu Chen · Tianjiao N Yang · Hengrong Du · Qi Feng · Ricky T. Q. Chen 🔗 
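
The alternating projections performed by IPF are easiest to see in the discrete, static setting, where IPF coincides with the Sinkhorn algorithm for entropy-regularized optimal transport. The sketch below uses exact projections on a small bounded cost matrix; the function name, regularization strength, and problem sizes are assumptions for illustration and do not reproduce the approximate, dynamic setting analysed in the paper.

```python
import numpy as np

def sinkhorn_ipf(cost, mu, nu, eps=0.5, iters=500):
    """Discrete IPF (Sinkhorn) for entropy-regularized optimal transport.

    cost: (n, m) bounded cost matrix; mu, nu: marginals summing to 1.
    Returns a coupling whose marginals are approximately (mu, nu).
    """
    kernel = np.exp(-cost / eps)
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(iters):
        u = mu / (kernel @ v)    # projection: match the row marginal mu
        v = nu / (kernel.T @ u)  # projection: match the column marginal nu
    return u[:, None] * kernel * v[None, :]

n = 5
grid = np.linspace(0.0, 1.0, n)
cost = (grid[:, None] - grid[None, :]) ** 2  # bounded in [0, 1]
mu = np.full(n, 1.0 / n)
nu = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
plan = sinkhorn_ipf(cost, mu, nu)
print(plan.sum(axis=1), plan.sum(axis=0))
```

Each half-step enforces one marginal constraint exactly and slightly disturbs the other, and the iterates converge to a coupling that satisfies both.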


Online Control with Adversarial Disturbance for Continuous-time Linear Systems
(
Poster
)
link »
We study online control for continuous-time linear systems with finite sampling rates, where the objective is to design an online procedure that learns under non-stochastic noise and performs comparably to a fixed optimal linear controller. We present a novel two-level online algorithm that integrates a higher-level learning strategy and a lower-level feedback control strategy. This method offers a practical and robust solution for online control, which achieves sublinear regret. Our work provides one of the first non-asymptotic results for controlling continuous-time linear systems with a finite number of interactions with the system.
Jingwei Li · Jing Dong · Baoxiang Wang · Jingzhao Zhang 🔗 


A Flexible Diffusion Model
(
Poster
)
link »
Denoising Diffusion (score-based) generative models have been widely used for modeling various types of complex data, including images, audio, point clouds, and biomolecules. Recently, the deep connection between forward-backward stochastic differential equations (SDEs) and diffusion-based models has been revealed, and several new variants of SDEs are proposed (e.g., sub-VP, critically-damped Langevin) along this line. Despite the empirical success of several hand-crafted forward SDEs, a great quantity of potentially promising forward SDEs remains unexplored. In this work, we propose a general framework for parameterizing the diffusion models, especially the spatial part of the forward SDEs. A systematic formalism is introduced with theoretical guarantees, and its connection with previous diffusion models is leveraged. Finally, we demonstrate the theoretical advantage of our method from the variational optimization perspective. Numerical experiments on synthetic datasets, MNIST and CIFAR-10 are presented to validate the effectiveness of our framework.
Weitao Du · He Zhang · Tao Yang · Yuanqi Du 🔗 


Synthetic Experience Replay
(
Poster
)
link »
A key theme in the past decade has been that when large neural networks and large datasets combine they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to flexibly upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings, in both proprioceptive and pixel-based environments. In offline settings, we observe drastic improvements when upsampling small offline datasets and see that additional synthetic data also allows us to effectively train larger networks. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a significant increase in sample efficiency, without any algorithmic changes. Finally, we open-source our code at https://anonymous.4open.science/r/syntherE717/. 
Cong Lu · Philip Ball · Yee Whye Teh · Jack Parker-Holder 🔗 


Fairness In a Non-Stationary Environment From an Optimal Control Perspective
(
Poster
)
link »
The performance of state-of-the-art machine learning models is observed to degrade in scenarios involving demographic populations that were underrepresented during training. This issue has been extensively studied within a supervised learning framework where data distribution remains unchanged. Nonetheless, real-world use cases often encounter distribution shifts induced by the models in deployment. For example, performance bias against minority users can affect customer retention rates, thereby skewing available data from active users due to the absence of minority user input. This feedback effect further exacerbates the discrepancy across various demographic groups in subsequent time steps. To mitigate this problem, we introduce asymptotic fairness, a criterion that aims at preserving sustained model performance across all demographic populations. In addition, we construct a surrogate retention system, based on existing literature on evolutionary population dynamics, to approximate the dynamics of distribution shifts on active user counts. This system allows the aim of achieving asymptotic fairness to be formulated as an optimal control problem. To evaluate the effectiveness of the proposed method, we design a generic simulation environment that simulates the population dynamics of the feedback effect between user retention and model performance. When we deploy the models to this simulation environment, the optimal control solution, by considering long-term planning, outperforms existing baseline methods.
Zhuotong Chen · Qianxiao Li · Zheng Zhang 🔗 


Physics-informed Localized Learning for Advection-Diffusion-Reaction Systems
(
Poster
)
link »
The global push for new energy solutions, such as geothermal energy and Carbon Capture and Sequestration initiatives, has placed new demands upon the current state-of-the-art subsurface fluid simulators. The requirement to simulate a large number of reservoir states simultaneously in a short period of time has opened the door to machine learning techniques for surrogate modelling. We propose a novel physics-informed and boundary-condition-aware Localized Learning method which extends the Embed-to-Control (E2C) and Embed-to-Control and Observe (E2CO) models to learn local representations of global state variables in an Advection-Diffusion-Reaction system. We show that our model, trained on reservoir simulation data, is able to predict future states of the system for a given set of controls with high accuracy using only a fraction of the available information. It hence reduces training times significantly compared to the original E2C and E2CO models, which benefits its application to optimal control problems. 
Surya Sathujoda · Soham Sheth 🔗 


On the effectiveness of neural priors in modeling dynamical systems
(
Poster
)
link »
Modelling dynamical systems is an integral component for understanding the natural world. To this end, neural networks are becoming an increasingly popular candidate owing to their ability to learn complex functions from large amounts of data. Despite this recent progress, there has not been an adequate discussion on the architectural regularization that neural networks offer when learning such systems, hindering their efficient usage. In this paper, we initiate a discussion in this direction using coordinate networks as a test bed. We interpret dynamical systems and coordinate networks from a signal processing lens, and show that simple coordinate networks with few layers can be used to solve multiple problems in modelling dynamical systems, without any explicit regularizers. 
Sameera Ramasinghe · Hemanth Saratchandran · Violetta Shevchenko · Simon Lucey 🔗 


Bridging Physics-Informed Neural Networks with Reinforcement Learning: Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO)
(
Poster
)
link »
This paper introduces the Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO) algorithm into reinforcement learning. The Hamilton-Jacobi-Bellman (HJB) equation is used in control theory to evaluate the optimality of the value function. Our work combines the HJB equation with reinforcement learning in continuous state and action spaces to improve the training of the value network. We treat the value network as a Physics-Informed Neural Network (PINN) to solve for the HJB equation by computing its derivatives with respect to its inputs exactly. The Proximal Policy Optimization (PPO)-Clipped algorithm is improved by this implementation, as it uses a value network to compute the objective function for its policy network. The HJBPPO algorithm shows improved performance compared to PPO on the MuJoCo environments. 
Amartya Mukherjee · Jun Liu 🔗 
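
For readers unfamiliar with the HJB equation referred to in this abstract, one common continuous-time, discounted form is sketched below. The notation is an assumption made for illustration and does not come from the paper: deterministic dynamics $\dot{x} = f(x, u)$, reward $r(x, u)$, discount rate $\rho$, and a value network $V_\theta$. The second line shows how such an equation could be turned into a PINN-style residual loss; the paper's actual loss may differ.

```latex
% HJB equation (assumed form: deterministic dynamics, discount rate \rho)
\rho\, V(x) = \max_{u} \Big[ r(x, u) + \nabla_x V(x)^{\top} f(x, u) \Big]

% A PINN-style residual loss on sampled states x (illustrative only)
\mathcal{L}_{\mathrm{HJB}}(\theta) =
  \mathbb{E}_{x} \Big[ \Big( \rho\, V_\theta(x)
  - \max_{u} \big[ r(x, u) + \nabla_x V_\theta(x)^{\top} f(x, u) \big] \Big)^{2} \Big]
```

The gradient $\nabla_x V_\theta(x)$ is the quantity obtained exactly by differentiating the value network with respect to its inputs.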


Actor-Critic Methods using Physics-Informed Neural Networks: Control of a 1D PDE Model for Fluid-Cooled Battery Packs
(
Poster
)
link »
This paper proposes an actor-critic algorithm for controlling the temperature of a battery pack using a cooling fluid. This is modeled by a coupled 1D partial differential equation (PDE) with a controlled advection term that determines the speed of the cooling fluid. The Hamilton-Jacobi-Bellman (HJB) equation is a PDE that evaluates the optimality of the value function and determines an optimal controller. We propose an algorithm that treats the value network as a Physics-Informed Neural Network (PINN) to solve for the continuous-time HJB equation rather than a discrete-time Bellman optimality equation, and we derive an optimal controller for the environment that we exploit to achieve optimal control. Our experiments show that a hybrid-policy method that updates the value network using the HJB equation and updates the policy network identically to PPO achieves the best results in the control of this PDE system. 
Amartya Mukherjee · Jun Liu 🔗 


Optimization or Architecture: What Matters in Non-Linear Filtering?
(
Poster
)
link »
In nonlinear filtering, it is traditional to compare nonlinear architectures such as neural networks to the standard linear Kalman Filter (KF). We observe that this methodology mixes the evaluation of two separate components: the nonlinear architecture, and the numeric optimization method. In particular, the nonlinear model is often optimized, whereas the reference KF model is not. We argue that both should be optimized similarly. We suggest the Optimized KF (OKF), which adjusts numeric optimization to the positive-definite KF parameters. We demonstrate how a significant advantage of a neural network over the KF may entirely vanish once the KF is optimized using OKF. This implies that experimental conclusions of certain previous studies were derived from a flawed process. The benefits of OKF over the non-optimized KF are further studied theoretically and empirically, where OKF demonstrates consistently improved accuracy in a variety of problems. 
Ido Greenberg · Netanel Yannay · Shie Mannor 🔗 


Gradient-free training of neural ODEs for system identification and control using ensemble Kalman inversion
(
Poster
)
link »
Ensemble Kalman inversion (EKI) is a sequential Monte Carlo method used to solve inverse problems within a Bayesian framework. Unlike backpropagation, EKI is a gradient-free optimization method that only necessitates the evaluation of artificial neural networks in forward passes. In this study, we examine the effectiveness of EKI in training neural ordinary differential equations (neural ODEs) for system identification and control tasks. To apply EKI to optimal control problems, we formulate inverse problems that incorporate a Tikhonov-type regularization term. Our numerical results demonstrate that EKI is an efficient method for training neural ODEs in system identification and optimal control problems, with runtime and quality of solutions that are competitive with commonly used gradient-based optimizers. 
Lucas Böttcher 🔗 
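
To make the gradient-free idea in this abstract concrete, the following is a minimal sketch of a textbook EKI update with perturbed observations, applied to a toy linear inverse problem rather than a neural ODE. Everything here is an assumption for illustration: the function name `eki_step`, the matrix `A`, the noise covariance `gamma`, and the ensemble size are hypothetical and are not taken from the paper, and the Tikhonov-type regularization mentioned above is omitted.

```python
import numpy as np

def eki_step(theta, forward, y, gamma, rng):
    """One ensemble Kalman inversion update (perturbed-observation variant).

    theta:   (J, d) ensemble of parameter vectors
    forward: maps a (d,) parameter vector to an (m,) prediction
    y:       (m,) observed data
    gamma:   (m, m) observation-noise covariance
    """
    # Only forward evaluations are needed; no gradients of `forward`.
    g = np.array([forward(t) for t in theta])            # (J, m)
    theta_c = theta - theta.mean(axis=0)                 # centred parameters
    g_c = g - g.mean(axis=0)                             # centred predictions
    n_members = theta.shape[0]
    c_tg = theta_c.T @ g_c / (n_members - 1)             # (d, m) cross-covariance
    c_gg = g_c.T @ g_c / (n_members - 1)                 # (m, m) prediction covariance
    # Perturb the data independently for each ensemble member.
    noise = rng.multivariate_normal(np.zeros(len(y)), gamma, size=n_members)
    residual = y + noise - g                             # (J, m)
    # theta_j <- theta_j + C_tg (C_gg + gamma)^{-1} residual_j
    return theta + residual @ np.linalg.solve(c_gg + gamma, c_tg.T)

# Toy linear inverse problem (hypothetical): recover theta_true from y = A @ theta_true.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
theta_true = np.array([1.0, -2.0])
y = A @ theta_true
gamma = 1e-2 * np.eye(5)

ensemble = rng.normal(size=(50, 2))
initial_err = np.linalg.norm(ensemble.mean(axis=0) - theta_true)
for _ in range(20):
    ensemble = eki_step(ensemble, lambda t: A @ t, y, gamma, rng)
final_err = np.linalg.norm(ensemble.mean(axis=0) - theta_true)
print(f"error before: {initial_err:.3f}, after: {final_err:.3f}")
```

In a neural-ODE setting, `forward` would instead integrate the ODE with the given weights and return the simulated trajectory, which is why only forward passes are required.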


What is the Solution for State-Adversarial Multi-Agent Reinforcement Learning?
(
Poster
)
link »
Various methods for Multi-Agent Reinforcement Learning (MARL) have been developed with the assumption that agents' policies are based on accurate state information. However, policies learned through Deep Reinforcement Learning (DRL) are susceptible to adversarial state perturbation attacks. In this work, we propose a State-Adversarial Markov Game (SAMG) and make the first attempt to investigate the fundamental properties of MARL under state uncertainties. Our analysis shows that the commonly used solution concepts of optimal agent policy and robust Nash equilibrium do not always exist in SAMGs. To circumvent this difficulty, we consider a new solution concept called robust agent policy, where agents aim to maximize the worst-case expected state value. We prove the existence of robust agent policy for finite-state and finite-action SAMGs. Additionally, we propose a Robust Multi-Agent Adversarial Actor-Critic (RMA3C) algorithm to learn robust policies for MARL agents under state uncertainties. Our experiments demonstrate that our algorithm outperforms existing methods when faced with state perturbations and greatly improves the robustness of MARL policies. 
Songyang Han · Sanbao Su · Sihong He · Shuo Han · Haizhao Yang · Fei Miao 🔗 
Author Information
Valentin De Bortoli (CNRS, ENS Ulm (projet NORIA))
Charlotte Bunne (ETH Zurich)
Guan-Horng Liu (Georgia Institute of Technology)
Tianrong Chen (Georgia Institute of Technology)
Maxim Raginsky
Pratik Chaudhari (UPenn, AWS)
Melanie Zeilinger (ETH Zurich)
Animashree Anandkumar (Caltech and NVIDIA)
More from the Same Authors

2021 : Continuous Doubly Constrained Batch Reinforcement Learning »
Rasool Fakoor · Jonas Mueller · Kavosh Asadi · Pratik Chaudhari · Alex Smola 
2022 : Recovering Stochastic Dynamics via Gaussian Schrödinger Bridges »
Ya-Ping Hsieh · Charlotte Bunne · Marco Cuturi · Andreas Krause 
2022 : Physics-Informed Neural Operator for Learning Partial Differential Equations »
Zongyi Li · Hongkai Zheng · Nikola Kovachki · David Jin · Haoxuan Chen · Burigede Liu · Kamyar Azizzadenesheli · Animashree Anandkumar 
2022 : Riemannian Diffusion Schrödinger Bridge »
James Thornton · Valentin De Bortoli · Michael Hutchinson · Emile Mathieu · Yee Whye Teh · Arnaud Doucet 
2022 : Recovering Stochastic Dynamics via Gaussian Schrödinger Bridges »
Charlotte Bunne · Ya-Ping Hsieh · Marco Cuturi · Andreas Krause 
2023 : The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold »
Jialin Mao · Han Kheng Teoh · Itay Griniasty · Rahul Ramesh · Rubing Yang · Mark Transtrum · James Sethna · Pratik Chaudhari 
2023 : Unbalanced Diffusion Schrödinger Bridge »
Matteo Pariset · Ya-Ping Hsieh · Charlotte Bunne · Andreas Krause · Valentin De Bortoli 
2023 : Aligned Diffusion Schrödinger Bridges »
Vignesh Ram Somnath · Matteo Pariset · Ya-Ping Hsieh · Maria Rodriguez Martinez · Andreas Krause · Charlotte Bunne 
2023 : Improved sampling via learned diffusions »
Julius Berner · Lorenz Richter · Guan-Horng Liu 
2023 : Game Theoretic Neural ODE Optimizer »
Panagiotis Theodoropoulos · Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou 
2023 : Budgeting Counterfactual for Offline RL »
Yao Liu · Pratik Chaudhari · Rasool Fakoor 
2023 : LeanDojo: Theorem Proving with Retrieval-Augmented Language Models »
Kaiyu Yang · Aidan Swope · Alexander Gu · Rahul Chalamala · Shixing Yu · Saad Godil · Ryan Prenger · Animashree Anandkumar 
2023 : On The Ability of Transformers To Learn Recursive Patterns »
Dylan Zhang · Curt Tigges · Talia Ringer · Stella Biderman · Maxim Raginsky 
2023 : Panel Discussion »
Chenlin Meng · Yang Song · Yilun Xu · Ricky T. Q. Chen · Charlotte Bunne · Arash Vahdat 
2023 Poster: The Value of Out-of-Distribution Data »
Ashwin De Silva · Rahul Ramesh · Carey Priebe · Pratik Chaudhari · Joshua Vogelstein 
2023 Poster: I$^2$SB: Image-to-Image Schrödinger Bridge »
Guan-Horng Liu · Arash Vahdat · De-An Huang · Evangelos Theodorou · Weili Nie · Anima Anandkumar 
2023 Poster: A Picture of the Space of Typical Learnable Tasks »
Rahul Ramesh · Jialin Mao · Itay Griniasty · Rubing Yang · Han Kheng Teoh · Mark Transtrum · James Sethna · Pratik Chaudhari 
2023 Poster: SE(3) diffusion model with application to protein backbone generation »
Jason Yim · Brian Trippe · Valentin De Bortoli · Emile Mathieu · Arnaud Doucet · Regina Barzilay · Tommi Jaakkola 
2023 Tutorial: Optimal Transport in Learning, Control, and Dynamical Systems »
Charlotte Bunne · Marco Cuturi 
2022 : Q/A: Melanie Zeilinger »
Melanie Zeilinger 
2022 : Invited Talk: Melanie Zeilinger »
Melanie Zeilinger 
2022 Poster: Diffusion Models for Adversarial Purification »
Weili Nie · Brandon Guo · Yujia Huang · Chaowei Xiao · Arash Vahdat · Animashree Anandkumar 
2022 Poster: Does the Data Induce Capacity Control in Deep Learning? »
Rubing Yang · Jialin Mao · Pratik Chaudhari 
2022 Spotlight: Diffusion Models for Adversarial Purification »
Weili Nie · Brandon Guo · Yujia Huang · Chaowei Xiao · Arash Vahdat · Animashree Anandkumar 
2022 Spotlight: Does the Data Induce Capacity Control in Deep Learning? »
Rubing Yang · Jialin Mao · Pratik Chaudhari 
2022 Poster: Deep Reference Priors: What is the best way to pretrain a model? »
Yansong Gao · Rahul Ramesh · Pratik Chaudhari 
2022 Spotlight: Deep Reference Priors: What is the best way to pretrain a model? »
Yansong Gao · Rahul Ramesh · Pratik Chaudhari 
2022 Poster: Langevin Monte Carlo for Contextual Bandits »
Pan Xu · Hongkai Zheng · Eric Mazumdar · Kamyar Azizzadenesheli · Animashree Anandkumar 
2022 Poster: Understanding The Robustness in Vision Transformers »
Zhou Daquan · Zhiding Yu · Enze Xie · Chaowei Xiao · Animashree Anandkumar · Jiashi Feng · Jose M. Alvarez 
2022 Spotlight: Understanding The Robustness in Vision Transformers »
Zhou Daquan · Zhiding Yu · Enze Xie · Chaowei Xiao · Animashree Anandkumar · Jiashi Feng · Jose M. Alvarez 
2022 Spotlight: Langevin Monte Carlo for Contextual Bandits »
Pan Xu · Hongkai Zheng · Eric Mazumdar · Kamyar Azizzadenesheli · Animashree Anandkumar 
2021 : Spotlight Set 23  Multi-Scale Representation Learning on Proteins »
Workshop CompBio · Charlotte Bunne 
2021 : Morning Poster Session: JKOnet: Proximal Optimal Transport Modeling of Population Dynamics »
Charlotte Bunne 
2021 : Invited Speaker: Animashree Anandkumar: Stability-aware reinforcement learning in dynamical systems »
Animashree Anandkumar 
2021 : Contributed Talk: JKOnet: Proximal Optimal Transport Modeling of Population Dynamics »
Charlotte Bunne 
2021 : Invited Talk: Maxim Raginsky »
Maxim Raginsky 
2021 Workshop: Workshop on Socially Responsible Machine Learning »
Chaowei Xiao · Animashree Anandkumar · Mingyan Liu · Dawn Song · Raquel Urtasun · Jieyu Zhao · Xueru Zhang · Cihang Xie · Xinyun Chen · Bo Li 
2021 Poster: An Information-Geometric Distance on the Space of Tasks »
Yansong Gao · Pratik Chaudhari 
2021 Spotlight: An Information-Geometric Distance on the Space of Tasks »
Yansong Gao · Pratik Chaudhari 
2021 Poster: Dynamic Game Theoretic Neural Optimizer »
Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou 
2021 Oral: Dynamic Game Theoretic Neural Optimizer »
Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou 
2020 : Q&A: Anima Anandkumar »
Animashree Anandkumar · Jessica Forde 
2020 : Invited Talks: Anima Anandkumar »
Animashree Anandkumar 
2020 Poster: A Free-Energy Principle for Representation Learning »
Yansong Gao · Pratik Chaudhari 
2020 : Mentoring Panel: Doina Precup, Deborah Raji, Anima Anandkumar, Angjoo Kanazawa and Sinead Williamson (moderator). »
Doina Precup · Inioluwa Raji · Angjoo Kanazawa · Sinead A Williamson · Animashree Anandkumar 
2019 Poster: Learning Generative Models across Incomparable Spaces »
Charlotte Bunne · David Alvarez-Melis · Andreas Krause · Stefanie Jegelka 
2019 Oral: Learning Generative Models across Incomparable Spaces »
Charlotte Bunne · David Alvarez-Melis · Andreas Krause · Stefanie Jegelka 
2018 Poster: StrassenNets: Deep Learning with a Multiplication Budget »
Michael Tschannen · Aran Khanna · Animashree Anandkumar 
2018 Oral: StrassenNets: Deep Learning with a Multiplication Budget »
Michael Tschannen · Aran Khanna · Animashree Anandkumar