Recent advances in algorithmic design and principled, theory-driven deep learning architectures have sparked a growing interest in control and dynamical systems theory. Complementarily, machine learning plays an important role in enhancing existing control-theoretic algorithms in terms of performance and scalability. The boundaries between both disciplines are blurring even further with the rise of modern reinforcement learning, a field at the crossroads of data-driven control theory and machine learning. This workshop aims to unravel the mutual relationship between learning, control, and dynamical systems and to shed light on recent parallel developments in different communities. Strengthening the connection between learning and control will open new possibilities for interdisciplinary research.
Fri 12:00 p.m. - 12:45 p.m.
|
On optimal control and machine learning
(
Tutorial
)
SlidesLive Video » This talk tours the optimal control and machine learning methodologies behind recent breakthroughs in the field. These are crucial components for building agents capable of computationally modeling and interacting with our world via planning and reasoning, e.g. for robotics, aircraft, autonomous vehicles, games, economics, finance, and language, as well as agricultural, biomedical, chemical, industrial, and mechanical systems. We will start with 1) a lightweight introduction to optimal control, and then cover 2) machine learning for optimal control --- this includes reinforcement learning and overviews how the powerful abstractive and predictive capabilities of machine learning can drastically improve every part of a control system; and 3) optimal control for machine learning --- surprisingly, in this opposite direction, some machine learning problems can be formulated as control problems and solved with optimal control methods, e.g. parts of diffusion models, optimal transport, and optimizing the parameters of models such as large language models with reinforcement learning. (A minimal optimal-control example is sketched after this entry.) |
Brandon Amos 🔗 |
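As a concrete companion to the tutorial's lightweight introduction to optimal control referenced above, here is a minimal finite-horizon discrete-time LQR solved by the backward Riccati recursion. This is a generic textbook sketch, not code from the talk; the double-integrator dynamics, cost weights, and horizon are illustrative.

```python
import numpy as np

def lqr_riccati(A, B, Q, R, Q_T, horizon):
    """Finite-horizon discrete-time LQR via backward Riccati recursion.

    Dynamics: x_{t+1} = A x_t + B u_t; cost: sum_t x'Qx + u'Ru plus terminal x'Q_T x.
    Returns time-varying feedback gains K_t with u_t = -K_t x_t.
    """
    P = Q_T
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # ordered from t = 0 to T-1

# Toy double-integrator example (illustrative values).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, Q_T = np.eye(2), 0.1 * np.eye(1), 10.0 * np.eye(2)
gains = lqr_riccati(A, B, Q, R, Q_T, horizon=50)

x = np.array([1.0, 0.0])            # start away from the origin
for K_t in gains:
    u = -K_t @ x                     # closed-loop optimal control
    x = A @ x + B @ u
print("final state:", x)             # driven close to the origin
```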
Fri 12:45 p.m. - 1:30 p.m.
|
Two-for-one: diffusion models and force fields for coarse-grained molecular dynamics
(
Presentation
)
SlidesLive Video » In this talk I will cover work from the Microsoft Research AI4Science team on the use of score-based generative modeling for coarse-graining (CG) molecular dynamics simulations. By training a diffusion model on protein structures from molecular dynamics simulations, we show that its score function approximates a force field that can directly be used to simulate CG molecular dynamics. Despite a vastly simplified training setup compared to previous work, our approach leads to improved performance across several small- to medium-sized protein simulations, reproducing the CG equilibrium distribution and preserving dynamics of all-atom simulations such as protein folding events. |
Rianne Van den Berg 🔗 |
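The central relation in the abstract above, that the score of a diffusion model trained on equilibrium structures acts as a coarse-grained force field, can be made concrete with overdamped Langevin dynamics whose drift is the learned score. The sketch below is a generic illustration, not the authors' code: `score_model` is a hypothetical stand-in for a trained score network (here replaced by the analytic score of a Gaussian), and the step size and step count are arbitrary.

```python
import numpy as np

def score_model(x):
    """Hypothetical trained score network returning grad log p(x) for CG coordinates.

    Purely for illustration we use the analytic score of a standard Gaussian;
    a real CG model would be a neural network trained on MD frames.
    """
    return -x

def langevin_cg_simulation(x0, n_steps=1000, eps=1e-2, rng=None):
    """Overdamped Langevin dynamics whose drift is the learned score.

    Because grad log p = F / (k_B T) for a Boltzmann density p ~ exp(-U / k_B T),
    driving the dynamics with the score samples the equilibrium distribution
    the diffusion model was trained on.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        x = x + eps * score_model(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
        traj.append(x.copy())
    return np.stack(traj)

traj = langevin_cg_simulation(x0=np.zeros(3))
print(traj.mean(axis=0), traj.var(axis=0))  # near 0 and 1 for the toy Gaussian score
```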
Fri 1:30 p.m. - 1:45 p.m.
|
Transport, VI, and Diffusions
(
Presentation
)
link »
SlidesLive Video » This paper explores the connections between optimal transport and variational inference, with a focus on forward and reverse time stochastic differential equations and Girsanov transformations. We present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of a novel score-based annealed flow technique and a regularised iterative proportional fitting (IPF)-type objective, departing from the sequential nature of standard IPF. Through a series of generative modelling examples and a double-well-based rare event task, we showcase the potential of the proposed methods. |
Francisco Vargas · Nikolas Nüsken 🔗 |
Fri 1:45 p.m. - 2:30 p.m.
|
Imposing and learning structure in OT displacements through cost engineering
(
Presentation
)
link »
SlidesLive Video » In this talk I will highlight the flexibility provided by the Gangbo-McCann theorem, which gives a generic way to tie Kantorovich dual potential solutions to optimal maps for the Monge problem. We show in particular how setting the ground cost to the squared-Euclidean distance plus a regularizer induces displacements with a structure that is well suited to that regularizer (e.g. sparse if that regularizer is the L1 norm). In more recent work, we propose an approach to learn the parameters of that regularizer. |
Marco Cuturi 🔗 |
Fri 2:30 p.m. - 3:15 p.m.
|
Designing High-Dimensional Closed-Loop Optimal Control Using Deep Neural Networks
(
Presentation
)
link »
SlidesLive Video » Designing closed-loop optimal control for high-dimensional nonlinear systems remains a persistent challenge. Traditional methods, such as solving the Hamilton-Jacobi-Bellman equation, suffer from the curse of dimensionality. Recent studies introduced a promising supervised learning approach, akin to imitation learning, that uses deep neural networks to learn from open-loop optimal control solutions. In this talk, we'll explore this method, highlighting a limitation in its basic form: the distribution mismatch phenomenon, induced by controlled dynamics. To overcome this, we present an improved approach—the initial value problem enhanced sampling method. This method not only provides a theoretical edge over the basic version in the linear-quadratic regulator but also showcases substantial numerical improvement on various high-dimensional nonlinear problems, including the optimal reaching problem of a 7 DoF manipulator. Notably, our method also surpasses the Dataset Aggregation (DAGGER) algorithm, widely adopted in imitation learning, with significant theoretical and practical advantages. |
Jiequn Han 🔗 |
Fri 4:45 p.m. - 5:30 p.m.
|
Safe Learning in Control
(
Presentation
)
SlidesLive Video » In many applications of autonomy in robotics, guarantees that constraints are satisfied throughout the learning process are paramount. We present a controller synthesis technique based on the computation of reachable sets, using optimal control and game theory. Then, we present methods for combining reachability with learning-based methods, to enable performance improvement while maintaining safety, and to move towards safe robot control with learned models of the dynamics and the environment. We will discuss different interaction models with other agents. Finally, we will illustrate these safe learning methods on robotic platforms at Berkeley, discussing applications in automated airspace management and air taxi operations. |
Claire Tomlin 🔗 |
Fri 5:30 p.m. - 5:45 p.m.
|
Bridging RL Theory and Practice with the Effective Horizon
(
Presentation
)
link »
SlidesLive Video » Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy—i.e., when it is optimal to act greedily with respect to the random policy's Q-function—deep RL tends to succeed; when they don't, deep RL tends to fail (a toy check of this property is sketched after this entry). We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also show that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. |
Cassidy Laidlaw · Stuart Russell · Anca Dragan 🔗 |
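A toy check of the property described above, on a tabular MDP: compute the Q-function of the uniformly random policy and the optimal Q-function, and test whether acting greedily on the former is already optimal. This is not the authors' BRIDGE tooling; the two-state MDP, discount factor, and tolerances are illustrative.

```python
import numpy as np

def q_random_policy(P, R, gamma, iters=2000):
    """Q-values of the uniformly random policy: Q(s,a) = R(s,a) + gamma * E[mean_a' Q(s',a')]."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.mean(axis=1)          # value of acting uniformly at random
        Q = R + gamma * (P @ V)     # P has shape (S, A, S')
    return Q

def q_optimal(P, R, gamma, iters=2000):
    """Optimal Q-values via value iteration."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * (P @ V)
    return Q

def greedy_over_random_is_optimal(P, R, gamma=0.95):
    """True iff, in every state, a greedy action w.r.t. Q^random is also an optimal action."""
    Qr, Qs = q_random_policy(P, R, gamma), q_optimal(P, R, gamma)
    greedy = Qr.argmax(axis=1)
    return bool(np.all(np.isclose(Qs[np.arange(len(greedy)), greedy], Qs.max(axis=1), atol=1e-6)))

# Tiny 2-state, 2-action MDP (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(greedy_over_random_is_optimal(P, R))
```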
Fri 6:00 p.m. - 6:45 p.m.
|
Reinforcement Learning and Multi-Agent Reinforcement Learning
(
Presentation
)
SlidesLive Video » Reinforcement learning (RL) has emerged as a powerful paradigm for enabling intelligent agents to solve sequential decision-making problems under uncertainties. It has witnessed remarkable successes in various domains, ranging from game-playing agents to autonomous systems. However, as real-world challenges become increasingly intricate and interconnected, there is a need to go beyond the single-agent framework. Multi-agent reinforcement learning (MARL) is an extension of RL that enables multiple agents to learn and interact, introducing a new dimension of complexity and sophistication. This talk delves into the exciting realm of RL and MARL, exploring the foundational principles, recent advancements, and promising applications of these techniques. We begin by introducing the core concepts of RL. Building upon this foundation, we shift our focus to MARL, where multiple agents learn simultaneously, either cooperating or competing with each other. Then, we examine the challenges posed by MARL, including coordination, communication, and the exploration-exploitation dilemma. |
Giorgia Ramponi 🔗 |
Fri 6:45 p.m. - 7:00 p.m.
|
Modeling Accurate Long Rollouts with Temporal Neural PDE Solvers
(
Presentation
)
link »
SlidesLive Video » Time-dependent partial differential equations (PDEs) are ubiquitous in science and engineering. Recently, mostly due to the high computational cost of traditional solution techniques, deep neural network based surrogates have gained increased interest. The practical utility of such neural PDE solvers relies on their ability to provide accurate, stable predictions over long time horizons, which is a notoriously hard problem. In this work, we present a large-scale analysis of common temporal rollout strategies, identifying the neglect of non-dominant spatial frequency information, often associated with high frequencies in PDE solutions, as the primary pitfall limiting stable, accurate rollout performance. Based on these insights, we draw inspiration from recent advances in diffusion models to introduce PDE-Refiner, a novel model class that enables more accurate modeling of all frequency components via a multi-step refinement process. We validate PDE-Refiner on challenging benchmarks of complex fluid dynamics, demonstrating stable and accurate rollouts that consistently outperform state-of-the-art models, including neural, numerical, and hybrid neural-numerical architectures. Finally, PDE-Refiner's connection to diffusion models enables an accurate and efficient assessment of the model's predictive uncertainty, allowing us to estimate when the surrogate becomes inaccurate. |
Phillip Lippe · Bastiaan Veeling · Paris Perdikaris · Richard E Turner · Johannes Brandstetter 🔗 |
-
|
Analyzing the Sample Complexity of Model-Free Opponent Shaping
(
Poster
)
link »
In mixed-incentive multi-agent environments, methods developed for zero-sum games often yield collectively sub-optimal results. Addressing this, \textit{opponent shaping} (OS) strategies aim to actively guide the learning processes of other agents, empirically leading to enhanced individual and group performances. Early OS methods use higher-order derivatives to shape the learning of co-players, making them unable to anticipate multiple learning steps ahead. Follow-up work, Model-free Opponent Shaping (M-FOS), addresses the shortcomings of earlier OS methods by reframing the OS problem as a meta-game. In the meta-game, a meta-step corresponds to an episode of the "inner" game. The OS meta-state corresponds to the inner policies, while the meta-policy outputs an inner policy at each meta-step. Leveraging model-free optimization techniques, M-FOS learns meta-policies that demonstrate long-horizon opponent shaping, e.g., by discovering a novel extortion strategy in the Iterated Prisoner's Dilemma (IPD). In contrast to early OS methods, there is little theoretical understanding of the M-FOS framework. In this work, we derive sample complexity bounds for M-FOS agents theoretically and support them empirically. To quantify the sample complexity, we adapt the $R_{max}$ algorithm, most prominently used to derive sample bounds for MDPs, as the meta-learner in the M-FOS framework and derive an exponential sample complexity. Our theoretical results are empirically supported in the Matching Pennies environment.
|
Kitty Fung · Qizhen Zhang · Christopher Lu · Timon Willi · Jakob Foerster 🔗 |
-
|
A Best Arm Identification Approach for Stochastic Rising Bandits
(
Poster
)
link »
Stochastic Rising Bandits (SRBs) model sequential decision-making problems in which the expected rewards of the available options increase every time they are selected. This setting captures a wide range of scenarios in which the available options are learning entities whose performance improves (in expectation) over time. While previous works addressed the regret minimization problem, this paper focuses on the fixed-budget Best Arm Identification (BAI) problem for SRBs. In this scenario, given a fixed budget of rounds, we are asked to provide a recommendation about the best option at the end of the identification process. We propose two algorithms to tackle the above-mentioned setting, namely R-UCBE, which resorts to a UCB-like approach, and R-SR, which employs a successive reject procedure. Then, we prove that, with a sufficiently large budget, they provide guarantees on the probability of properly identifying the optimal option at the end of the learning process. Furthermore, we derive a lower bound on the error probability, matched by our R-SR (up to logarithmic factors), and illustrate how the need for a sufficiently large budget is unavoidable in the SRB setting. Finally, we numerically validate the proposed algorithms in both synthetic and real-world environments and compare them with the currently available BAI strategies. |
Alessandro Montenegro · Marco Mussi · Francesco Trovò · Marcello Restelli · Alberto Maria Metelli 🔗 |
-
|
Tendiffpure: Tensorizing Diffusion Models for Purification
(
Poster
)
link »
Diffusion models are effective purification methods in which noise or adversarial perturbations are removed using generative approaches before a pre-existing classifier performs the classification task. However, the efficiency of diffusion models remains a concern, and existing solutions are based on knowledge distillation, which can jeopardize generation quality because of the small number of generation steps. Hence, we propose Tendiffpure, a tensorized diffusion model that compresses diffusion models for purification. Unlike knowledge distillation methods, we directly compress the U-Nets used as backbones of diffusion models using tensor-train decomposition (sketched after this entry), which reduces the number of parameters and captures more spatial information in multi-dimensional data such as images. The space complexity is reduced from $\mathit{O}(N^2)$ to $\mathit{O}(NR^2)$ with $R\leq 4$. Experimental results show that Tendiffpure can more efficiently generate high-quality purified results and outperforms the baseline purification methods on CIFAR-10, FashionMNIST and MNIST datasets for two noise types and one adversarial attack.
|
Zhou Derun · Mingyuan Bai · Qibin Zhao 🔗 |
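To illustrate the compression claim above (from $\mathit{O}(N^2)$ to $\mathit{O}(NR^2)$ parameters), below is a plain tensor-train (TT-SVD) factorization of a small weight tensor. This is a generic TT-SVD sketch, not the Tendiffpure implementation; the tensor shape and `max_rank` are illustrative.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a tensor into tensor-train cores via successive truncated SVDs."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.copy()
    for k in range(len(dims) - 1):
        mat = mat.reshape(r_prev * dims[k], -1)
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, S.size)
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        mat = np.diag(S[:r]) @ Vt[:r]
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract TT cores back into a full tensor (to check the approximation error)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

# A 16x16 "weight matrix" viewed as a 4-way tensor: 256 dense parameters become
# sum_k r_{k-1} * n_k * r_k parameters, i.e. O(N R^2) instead of O(N^2).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4, 4, 4))
cores = tt_svd(W, max_rank=4)
print("TT parameters:", sum(c.size for c in cores), "dense parameters:", W.size)
print("reconstruction error:", np.linalg.norm(tt_reconstruct(cores) - W))
```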
-
|
Continuous Vector Quantile Regression
(
Poster
)
link »
Vector quantile regression (VQR) estimates the conditional vector quantile function (CVQF), a fundamental quantity which fully represents the conditional distribution of $\mathbf{Y}|\mathbf{X}$. VQR is formulated as an optimal transport (OT) problem between a uniform $\mathbf{U}\sim\mu$ and the target $(\mathbf{X},\mathbf{Y})\sim\nu$, the solution of which is a unique transport map, co-monotonic with $\mathbf{U}$. Recently, NL-VQR has been proposed to estimate non-linear CVQFs, together with fast solvers which enabled the use of this tool in practical applications. Despite its utility, the scalability and estimation quality of NL-VQR is limited due to a discretization of the OT problem onto a grid of quantile levels. We propose a novel _continuous_ formulation and parametrization of VQR using partial input-convex neural networks (PICNNs). Our approach allows for accurate, scalable, differentiable and invertible estimation of non-linear CVQFs. We further demonstrate, theoretically and experimentally, how continuous CVQFs can be used for general statistical inference tasks: estimation of likelihoods, CDFs, confidence sets, coverage, sampling, and more. This work is an important step towards unlocking the full potential of VQR.
|
Sanketh Vedula · Irene Tallini · Aviv A. Rosenberg · Marco Pegoraro · Emanuele Rodola · Yaniv Romano · Alexander Bronstein 🔗 |
-
|
Informed POMDP: Leveraging Additional Information in Model-Based RL
(
Poster
)
link »
In this work, we generalize the problem of learning through interaction in a POMDP by accounting for additional information that may be available at training time. First, we introduce the informed POMDP, a new learning paradigm offering a clear distinction between the training information and the execution observation. Next, we propose an objective for learning sufficient statistics from the history for the optimal control that leverages this information. We then show that this informed objective consists of learning an environment model from which we can sample latent trajectories. Finally, we show for the Dreamer algorithm that the convergence speed of the policies is sometimes greatly improved on several environments by using this informed environment model. These results and the simplicity of the proposed adaptation advocate for systematically considering any additional information available at training time when learning in a POMDP using model-based RL. |
Gaspard Lambrechts · Adrien Bolland · Damien Ernst 🔗 |
-
|
Embedding Surfaces by Optimizing Neural Networks with Prescribed Riemannian Metric and Beyond
(
Poster
)
link »
From a machine learning perspective, the problem of solving partial differential equations (PDEs) can be formulated as a least-squares minimization problem, where neural networks are used to parametrize PDE solutions. Ideally, a global minimizer of the square loss corresponds to a solution of the PDE. In this paper we start with a special type of nonlinear PDE arising from differential geometry, the isometric embedding equation, which relates to many long-standing open questions in geometry and analysis. We show that the gradient descent method can identify a global minimizer of the least-squares loss function with two-layer neural networks under the assumption of over-parametrization. As a consequence, this solves the surface embedding locally with a prescribed Riemannian metric. We also extend the convergence analysis for gradient descent to higher-order linear PDEs under the over-parametrization assumption. |
Yi Feng · Sizhe Li · Ioannis Panageas · Xiao Wang 🔗 |
-
|
Taylor TD-learning
(
Poster
)
link »
Many reinforcement learning approaches rely on temporal-difference (TD) learning to learn a critic. However, TD-learning updates can be high variance due to their sole reliance on Monte Carlo estimates of the updates. Here, we introduce a model-based RL framework, Taylor TD, which reduces this variance. Taylor TD uses a first-order Taylor series expansion of TD updates. This expansion allows us to analytically integrate over stochasticity in the action choice, and some stochasticity in the state distribution, for the initial state and action of each TD update. We include theoretical and empirical evidence that Taylor TD updates are lower variance than (standard) TD updates. Additionally, we show that Taylor TD has the same stable learning guarantees as (standard) TD-learning under linear function approximation. Next, we combine Taylor TD with the TD3 algorithm (Fujimoto et al., 2018) into TaTD3. We show TaTD3 performs as well as, if not better than, several state-of-the-art model-free and model-based baseline algorithms on a set of standard benchmark tasks. Finally, we include further analysis of the settings in which Taylor TD may be most beneficial to performance relative to standard TD-learning. |
Michele Garibbo · Maxime Robeyns · Laurence Aitchison 🔗 |
-
|
Toward Understanding Latent Model Learning in MuZero: A Case Study in Linear Quadratic Gaussian Control
(
Poster
)
link »
We study the problem of representation learning for control from partial and potentially high-dimensional observations. We approach this problem via direct latent model learning, where one directly learns a dynamical model in some latent state space by predicting costs. In particular, we establish finite-sample guarantees of finding a near-optimal representation function and a near-optimal controller using the directly learned latent model for infinite-horizon time-invariant Linear Quadratic Gaussian (LQG) control. A part of our approach to latent model learning closely resembles MuZero, a recent breakthrough in empirical reinforcement learning, in that it learns latent dynamics implicitly by predicting cumulative costs. A key technical contribution of this work is to prove persistency of excitation for a new stochastic process that arises from our analysis of quadratic regression in our approach. |
Yi Tian · Kaiqing Zhang · Russ Tedrake · Suvrit Sra 🔗 |
-
|
Balancing exploration and exploitation in Partially Observed Linear Contextual Bandits via Thompson Sampling
(
Poster
)
link »
Contextual bandits constitute a popular framework for studying the exploration-exploitation trade-off under finitely many options with side information. In the majority of the existing works, contexts are assumed perfectly observed, while in practice it is more reasonable to assume that they are observed partially. In this work, we study reinforcement learning algorithms for contextual bandits with partial observations. First, we consider different structures for partial observability and their corresponding optimal policies. Subsequently, we present and analyze reinforcement learning algorithms for partially observed contextual bandits with noisy linear observation structures. For these algorithms that utilize Thompson sampling, we establish estimation accuracy and regret bounds under different structural assumptions. |
Hongju Park · Mohamad Kazem Shirani Faradonbeh 🔗 |
-
|
Leveraging Factored Action Spaces for Off-Policy Evaluation
(
Poster
)
link »
In high-stakes decision-making domains such as healthcare and self-driving cars, off-policy evaluation (OPE) can help practitioners understand the performance of a new policy before deployment by using observational data. However, when dealing with problems involving large and combinatorial action spaces, existing OPE estimators often suffer from substantial bias and/or variance. In this work, we investigate the role of factored action spaces in improving OPE. Specifically, we propose and study a new family of decomposed IS estimators that leverage the inherent factorisation structure of actions. We theoretically prove that our proposed estimator achieves lower variance and remains unbiased, subject to certain assumptions regarding the underlying problem structure. Empirically, we demonstrate that our estimator outperforms standard IS in terms of mean squared error and conduct sensitivity analyses probing the validity of various assumptions. Future work should investigate how to design or derive the factorisation for practical problems so as to maximally adhere to the theoretical assumptions. |
Aaman Rebello · Shengpu Tang · Jenna Wiens · Sonali Parbhoo 🔗 |
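As background for the estimator family above: when both the behavior and evaluation policies factorize across action dimensions, the joint importance-sampling (IS) ratio equals the product of per-dimension ratios, which is the starting point for exploiting a factored action space. The toy sketch below only verifies that identity and computes a vanilla IS estimate; it is not the paper's proposed decomposed estimator, and all probabilities and rewards are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3                       # one-step decisions with 3 binary action factors

# Per-factor behavior (mu) and evaluation (pi) probabilities of choosing 1 (illustrative).
mu = np.array([0.5, 0.4, 0.6])
pi = np.array([0.7, 0.3, 0.5])

actions = (rng.random((n, d)) < mu).astype(int)          # sampled from the behavior policy
rewards = actions.sum(axis=1) + rng.normal(0.0, 0.1, n)  # toy reward signal

def factor_prob(p, a):
    """P(a) per factor for a policy choosing factor i = 1 with probability p[i], independently."""
    return np.where(a == 1, p, 1 - p)

# The joint IS weight equals the product of per-factor IS ratios when policies factorize.
w_joint = factor_prob(pi, actions).prod(axis=1) / factor_prob(mu, actions).prod(axis=1)
w_factored = (factor_prob(pi, actions) / factor_prob(mu, actions)).prod(axis=1)
assert np.allclose(w_joint, w_factored)

print("IS estimate of the evaluation-policy value:", np.mean(w_joint * rewards))
```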
-
|
Diffusion Model-Augmented Behavioral Cloning
(
Poster
)
link »
Imitation learning addresses the challenge of learning by observing an expert’s demonstrations without access to reward signals from the environment. Most existing imitation learning methods that do not require interacting with the environment either model the expert distribution as the conditional probability p(a|s) (e.g., behavioral cloning, BC) or the joint probability p(s, a) (e.g., implicit behavioral cloning). Despite its simplicity, modeling the conditional probability with BC usually struggles with generalization. While modeling the joint probability can lead to improved generalization performance, the inference procedure can be time-consuming and it often suffers from manifold overfitting. This work proposes an imitation learning framework that benefits from modeling both the conditional and joint probability of the expert distribution. Our proposed diffusion model-augmented behavioral cloning (DBC) employs a diffusion model trained to model expert behaviors and learns a policy to optimize both the BC loss (conditional) and our proposed diffusion model loss (joint). DBC outperforms baselines in various continuous control tasks in navigation, robot arm manipulation, dexterous manipulation, and locomotion. We design additional experiments to verify the limitations of modeling either the conditional probability or the joint probability of the expert distribution as well as compare different generative models. |
Hsiang-Chun Wang · Shang-Fu Chen · Ming-Hao Hsu · Chun-Mao Lai · Shao-Hua Sun 🔗 |
-
|
Unbalanced Diffusion Schrödinger Bridge
(
Poster
)
link »
Schrödinger bridges (SBs) provide an elegant framework for modeling the temporal evolution of populations in physical, chemical, or biological systems. Such natural processes are commonly subject to changes in population size over time due to the emergence of new species or birth and death events. However, existing neural parameterizations of SBs such as diffusion Schrödinger bridges (DSBs) are restricted to settings in which the endpoints of the stochastic process are both probability measures and assume conservation of mass constraints. To address this limitation, we introduce unbalanced DSBs which model the temporal evolution of marginals with arbitrary finite mass. This is achieved by deriving the time reversal of stochastic differential equations (SDEs) with killing and birth terms. We present two novel algorithmic schemes that comprise a scalable objective function for training unbalanced DSBs and provide a theoretical analysis alongside challenging applications on predicting heterogeneous molecular single-cell responses to various cancer drugs and simulating the emergence and spread of new viral variants. |
Matteo Pariset · Ya-Ping Hsieh · Charlotte Bunne · Andreas Krause · Valentin De Bortoli 🔗 |
-
|
Aligned Diffusion Schrödinger Bridges
(
Poster
)
link »
Diffusion Schrödinger Bridges (DSBs) have recently emerged as a powerful framework for recovering stochastic dynamics via their marginal observations at different time points. Despite numerous successful applications, existing algorithms for solving DSBs have so far failed to utilize the structure of aligned data, which naturally arises in many biological phenomena. In this paper, we propose a novel algorithmic framework that, for the first time, solves DSBs while respecting the data alignment. Our approach hinges on a combination of two decades-old ideas: The classical Schrödinger bridge theory and Doob's $h$-transform. Compared to prior methods, our approach leads to a simpler training procedure with lower variance, which we further augment with principled regularization schemes. This ultimately leads to sizeable improvements across experiments on synthetic and real data, including the tasks of predicting conformational changes in proteins and temporal evolution of cellular differentiation processes.
|
Vignesh Ram Somnath · Matteo Pariset · Ya-Ping Hsieh · Maria Rodriguez Martinez · Andreas Krause · Charlotte Bunne 🔗 |
-
|
Dynamic Feature-based Newsvendor
(
Poster
)
link »
In this paper, we investigate the dynamic feature-based newsvendor problem within a multi-period inventory control setting featuring backlogged demands. Combining the significance of feature information with a multi-stage decision-making framework, we propose a general dynamic contextual newsvendor model. For this general model, we propose the Contextual Value Iteration (CVI) algorithm and obtain its convergence rate to the optimal solution as well as a sample complexity result. Our experimental results also demonstrate that CVI is more efficient than value iteration for the vanilla Markov decision process (MDP). |
Zexing Xu · Ziyi Chen · Xin Chen 🔗 |
-
|
Equivalence Class Learning for GENERIC Systems
(
Poster
)
link »
In recent years, applications of neural networks to the modeling of physical phenomena have attracted much attention. This study proposes a method for learning systems that are described by the GENERIC formalism, which is a combination of analytical mechanics and non-equilibrium thermodynamics. GENERIC systems admit the energy conservation law and the law of increasing entropy under certain conditions. However, designing neural network models that satisfy these conditions is difficult. In this study, we introduce a relaxation model of the GENERIC form, thereby introducing an equivalence class into the set of models. Because the equivalence class of the target model includes a model that can be learned by neural networks, the learned model satisfies the energy conservation law and the law of increasing entropy to high accuracy with respect to the true energy and the true entropy. |
Baige Xu · Yuhan Chen · Takashi Matsubara · Takaharu Yaguchi 🔗 |
-
|
Variational Principle and Variational Integrators for Neural Symplectic Forms
(
Poster
)
link »
In this study, we investigate the variational principle for neural symplectic forms, thereby designing variational integrators for this model. In recent years, neural network models for physical phenomena have been attracting much attention. In particular, the neural symplectic form is a method that can model general Hamiltonian systems, which are not necessarily in canonical form. In this paper, we make the following two contributions regarding this model. Firstly, we show that this model is derived from a variational principle and hence admits the Noether theorem. Secondly, when the trained models are used for simulations, they must be discretized using numerical integrators; however, unless carefully designed, numerical integrators destroy physical laws. |
Yuhan Chen · Baige Xu · Takashi Matsubara · Takaharu Yaguchi 🔗 |
-
|
Action and Trajectory Planning for Urban Autonomous Driving with Hierarchical Reinforcement Learning
(
Poster
)
link »
Reinforcement Learning (RL) has made promising progress in planning and decision-making for Autonomous Vehicles (AVs) in simple driving scenarios. However, existing RL algorithms for AVs fail to learn critical driving skills in complex urban scenarios. First, urban driving scenarios require AVs to handle multiple driving tasks, of which conventional RL algorithms are incapable. Second, the presence of other vehicles in urban scenarios results in a dynamically changing environment, which challenges RL algorithms to plan the action and trajectory of the AV. In this work, we propose an action and trajectory planner using Hierarchical Reinforcement Learning (atHRL), which models the agent behavior hierarchically using mid-level lidar and bird's-eye-view perception. The proposed atHRL method learns to make decisions about the agent's future trajectory and computes target waypoints under continuous settings based on a hierarchical DDPG algorithm. The waypoints planned by the atHRL model are then sent to a low-level controller to generate the steering and throttle commands required for the vehicle maneuver. We empirically verify the efficacy of atHRL through extensive experiments in complex urban driving scenarios that compose multiple tasks with the presence of other vehicles in the CARLA simulator. The experimental results suggest a significant performance improvement compared to the state-of-the-art RL methods. |
Xinyang Lu · Xiaofeng Fan · Tianying Wang 🔗 |
-
|
Accelerated Policy Gradient: On the Nesterov Momentum for Reinforcement Learning
(
Poster
)
link »
Policy gradient methods have recently been shown to enjoy global convergence at a $\Theta(1/t)$ rate in the non-regularized tabular softmax setting. Accordingly, one important research question is whether this convergence rate can be further improved, with only first-order updates. In this paper, we answer the above question from the perspective of momentum by adapting the celebrated Nesterov's accelerated gradient (NAG) method to reinforcement learning (RL), termed *Accelerated Policy Gradient* (APG). To demonstrate the potential of APG in achieving faster global convergence, we start from the bandit setting and formally show that with the true gradient, APG with softmax policy parametrization converges to an optimal policy at a $\tilde{O}(1/t^2)$ rate (a minimal bandit-setting sketch follows this entry). To the best of our knowledge, this is the first characterization of the global convergence rate of NAG in the context of RL. Notably, our analysis relies on one interesting finding: regardless of the initialization, APG ends up reaching a locally-concave regime within finitely many iterations, where it can benefit significantly from the momentum. By means of numerical validation, we confirm that APG exhibits an $\tilde{O}(1/t^2)$ rate in the bandit setting and still preserves the $\tilde{O}(1/t^2)$ rate in various Markov decision process instances, showing that APG can significantly improve the convergence behavior over the standard policy gradient.
|
Yen-Ju Chen · Nai-Chieh Huang · Ping-Chun Hsieh 🔗 |
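A minimal sketch of the bandit setting discussed above: a softmax-parametrized policy updated with the exact gradient and a Nesterov-style lookahead/momentum step. The momentum schedule, learning rate, and reward vector are illustrative and not the paper's exact APG algorithm.

```python
import numpy as np

r = np.array([1.0, 0.8, 0.2])            # bandit mean rewards (illustrative)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_J(theta):
    """Exact gradient of J(theta) = sum_a pi_theta(a) r(a) for the softmax policy:
    dJ/dtheta_a = pi_a * (r_a - pi . r)."""
    pi = softmax(theta)
    return pi * (r - pi @ r)

eta = 0.5
theta = np.zeros(3)
theta_prev = np.zeros(3)
for t in range(1, 2001):
    beta = (t - 1) / (t + 2)             # a standard Nesterov momentum schedule
    y = theta + beta * (theta - theta_prev)   # lookahead point
    theta_prev = theta
    theta = y + eta * grad_J(y)          # gradient *ascent* step taken at the lookahead point

print("final policy:", softmax(theta))   # concentrates on the best arm
```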
-
|
Exponential weight averaging as damped harmonic motion
(
Poster
)
link »
The exponential moving average (EMA) is a commonly used statistic for providing stable estimates of stochastic quantities in deep learning optimization. Recently, EMA has seen considerable use in generative models, where it is computed with respect to the model weights, and significantly improves the stability of the inference model during and after training. While the practice of weight averaging at the end of training is well-studied and known to improve estimates of local optima, the benefits of EMA over the course of training are less understood. In this paper, we derive an explicit connection between EMA and a damped harmonic system between two particles, where one particle (the EMA weights) is drawn to the other (the model weights) via an idealized zero-length spring. We then leverage this physical analogy to analyze the effectiveness of EMA, and propose an improved training algorithm, which we call \methodname{}. Finally, we demonstrate theoretically and empirically several advantages enjoyed by \methodname{} over standard EMA. (The basic weight-EMA update is sketched after this entry.) |
Jonathan Patsenker · Henry Li · Yuval Kluger 🔗 |
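For reference, the weight-EMA update discussed above is the one-line recursion below; the paper's contribution is the analysis of this recursion (as a particle tied to the model weights by an idealized spring) and an improved variant, neither of which is reproduced here. The decay value, model, and training loop are illustrative.

```python
import copy
import torch

def update_ema(ema_model, model, decay=0.999):
    """Exponential moving average of parameters: ema <- decay * ema + (1 - decay) * model."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Usage sketch with a toy regression model and training loop (illustrative).
model = torch.nn.Linear(10, 1)
ema_model = copy.deepcopy(model)         # EMA weights are typically used at inference time
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for _ in range(100):
    x = torch.randn(32, 10)
    loss = (model(x) - x.sum(dim=1, keepdim=True)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    update_ema(ema_model, model)          # update the averaged copy after every optimizer step
```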
-
|
Algorithms for Optimal Adaptation of Diffusion Models to Reward Functions
(
Poster
)
link »
We develop algorithms for adapting pretrained diffusion models to optimize reward functions while retaining fidelity to the pretrained model. We propose a general framework for this adaptation that trades off fidelity to a pretrained diffusion model and achieving high reward. Our algorithms take advantage of the continuous nature of diffusion processes to pose reward-based learning either as a trajectory optimization or continuous state reinforcement learning problem. We demonstrate the efficacy of our approach across several application domains, including the generation of time series of household power consumption and images satisfying specific constraints like the absence of memorized images or corruptions. |
Krishnamurthy Dvijotham · Shayegan Omidshafiei · Kimin Lee · Katie Collins · Deepak Ramachandran · Adrian Weller · Mohammad Ghavamzadeh · Milad Nasresfahani · Ying Fan · Jeremiah Liu 🔗 |
-
|
On learning history-based policies for controlling Markov decision processes
(
Poster
)
link »
Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDPs) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks. |
Gandharv Patil · Aditya Mahajan · Doina Precup 🔗 |
-
|
Visual Dexterity: In-hand Dexterous Manipulation from Depth
(
Poster
)
link »
In-hand object reorientation is necessary for performing many dexterous manipulation tasks, such as tool use in unstructured environments that remain beyond the reach of current robots. Prior works built reorientation systems that assume one or many of the following specific circumstances: reorienting only specific objects with simple shapes, limited range of reorientation, slow or quasi-static manipulation, etc. We overcome these limitations and present a general object reorientation controller that is trained in simulation and evaluated in the real world. Our system uses readings from a single commodity depth camera to dynamically reorient complex objects by any amount in real time. The controller generalizes to new objects not used during training. It even demonstrates some capability of reorienting objects in the air held by a downward-facing hand that must counteract gravity during reorientation. |
Tao Chen · Megha Tippur · Siyang Wu · Vikash Kumar · Edward Adelson · Pulkit Agrawal 🔗 |
-
|
Learning from Sparse Offline Datasets via Conservative Density Estimation
(
Poster
)
link »
Offline reinforcement learning (RL) offers a promising direction for learning policies from pre-collected datasets without requiring further interactions with the environment. However, existing methods struggle to handle out-of-distribution (OOD) extrapolation errors, especially in sparse reward or scarce data settings. In this paper, we propose a novel training algorithm called Conservative Density Estimation (CDE), which addresses this challenge by explicitly imposing constraints on the state-action occupancy stationary distribution. CDE overcomes the limitations of existing approaches, such as the stationary distribution correction method, by addressing the support mismatch issue in marginal importance sampling. Our method achieves state-of-the-art performance on the D4RL benchmark. Notably, CDE consistently outperforms baselines in challenging tasks with sparse rewards or insufficient data, demonstrating the advantages of our approach in addressing the extrapolation error problem in offline RL. |
Zhepeng Cen · Zuxin Liu · Zitong Wang · Yihang Yao · Henry Lam · Ding Zhao 🔗 |
-
|
Undo Maps: A Tool for Adapting Policies to Perceptual Distortions
(
Poster
)
link »
People adapt to changes in their visual field all the time, like when their vision is occluded while driving. Agents trained with RL struggle to do the same. Here, we address how to transfer knowledge acquired in one domain to another when the domains differ in their state representation. For example, a policy may have been trained in an environment where states were represented as colored images, but we would now like to deploy this agent in a domain where images appear black-and-white. We propose TAIL (task-agnostic imitation learning), a framework which learns to undo these kinds of changes between domains in order to achieve transfer. This enables an agent, regardless of the task it was trained for, to adapt to perceptual distortions by first mapping the states in the new domain, such as gray-scale images, back to the original domain where they appear in color, and then by acting with the same policy. Our procedure depends on an optimal transport formulation between trajectories in the two domains, shows promise in simple experimental settings, and resembles algorithms from imitation learning. |
Abhi Gupta · Ted Moskovitz · David Alvarez-Melis · Aldo Pacchiano 🔗 |
-
|
When is Agnostic Reinforcement Learning Statistically Tractable?
(
Poster
)
link »
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $\epsilon$-suboptimal policy with respect to $\Pi$? Towards that end, we introduce a new complexity measure, called the spanning capacity, that depends solely on the set $\Pi$ and is independent of the MDP dynamics. With a generative model, we show that the spanning capacity characterizes PAC learnability for every policy class $\Pi$. However, for online RL, the situation is more subtle. We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional sunflower structure which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as recent developments for reachable-state identification and policy evaluation in reward-free exploration.
|
Gene Li · Zeyu Jia · Alexander Rakhlin · Ayush Sekhari · Nati Srebro 🔗 |
-
|
Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport
(
Poster
)
link »
Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized \textit{conditional flow matching} (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, OT-CFM is the first method to compute dynamic OT in a simulation-free way. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single cell dynamics, unsupervised image translation, and Schrödinger bridge inference. |
Alexander Tong · Nikolay Malkin · Guillaume Huguet · Yanlei Zhang · Jarrid Rector-Brooks · Kilian Fatras · Guy Wolf · Yoshua Bengio 🔗 |
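A compact sketch of the objective described above in its simplest form: pair source and target minibatch samples with a minibatch optimal-transport assignment, interpolate along straight lines, and regress a velocity network onto the constant conditional target velocity. The toy data, network, and hyperparameters are illustrative, and the zero-noise variant shown here omits details of the paper's full formulation.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Velocity-field network v(t, x); a small MLP stands in for the real model.
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def ot_pair(x0, x1):
    """Minibatch OT coupling: re-order x1 by solving the assignment problem on squared costs."""
    cost = torch.cdist(x0, x1).pow(2).cpu().numpy()
    _, cols = linear_sum_assignment(cost)
    return x1[torch.as_tensor(cols)]

for step in range(1000):
    x0 = torch.randn(256, 2)                                    # source: standard Gaussian
    x1 = torch.randn(256, 2) * 0.3 + torch.tensor([4.0, 0.0])   # target: shifted Gaussian (toy data)
    x1 = ot_pair(x0, x1)                                        # minibatch OT coupling
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                                  # straight-line interpolant
    target_v = x1 - x0                                          # conditional target velocity
    pred_v = net(torch.cat([t, xt], dim=1))
    loss = (pred_v - target_v).pow(2).mean()                    # CFM regression loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```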
-
|
In-Context Decision-Making from Supervised Pretraining
(
Poster
)
link »
Large transformer models trained on diverse datasets have shown a remarkable ability to learn in-context, achieving high few-shot performance on tasks they were not explicitly trained to solve. In this paper, we study the in-context learning capabilities of transformers in decision-making problems, i.e., bandits and Markov decision processes. To do so, we introduce and study a supervised pretraining method where the transformer predicts an optimal action given a query state and an in-context dataset of interactions, across a diverse set of tasks. This procedure, while simple, produces an in-context algorithm with several surprising capabilities. We observe that the pretrained transformer can be used to solve a range of decision-making problems, exhibiting both exploration online and conservatism offline, despite not being explicitly trained to do so. It also generalizes beyond the pretraining distribution to new tasks and automatically adapts its decision-making strategies to unknown structure. Theoretically, we show the pretrained transformer can be viewed as an implementation of posterior sampling. We further leverage this connection to provide guarantees on its regret, and prove that it can learn a decision-making algorithm stronger than a source algorithm used to generate its pretraining data. These results suggest a promising yet simple path towards instilling strong in-context decision-making abilities in transformers. |
Jonathan Lee · Annie Xie · Aldo Pacchiano · Yash Chandak · Chelsea Finn · Ofir Nachum · Emma Brunskill 🔗 |
-
|
Statistics estimation in neural network training: a recursive identification approach
(
Poster
)
link »
A common practice in mini-batch neural network training is to estimate global statistics using exponential moving averages (EMA). However, such methods can be sensitive to the EMA decay parameter, which is typically set by hand. In this paper, we introduce Adaptive Linear State Estimation (ALiSE), an online method for adapting the parameters of a linear estimation model such as an EMA. Our work establishes a connection between parameter estimation methods in deep learning, including ALiSE, and recursive identification techniques in control theory. We apply ALiSE to a range of deep learning scenarios and show that it can learn sensible schedules for the EMA decay parameter. Compared to the naive EMA baseline, ALiSE leads to matching or accelerated convergence during training. |
Ruth Crasto · Xuchan Bao · Roger Grosse 🔗 |
-
|
Learning to Optimize with Recurrent Hierarchical Transformers
(
Poster
)
link »
Learning to optimize (L2O) has received a lot of attention recently because of its potential to leverage data to outperform hand-designed optimization algorithms such as Adam. Typically, these learned optimizers are meta-learned on optimization tasks to achieve rapid convergence. However, they can suffer from high meta-training costs and memory overhead. Recent attempts have been made to reduce the computational costs of these learned optimizers by introducing a hierarchy that enables them to perform most of the heavy computation at the tensor (layer) level rather than the parameter level. This not only leads to sublinear memory cost with respect to number of parameters, but also allows for a higher representation capacity for efficient learned optimization. To this end, we propose an efficient transformer-based learned optimizer which facilitates communication among tensors with self-attention and keeps track of optimization history with recurrence. We show that our optimizer learns to optimize better than strong learned optimizer baselines at a comparable memory overhead, thereby suggesting encouraging scaling trends. |
Abhinav Moudgil · Boris Knyazev · Guillaume Lajoie · Eugene Belilovsky 🔗 |
-
|
Fixed-Budget Hypothesis Best Arm Identification: On the Information Loss in Experimental Design
(
Poster
)
link »
Experimental design plays a crucial role in evidence-based science with multiple treatment arms, such as online advertisements or medical treatments. This study addresses the task of identifying the best treatment arm, that is, the one with the highest expected outcome among multiple treatment arms. We investigate the influence of available information regarding the distributions of treatment arms in experiments. In our experimental setup, we first designate a hypothetical "best" treatment arm and then conduct an experiment to verify whether this hypothetically best treatment arm is indeed the "true" best treatment arm. Our null hypothesis posits that the hypothetical best treatment is not the actual best, and our objective is to minimize the likelihood of recommending other treatment arms when the null hypothesis is false; in other words, when the true best treatment arm is the same as the hypothetical best treatment. We demonstrate that the optimal experimental design significantly depends on knowledge about distributional information, examined through an information-theoretic approach. Specifically, we discuss worst-case scenarios, characterized by a loss of distributional information, as circumstances in which the gaps between the expected outcomes of the best and sub-optimal treatment arms converge to zero. After discussing asymptotic optimality, we propose an experimental design informed by the available information. |
Masahiro Kato · Masaaki Imaizumi · Takuya Ishihara · Toru Kitagawa 🔗 |
-
|
Unbalanced Optimal Transport meets Sliced-Wasserstein
(
Poster
)
link »
Optimal transport (OT) has emerged as a powerful framework to compare probability measures, a fundamental task in many statistical and machine learning problems. Substantial advances have been made over the last decade in designing OT variants which are either computationally and statistically more efficient, or more robust to the measures/datasets to compare. Among them, sliced OT distances have been extensively used to mitigate optimal transport's cubic algorithmic complexity and curse of dimensionality. In parallel, unbalanced OT was designed to allow comparisons of more general positive measures, while being more robust to outliers. In this paper, we propose to combine these two concepts, namely slicing and unbalanced OT, to develop a general framework for efficiently comparing positive measures. We propose two new loss functions based on the idea of slicing unbalanced OT, and study their induced topology and statistical properties. We then develop a fast Frank-Wolfe-type algorithm to compute these losses, and show that our methodology is modular as it encompasses and extends prior related work. We finally conduct an empirical analysis of our loss functions and methodology on both synthetic and real datasets, to illustrate their relevance and applicability. |
Thibault Sejourne · Clément Bonet · Kilian Fatras · Kimia Nadjahi · Nicolas Courty 🔗 |
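For context on the "slicing" half of the combination above, here is a minimal Monte-Carlo sliced Wasserstein-2 distance between two equal-size empirical measures (the balanced case only; the unbalanced extension proposed in the paper is not shown). The number of projections and the data are illustrative.

```python
import numpy as np

def sliced_wasserstein2(X, Y, n_projections=200, rng=None):
    """Monte-Carlo sliced W2 between empirical measures with equal sample sizes.

    Each random direction reduces the problem to 1-D, where optimal transport
    amounts to sorting; we average the squared 1-D costs over directions.
    """
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    dirs = rng.standard_normal((n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit projection directions
    xp = np.sort(X @ dirs.T, axis=0)                      # projected, sorted samples
    yp = np.sort(Y @ dirs.T, axis=0)
    return np.sqrt(np.mean((xp - yp) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
Y = rng.standard_normal((500, 5)) + 1.0
print(sliced_wasserstein2(X, Y))
```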
-
|
Improved sampling via learned diffusions
(
Poster
)
link »
Recently, a series of papers proposed deep learning-based approaches to sample from unnormalized target densities using controlled diffusion processes. In this work, we identify these approaches as special cases of the Schrödinger bridge problem, seeking the most likely stochastic evolution between a given prior distribution and the specified target, and propose the perspective from measures on path space as a unifying framework. The optimal controls of such entropy-constrained optimal transport problems can then be described by systems of partial differential equations and corresponding backward stochastic differential equations. Building on these optimality conditions and exploiting the path measure perspective, we obtain variational formulations of the respective approaches and recover the objectives which can be approached via gradient descent. Our formulations allow us to introduce losses different from the typically employed reverse Kullback-Leibler divergence that is known to suffer from mode collapse. In particular, we propose the so-called log-variance loss, which exhibits favorable numerical properties and leads to significantly improved performance across all considered approaches. |
Julius Berner · Lorenz Richter · Guan-Horng Liu 🔗 |
-
|
Stability of Multi-Agent Learning: Convergence in Network Games with Many Players
(
Poster
)
link »
The behaviour of multi-agent learning in many-player games has been shown to display complex dynamics outside of restrictive examples such as network zero-sum games. In addition, it has been shown that convergent behaviour is less likely to occur as the number of players increases. To make progress in resolving this problem, we study Q-Learning dynamics and determine a sufficient condition for the dynamics to converge to a unique equilibrium in any network game. We find that this condition depends on the nature of pairwise interactions and on the network structure, but is explicitly independent of the total number of agents in the game. We evaluate this result on a number of representative network games and show that, under suitable network conditions, stable learning dynamics can be achieved with an arbitrary number of agents. |
Aamal Hussain · Dan Leonte · Francesco Belardinelli · Georgios Piliouras 🔗 |
-
|
Limited Information Opponent Modeling
(
Poster
)
link »
The goal of opponent modeling is to model the opponent policy to maximize the reward of the main agent. Most prior works fail to effectively handle scenarios where opponent information is limited. To this end, we propose a Limited Information Opponent Modeling (LIOM) approach that extracts opponent policy representations across episodes using only self-observations. LIOM introduces a novel policy-based data augmentation method that extracts opponent policy representations offline via contrastive learning and incorporates them as additional inputs for training a general response policy. During online testing, LIOM dynamically responds to opponent policies by extracting opponent policy representations from recent historical trajectory data and combining them with the general policy. Moreover, LIOM ensures a lower bound on expected rewards through a balance between conservatism and exploitation. Experimental results demonstrate that LIOM is able to accurately extract opponent policy representations even when the opponent's information is limited, and has a certain degree of generalization ability for unknown policies, outperforming existing opponent modeling algorithms. |
Yongliang Lv · Yan Zheng · jianye Hao 🔗 |
-
|
Game Theoretic Neural ODE Optimizer
(
Poster
)
link »
In this work, we present a novel Game Theoretic Neural Ordinary Differential Equation (Neural ODE) optimizer based on the minimax Differential Dynamic Programming paradigm. As neural networks and neural ODEs tend to be vulnerable to attacks, and their predictions are fragile in the presence of adversarial examples, we aim to design a robust game theoretic optimizer based on principles of Min-Max Optimal Control. By formulating Neural ODE optimization as a Min-Max Optimal Control Problem, our proposed algorithm aims to enhance the robustness of neural networks against adversarial attacks by finding policies that perform well under worst-case scenarios. Leveraging recent advances in the interpretation of Neural ODE training through an Optimal Control Problem perspective, we extend recent second-order optimization techniques to a game theoretic setting and adapt them to our proposed method. This allows our optimizer to efficiently handle the increased complexity stemming from the computation of double the amount of learnable parameters. The resulting optimizer, Game Theoretic Second-Order Neural Optimizer (GTSONO), enables more effective exploration of the control policy space, leading to improved robustness against adversarial attacks. Experimental evaluations on benchmark datasets demonstrate the superiority of GTSONO compared to existing state-of-the-art optimizers in terms of both performance and efficiency against state-of-the-art adversarial defense methods. |
Panagiotis Theodoropoulos · Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou 🔗 |
-
|
A neural RDE approach for continuous-time non-Markovian stochastic control problems
(
Poster
)
link »
We propose a novel framework for solving continuous-time non-Markovian stochastic optimal control problems by means of neural rough differential equations (Neural RDEs) introduced in Morrill et al. (2021). Non-Markovianity naturally arises in control problems due to time delay effects in the system coefficients or the driving noises, which leads to optimal control strategies depending explicitly on the historical trajectories of the system state. By modelling the control process as the solution of a Neural RDE driven by the state process, we show that the control-state joint dynamics are governed by an uncontrolled, augmented Neural RDE, allowing for fast Monte-Carlo estimation of the value function via trajectory simulation and memory-efficient back-propagation. We provide theoretical underpinnings for the proposed algorithmic framework by demonstrating that Neural RDEs serve as universal approximators for functions of random rough paths. Exhaustive numerical experiments on non-Markovian stochastic control problems are presented, which reveal that the proposed framework is time-resolution-invariant and achieves higher accuracy and better stability in irregular sampling compared to existing RNN-based approaches. |
Melker Höglund · Emilio Ferrucci · Camilo Hernández · Aitor Muguruza Gonzalez · Cristopher Salvi · Leandro Sánchez-Betancourt · Yufei Zhang 🔗 |
-
|
On First-Order Meta-Reinforcement Learning with Moreau Envelopes
(
Poster
)
link »
Meta-Reinforcement Learning (MRL) is a promising framework for training agents that can quickly adapt to new environments and tasks. In this work, we study the MRL problem under the policy gradient formulation, where we propose a novel algorithm that uses Moreau envelope surrogate regularizers to jointly learn a meta-policy that is adjustable to the environment of each individual task. Our algorithm, called Moreau Envelope Meta-Reinforcement Learning (MEMRL), learns a meta-policy that can adapt to a distribution of tasks by efficiently updating the policy parameters using a combination of gradient-based optimization and Moreau Envelope regularization. Moreau Envelopes provide a smooth approximation of the policy optimization problem, which enables us to apply standard optimization techniques and converge to an appropriate stationary point. We provide a detailed analysis of the MEMRL algorithm, where we show a sublinear convergence rate to a first-order stationary point for non-convex policy gradient optimization. We finally show the effectiveness of MEMRL on a multi-task 2D-navigation problem. |
Mohammad Taha Toghani · Sebastian Perez-Salazar · Cesar Uribe 🔗 |
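To make the Moreau-envelope mechanism above concrete, here is a minimal sketch that applies it to toy quadratic task losses rather than policy-gradient returns; the regularization weight `lam`, the inner proximal solver, and the tasks themselves are illustrative assumptions and do not reproduce the MEMRL algorithm.

```python
# Minimal sketch (not the authors' code): a Moreau-envelope-style first-order meta-update
# on toy quadratic task losses. lam, inner_steps, and the tasks are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
dim, n_tasks, lam = 5, 8, 0.5
task_centers = rng.normal(size=(n_tasks, dim))          # task i has loss f_i(x) = 0.5 * ||x - c_i||^2

def task_loss_grad(x, c):
    return x - c                                         # gradient of 0.5 * ||x - c||^2

def prox(theta, c, lam, steps=100, lr=0.05):
    """Approximate prox_{lam * f_i}(theta) = argmin_u f_i(u) + ||u - theta||^2 / (2 * lam)."""
    u = theta.copy()
    for _ in range(steps):
        u -= lr * (task_loss_grad(u, c) + (u - theta) / lam)
    return u

theta = np.zeros(dim)                                    # meta-parameters
for it in range(200):
    # Moreau-envelope gradient for each task: (theta - prox_i(theta)) / lam
    grads = [(theta - prox(theta, c, lam)) / lam for c in task_centers]
    theta -= 0.1 * np.mean(grads, axis=0)                # smooth first-order meta-update

print("meta-parameters close to task-center mean:", np.allclose(theta, task_centers.mean(0), atol=1e-2))
```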
-
|
Vector Quantile Regression on Manifolds
(
Poster
)
link »
Quantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on a Euclidean domain. Although the notion of quantiles was recently extended to multi-variate distributions, QR for multi-variate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate measurements), tori (dihedral angles in proteins), or Lie groups (attitude in navigation). By leveraging optimal transport theory and the notion of $c$-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets. We demonstrate the approach's efficacy and provide insights regarding the meaning of non-Euclidean quantiles through preliminary synthetic data experiments.
|
Marco Pegoraro · Sanketh Vedula · Aviv A. Rosenberg · Irene Tallini · Emanuele Rodola · Alexander Bronstein 🔗 |
-
|
Learning with Learning Awareness using Meta-Values
(
Poster
)
link »
Gradient-based learning in multi-agent systems is difficult because the gradient derives from a first-order model which does not account for the interaction between agents' learning processes. LOLA (Foerster et al., 2018) accounts for this by differentiating through one step of optimization. We extend the ideas of LOLA and develop a fully general value-based approach to optimization. At the core is a function we call the meta-value, which at each point in joint-policy space gives for each agent a discounted sum of its objective over future optimization steps. We argue that the gradient of the meta-value gives a more reliable improvement direction than the gradient of the original objective, because the meta-value derives from empirical observations of the effects of optimization. We show how the meta-value can be approximated by training a neural network to minimize TD error along optimization trajectories in which agents follow the gradient of the meta-value. We analyze the behavior of our method on the Logistic Game (Letcher 2018) and on the Iterated Prisoner's Dilemma. |
Tim Cooijmans · Milad Aghajohari · Aaron Courville 🔗 |
-
|
Kernel Mirror Prox and RKHS Gradient Flow for Mixed Functional Nash Equilibrium
(
Poster
)
link »
The theoretical analysis of machine learning algorithms, such as deep generative modeling, motivates multiple recent works on the Mixed Nash Equilibrium (MNE) problem. Different from MNE, this paper formulates the Mixed Functional Nash Equilibrium (MFNE), which replaces one of the measure optimization problems with optimization over a class of dual functions, e.g., the reproducing kernel Hilbert space (RKHS) in the case of Mixed Kernel Nash Equilibrium (MKNE). We show that our MFNE and MKNE frameworks form the backbone that governs several existing machine learning algorithms, such as implicit generative models, distributionally robust optimization (DRO), and Wasserstein barycenters. To model the infinite-dimensional continuous-limit optimization dynamics, we propose the Interacting Wasserstein-Kernel Gradient Flow, which includes the RKHS flow that is much less common than the Wasserstein gradient flow but enjoys a much simpler convexity structure. Time-discretizing this gradient flow, we propose a primal-dual kernel mirror prox algorithm, which alternates between a dual step in the RKHS and a primal step in the space of probability measures. We then provide the first unified convergence analysis of our algorithm for this class of MKNE problems, which establishes a convergence rate of $O(1/N)$ in the deterministic case and $O(1/\sqrt{N})$ in the stochastic case. As a case study, we apply our analysis to DRO, providing the first primal-dual convergence analysis for DRO with probability-metric constraints.
|
Pavel Dvurechenskii · Jia-Jie Zhu 🔗 |
-
|
Simulation-Free Schrödinger Bridges via Score and Flow Matching
(
Poster
)
link »
We present simulation-free score and flow matching ([SF]$^2$M), a simulation-free objective for inferring stochastic dynamics given unpaired source and target samples drawn from arbitrary distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic generative modeling as a Schr\"odinger bridge (SB) problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]$^2$M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, [SF]$^2$M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data.
|
Alexander Tong · Nikolay Malkin · Kilian Fatras · Lazar Atanackovic · Yanlei Zhang · Guillaume Huguet · Guy Wolf · Yoshua Bengio 🔗 |
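As a rough illustration of the minibatch ingredient described above, here is a minimal sketch: couple source and target samples with entropy-regularized OT, then build bridge interpolants as regression inputs. The Sinkhorn routine, the noise level `sigma`, and the simple `x1 - x0` regression target are illustrative stand-ins for the paper's exact [SF]$^2$M objective.

```python
# Minimal sketch (not the authors' implementation): minibatch entropic OT coupling plus
# Brownian-bridge interpolants as training points for a drift/score network.
import numpy as np

rng = np.random.default_rng(0)

def sinkhorn_plan(x0, x1, reg=0.05, iters=200):
    """Entropy-regularized OT plan between two equal-size minibatches with uniform weights."""
    C = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)     # squared-Euclidean cost
    C = C / C.max()                                          # scale costs for numerical stability
    K = np.exp(-C / reg)
    u = np.ones(len(x0)); v = np.ones(len(x1))
    a = np.full(len(x0), 1.0 / len(x0)); b = np.full(len(x1), 1.0 / len(x1))
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                       # coupling matrix

x0 = rng.normal(size=(64, 2))                                # source minibatch
x1 = rng.normal(size=(64, 2)) + np.array([3.0, 0.0])         # target minibatch
plan = sinkhorn_plan(x0, x1)

# Sample index pairs from the coupling, then build bridge-style training points.
flat = plan.ravel() / plan.sum()
idx = rng.choice(plan.size, size=128, p=flat)
i, j = np.unravel_index(idx, plan.shape)
t = rng.uniform(size=(128, 1))
sigma = 0.1
xt = (1 - t) * x0[i] + t * x1[j] + sigma * np.sqrt(t * (1 - t)) * rng.normal(size=(128, 2))
drift_target = x1[j] - x0[i]                                 # illustrative regression target
print(xt.shape, drift_target.shape)
```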
-
|
Latent Space Editing in Transformer-Based Flow Matching
(
Poster
)
link »
This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone offers the potential for scalable and high-quality generative modeling, but its latent structure and editing ability are as yet unknown. We therefore adopt this setting and explore how to edit images through latent space manipulation. We introduce an editing space, which we call $u$-space, that can be manipulated in a controllable, accumulative, and composable manner. Additionally, we propose a tailored sampling solution to enable sampling with the more efficient adaptive step-size ODE solvers. Lastly, we put forth a straightforward yet powerful method for achieving fine-grained and nuanced editing using text prompts. Our framework is simple and efficient, and highly effective at editing images while preserving the essence of the original content. We will provide our source code and include it in the appendix.
|
Tao Hu · David Zhang · Meng Tang · Pascal Mettes · Deli Zhao · Cees Snoek 🔗 |
-
|
Structured State Space Models for In-Context Reinforcement Learning
(
Poster
)
link »
Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful in many reinforcement learning settings. We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks. We show that our modified architecture runs asymptotically faster than Transformers and performs better than LSTM models on a simple memory-based task. Then, by leveraging the model’s ability to handle long-range sequences, we achieve strong performance on a challenging meta-learning task in which the agent is given a randomly-sampled continuous control environment, combined with a randomly-sampled linear projection of the environment's observations and actions. Furthermore, we show the resulting model can adapt to out-of-distribution held-out tasks. Overall, the results presented in this paper suggest that the S4 models are a strong contender for the default architecture used for in-context reinforcement learning. |
Christopher Lu · Yannick Schroecker · Albert Gu · Emilio Parisotto · Jakob Foerster · Satinder Singh · Feryal Behbahani 🔗 |
-
|
Maximum State Entropy Exploration using Predecessor and Successor Representations
(
Poster
)
link »
Animals have a well-developed ability to explore that aids them in important tasks such as locating food, searching for shelter, and finding misplaced items. These exploration skills require keeping track of where they have been, so that they can plan to find items with relative efficiency. Contemporary exploration algorithms often learn a less efficient exploration strategy because they either condition only on the current state or simply rely on making random open-loop exploratory moves. In this work, we propose $\eta\psi$-Learning, a method to learn efficient exploratory policies by conditioning on past episodic experience to make the next exploratory move. Specifically, $\eta\psi$-Learning learns an exploration policy that maximizes the entropy of the state visitation distribution of a single trajectory. Furthermore, we demonstrate how variants of the predecessor representation and successor representations can be combined to predict the state visitation entropy. Our experiments demonstrate the efficacy of the proposed algorithm to strategically explore the environment and maximize the state coverage with limited samples.
|
Arnav Kumar Jain · Lucas Lehnert · Irina Rish · Glen Berseth 🔗 |
-
|
PAC-Bayesian Bounds for Learning LTI-ss systems with Input from Empirical Loss
(
Poster
)
link »
In this paper we derive a Probably Approximately Correct (PAC)-Bayesian error bound for linear time-invariant (LTI) stochastic dynamical systems with inputs. Such bounds are widespread in machine learning, and they are useful for characterizing the predictive power of models learned from finitely many data points. In particular, the bound derived in this paper relates future average prediction errors with the prediction error generated by the model on the data used for learning. In turn, this allows us to provide finite-sample error bounds for a wide class of learning/system identification algorithms. Furthermore, as LTI systems are a sub-class of recurrent neural networks (RNNs), these error bounds could be a first step towards PAC-Bayesian bounds for RNNs. |
Deividas Eringis · john leth · Rafal Wisniewski · Zheng-Hua Tan · Mihaly Petreczky 🔗 |
-
|
Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding
(
Poster
)
link »
A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice. |
Alizée Pace · Hugo Yèche · Bernhard Schölkopf · Gunnar Ratsch · Guy Tennenholtz 🔗 |
-
|
Preventing Reward Hacking with Occupancy Measure Regularization
(
Poster
)
link »
Reward hacking occurs when an agent exploits its specified reward function to behave in undesirable or unsafe ways. Aside from better alignment between the specified reward function and the system designer's intentions, a more feasible proposal to prevent reward hacking is to regularize the learned policy to some safe baseline. Current research suggests that regularizing the learned policy's action distributions to be more similar to those of a safe policy can mitigate reward hacking; however, this approach fails to take into account the disproportionate impact that some actions have on the agent’s state. Instead, we propose a method of regularization based on occupancy measures, which capture the proportion of time each policy is in a particular state-action pair during trajectories. We show theoretically that occupancy-based regularization avoids many drawbacks of action distribution-based regularization, and we introduce an algorithm called ORPO to practically implement our technique. We then empirically demonstrate that occupancy measure-based regularization is superior in both a simple gridworld and a more complex autonomous vehicle control environment. |
Cassidy Laidlaw · Shivam Singhal · Anca Dragan 🔗 |
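The sketch below illustrates the quantity being regularized: discounted state-action occupancy measures in a small random tabular MDP, contrasted with a per-state action-distribution distance. The MDP, the two policies, and the TV distances are illustrative assumptions; this is not the ORPO algorithm itself.

```python
# Minimal sketch (not the ORPO implementation): discounted state-action occupancy measures
# for two policies in a random tabular MDP, compared with an action-distribution distance.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s'] transition kernel
mu0 = np.full(S, 1.0 / S)                           # initial state distribution

def occupancy(pi):
    """Discounted state-action occupancy d(s, a), normalized to sum to 1."""
    P_pi = np.einsum('sa,sap->sp', pi, P)           # state-to-state kernel under pi
    d_state = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d_state[:, None] * pi

pi_safe = rng.dirichlet(np.ones(A), size=S)
pi_learned = pi_safe.copy()
pi_learned[0] = np.array([0.98, 0.01, 0.01])        # change behavior only in state 0

d_safe, d_learned = occupancy(pi_safe), occupancy(pi_learned)
occ_dist = 0.5 * np.abs(d_safe - d_learned).sum()                 # TV between occupancy measures
act_dist = 0.5 * np.abs(pi_safe - pi_learned).sum(-1).mean()      # average per-state action TV
print(f"occupancy TV = {occ_dist:.3f}, per-state action TV = {act_dist:.3f}")
```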
-
|
Regret Bounds for Risk-sensitive Reinforcement Learning with Lipschitz Dynamic Risk Measures
(
Poster
)
link »
We study finite episodic Markov decision processes incorporating dynamic risk measures to capture risk sensitivity. To this end, we present two model-based algorithms applied to \emph{Lipschitz} dynamic risk measures, a broad class of risk measures that subsumes spectral risk measures, optimized certainty equivalents, and distortion risk measures, among others. We establish both regret upper bounds and lower bounds. Notably, our upper bounds demonstrate optimal dependencies on the number of actions and episodes while reflecting the inherent trade-off between risk sensitivity and sample complexity. Additionally, we substantiate our theoretical results through numerical experiments. |
Hao Liang · Zhi-Quan Luo 🔗 |
-
|
AbODE: Ab initio antibody design using conjoined ODEs
(
Poster
)
link »
Antibodies are Y-shaped proteins that neutralize pathogens and constitute the core of our adaptive immune system. De novo generation of new antibodies that target specific antigens holds the key to accelerating vaccine discovery. However, this co-design of the amino acid sequence and the 3D structure subsumes and accentuates some central challenges from multiple tasks, including protein folding (sequence to structure), inverse folding (structure to sequence), and docking (binding). We strive to surmount these challenges with a new generative model AbODE that extends graph PDEs to accommodate both contextual information and external interactions. Unlike existing approaches, AbODE uses a single round of full-shot decoding, and elicits continuous differential attention that encapsulates, and evolves with, latent interactions within the antibody as well as those involving the antigen. We unravel fundamental connections between AbODE and temporal networks as well as graph-matching networks. The proposed model significantly outperforms existing methods on standard metrics across benchmarks. |
Yogesh Verma · Markus Heinonen · Vikas K Garg 🔗 |
-
|
Randomized methods for computing optimal transport without regularization and their convergence analysis
(
Poster
)
link »
The optimal transport (OT) problem can be reduced to a linear programming (LP) problem through discretization. In this paper, we introduce the random block coordinate descent (RBCD) methods to directly solve this LP problem. Our approach involves restricting the potentially large-scale optimization problem to small LP subproblems constructed via randomly chosen working sets. By using a random Gauss-Southwell-$q$ rule to select these working sets, we equip the vanilla method ($\bf \text{RBCD}_0$) with almost sure convergence and a linear convergence rate to solve general standard LP problems. To further improve the efficiency of the $\bf \text{RBCD}_0$ method, we explore the special structure of constraints in the OT problems and propose several approaches for refining the random working set selection and accelerating the vanilla method. Our preliminary numerical experiments demonstrate that the accelerated random block coordinate descent ($\bf \text{ARBCD}$) method is comparable to Sinkhorn's algorithm when seeking solutions with relatively high accuracy, and offers the advantage of saving memory.
|
Yue Xie · Zhongjian Wang · Zhiwen Zhang 🔗 |
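For reference, the sketch below writes out the discretized OT linear program that block coordinate descent methods of this kind decompose into small random subproblems; it solves the full LP with an off-the-shelf solver and does not implement the random working-set selection or the Gauss-Southwell-$q$ rule from the paper.

```python
# Minimal sketch: the discretized OT linear program, solved directly as a reference point.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 20, 25
x = rng.normal(size=(n, 1)); y = rng.normal(size=(m, 1)) + 1.0
a = np.full(n, 1.0 / n); b = np.full(m, 1.0 / m)          # source / target marginals
C = (x - y.T) ** 2                                        # cost matrix

# Variables: the flattened transport plan T (n*m entries), with T >= 0.
# Equality constraints: row sums equal a, column sums equal b.
A_rows = np.kron(np.eye(n), np.ones((1, m)))
A_cols = np.kron(np.ones((1, n)), np.eye(m))
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
plan = res.x.reshape(n, m)
print("OT cost:", res.fun, "max marginal error:", np.abs(plan.sum(1) - a).max())
```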
-
|
Sub-linear Regret in Adaptive Model Predictive Control
(
Poster
)
link »
We consider the problem of adaptive Model Predictive Control (MPC) for uncertain linear systems with additive disturbances and with state and input constraints. We present STT-MPC (Self-Tuning Tube-based Model Predictive Control), an online algorithm that combines the certainty-equivalence principle and polytopic tubes. Specifically, at any given step, STT-MPC infers the system dynamics using the Least Squares Estimator (LSE), and applies a controller obtained by solving an MPC problem using these estimates. Polytopic tubes are used so that, despite the uncertainties, state and input constraints are satisfied, and recursive feasibility and asymptotic stability hold. In this work, we analyze the regret of the algorithm, when compared to an oracle algorithm initially aware of the system dynamics. We establish that the expected regret of STT-MPC does not exceed $O(T^{1/2 + \epsilon})$, where $\epsilon \in (0,1)$ is a design parameter tuning the persistent excitation component of the algorithm. Our result relies on a recently proposed exponential decay of sensitivity property and, to the best of our knowledge, is the first of its kind in this setting. We illustrate the performance of our algorithm using a simple numerical example.
|
Damianos Tranos · Alexandre Proutiere 🔗 |
-
|
Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation
(
Poster
)
link »
We propose a new model, \emph{independent linear Markov game}, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with \emph{independent} linear function approximation, where each agent has its own function approximation for the state-action value functions that are {\it marginalized} by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with \emph{each agent's own function class complexity}, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds that scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle {\it non-stationarity} incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games, with applications in learning in congestion games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in \cite{daskalakis2022complexity}, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters. Furthermore, we design the first provably efficient algorithm for learning Markov CE that breaks the curse of multiagents.
|
Qiwen Cui · Kaiqing Zhang · Simon Du 🔗 |
-
|
Offline Goal-Conditioned RL with Latent States as Actions
(
Poster
)
link »
In the same way that unsupervised pre-training has become the bedrock for computer vision and NLP, goal-conditioned RL might provide a similar strategy for making use of vast quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL, ones that can learn directly from offline data, is challenging because it is hard to accurately estimate the exact state value of reaching faraway goals. Nonetheless, goal-reaching problems exhibit structure – reaching a distant goal entails visiting some closer states (or representations thereof) first. Importantly, it is easier to assess the effect of actions on getting to these closer states. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that predicts (a representation of) a waypoint, and a low-level policy that predicts the action for reaching this waypoint. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data. |
Seohong Park · Dibya Ghosh · Benjamin Eysenbach · Sergey Levine 🔗 |
-
|
Variational quantum dynamics of two-dimensional rotor models
(
Poster
)
link »
We present a simulation method for the dynamics of continuous-variable quantum many-body systems based on neural-network quantum states. The focus is put on dynamics of experimentally relevant two-dimensional quantum rotors. We simulate previously unreachable system sizes and simulation times using a neural-network trial wavefunction in a continuous basis and using modern sampling approaches based on Hamiltonian Monte Carlo. The method is demonstrated to be able to access quantities like the return probability and vorticity oscillations after a quantum quench in two-dimensional systems of up to 64 (8 $\times$ 8) coupled rotors. Our approach can be used for accurate non-equilibrium simulations of continuous systems at previously unexplored system sizes and evolution times, bridging the gap between simulation and experiment.
|
Matija Medvidović · Dries Sels 🔗 |
-
|
Sample Complexity of Hierarchical Decompositions in Markov Decision Processes
(
Poster
)
link »
Hierarchical Reinforcement Learning (HRL) algorithms perform planning at multiple levels of abstraction. Algorithms that leverage state or temporal abstractions have empirically demonstrated a gain in sample efficiency. Yet, the basis of those efficiency gains is not fully understood and we still lack theoretically-grounded design rules to implement HRL algorithms. Here, we derive a lower bound on the sample complexity for the proposed class of goal-conditioned HRL algorithms (such as Dot-2-Dot \cite{beyret2019dot}) that inspires a novel Q-learning algorithm and highlights the relationship between the properties of the decomposition and the sample complexity. Specifically, the proposed lower bound on the sample complexity of such HRL algorithms allows us to quantify the benefits of hierarchical decomposition. These theoretical findings guide the formulation of a simple Q-learning-type algorithm that leverages goal hierarchical decomposition. We then empirically validate our lower bound by investigating the sample complexity of the proposed hierarchical algorithm on a spectrum of tasks. Our tasks were designed to allow us to dial up or down their complexity over multiple orders of magnitude. Our theoretical and algorithmic results provide a clear step towards understanding the foundational question of quantifying the efficiency gains induced by hierarchies in reinforcement learning. |
Arnaud Robert · Ciara Pike-Burke · Aldo Faisal 🔗 |
-
|
Boosting Off-policy RL with Policy Representation and Policy-extended Value Function Approximator
(
Poster
)
link »
Off-policy Reinforcement Learning (RL) is fundamental to realizing intelligent decision-making agents by trial and error. The most notorious issue of off-policy RL is known as the Deadly Triad, i.e., Bootstrapping, Function Approximation, and Off-policy Learning. Despite recent advances in bootstrapping algorithms with better bias control, improvements on the latter two factors are relatively less studied. In this paper, we propose a general off-policy RL algorithm based on policy representation and a policy-extended value function approximator (PeVFA). Orthogonal to better bootstrapping, our improvement is two-fold. On one hand, PeVFA's nature in fitting the value functions of multiple policies according to corresponding low-dimensional policy representations offers preferable function approximation with less interference and better generalization. On the other hand, PeVFA and policy representation allow us to perform off-policy learning in a more general and sufficient manner. Specifically, we perform additional value learning for proximal historical policies along the learning process. This drives the value generalization from learned policies and, in turn, leads to more efficient learning. We evaluate our algorithms on continuous control tasks and the empirical results demonstrate consistent improvements in terms of efficiency and stability. |
Min Zhang · Jianye Hao · Hongyao Tang · Yan Zheng 🔗 |
-
|
Guide Your Agent with Adaptive Multimodal Rewards
(
Poster
)
link »
Recent work has shown that incorporating pre-trained multimodal representations can enhance the ability of an instruction-following agent to generalize to unseen situations. Yet training such agents often requires a dataset consisting of diverse demonstrations, which may not be available for target domains and can be very costly to collect. In this paper, we instead propose to utilize the knowledge captured within large vision-language models for improving the generalization capability of control agents. To this end, we present Multimodal Reward Decision Transformer (MRDT), a simple yet effective method that uses the visual-text alignment score as a reward. This reward, which adapts based on the progress towards achieving the text-specified goals, is used to train a return-conditioned policy that guides the agent towards the desired goals. We also introduce a fine-tuning scheme that adapts pre-trained multimodal models using in-domain data to improve the quality of rewards. Our experiments demonstrate that MRDT significantly improves generalization performance in test environments with unseen goals. Moreover, we introduce new metrics for evaluating the quality of multimodal rewards and show that generalization performance increases as the quality of rewards improves. |
Changyeon Kim · Younggyo Seo · Hao Liu · Lisa Lee · Jinwoo Shin · Honglak Lee · Kimin Lee 🔗 |
-
|
Neural Optimal Transport with Lagrangian Costs
(
Poster
)
link »
Computational efforts in optimal transport traditionally revolve around the squared-Euclidean cost. In this work, we choose to investigate the optimal transport problem between probability measures when the underlying metric space is non-Euclidean, or when the cost function is understood to satisfy a least action principle, also known as a Lagrangian cost. These two generalizations are useful when connecting observations from a physical system, where the transport dynamics are influenced by the geometry of the system, such as obstacles, and allow practitioners to incorporate a priori knowledge of the underlying system. Examples include barriers for transport, or enforcing a certain geometry, i.e., paths must be circular. We demonstrate the effectiveness of this formulation on existing synthetic examples in the literature, where we solve the optimal transport problems in the absence of regularization, which is novel in the literature. Our contributions are of computational interest, where we demonstrate the ability to efficiently compute geodesics and amortize spline-based paths. |
Aram-Alexandre Pooladian · Carles Domingo i Enrich · Ricky T. Q. Chen · Brandon Amos 🔗 |
-
|
Deep Equilibrium Based Neural Operators for Steady-State PDEs
(
Poster
)
link »
Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We seek to remedy this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. To achieve this, we first demonstrate that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point resulting in $\mathcal{O}(1)$ training memory. Our experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters in predicting the solution to steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. Finally, we show FNO-DEQ is more robust when trained with datasets with more noisy observations than the FNO-based baselines, demonstrating the benefits of using appropriate inductive biases in architectural design for different neural network based PDE solvers. Further, we show a universal approximation result that demonstrates that FNO-DEQ can approximate the solution to any steady-state PDE that can be written as a fixed point equation.
|
Tanya Marwah · Ashwini Pokle · Zico Kolter · Zachary Lipton · Jianfeng Lu · Andrej Risteski 🔗 |
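A minimal sketch of the weight-tied fixed-point idea described above: a tiny equilibrium layer is iterated to convergence without tracking gradients, and only the final step is differentiated, which is a cheap surrogate for the exact implicit differentiation and black-box root solving used by FNO-DEQ. The layer here is a generic MLP, not a Fourier neural operator.

```python
# Minimal sketch of a deep-equilibrium layer: fixed-point forward pass, cheap backward pass.
import torch, torch.nn as nn

class TinyDEQ(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.lin_z = nn.Linear(dim, dim)
        self.lin_x = nn.Linear(dim, dim)

    def f(self, z, x):
        return torch.tanh(self.lin_z(z) + self.lin_x(x))

    def forward(self, x, iters=50):
        z = torch.zeros_like(x)
        with torch.no_grad():                      # find the fixed point without tracking gradients
            for _ in range(iters):
                z = self.f(z, x)
        return self.f(z, x)                        # one differentiable step at the fixed point

model = TinyDEQ()
x = torch.randn(8, 32)
target = torch.randn(8, 32)
loss = ((model(x) - target) ** 2).mean()
loss.backward()                                    # memory does not grow with solver iterations
print(loss.item())
```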
-
|
Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL
(
Poster
)
link »
Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight that leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find TSRL achieves great performance on small benchmark datasets with as few as 1\% of the original samples, which significantly outperforms the recent offline RL algorithms in terms of data efficiency and generalizability. |
PENG CHENG · Xianyuan Zhan · Zhihao Wu · Wenjia Zhang · Youfang Lin · Shou cheng Song · Han Wang 🔗 |
-
|
Nonlinear Wasserstein Distributionally Robust Optimal Control
(
Poster
)
link »
This paper presents a novel approach to addressing the distributionally robust nonlinear model predictive control (DRNMPC) problem. Current literature primarily focuses on the static Wasserstein distributionally robust optimal control problem with a prespecified ambiguity set of uncertain system states. Although a few studies have tackled the dynamic setting, a practical algorithm remains elusive. To bridge this gap, we introduce a DRNMPC scheme that dynamically controls the propagation of ambiguity, based on the constrained iterative linear quadratic regulator. Theoretical results are also provided to characterize the stochastic error reachable sets under ambiguity. We evaluate the effectiveness of our proposed iterative DRNMPC algorithm by comparing closed-loop (feedback) and open-loop performance on a mass-spring system, and demonstrate in numerical experiments that our algorithm controls the propagated Wasserstein ambiguity. |
Zhengang Zhong · Jia-Jie Zhu 🔗 |
-
|
Trajectory Generation, Control, and Safety with Denoising Diffusion Probabilistic Models
(
Poster
)
link »
We present a framework for safety-critical optimal control of physical systems based on denoising diffusion probabilistic models (DDPMs). The technology of control barrier functions (CBFs), encoding desired safety constraints, is used in combination with DDPMs to plan actions by iteratively denoising trajectories through a CBF-based guided sampling procedure. At the same time, the generated trajectories are also guided to maximize a future cumulative reward representing a specific task to be optimally executed. The proposed scheme can be seen as an offline and model-based reinforcement learning algorithm resembling, in its functionality, a receding-horizon model-predictive control optimization scheme in which the selected actions lead to optimal and safe trajectories. |
Nicolò Botteghi · Federico Califano · University Twente · Christoph Brune 🔗 |
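A toy sketch of the barrier-guided sampling idea above: a stand-in denoiser replaces a trained DDPM, and each reverse step is nudged by the gradient of a smooth penalty on a control-barrier-style constraint $h(x) \ge 0$. The denoiser, the constraint, and the guidance scale are all illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of constraint-guided denoising with a toy stand-in for a trained denoiser.
import torch

def toy_denoiser(x, t):
    # Stand-in for a pretrained DDPM reverse step: pull samples toward a goal with some noise.
    goal = torch.tensor([2.0, 0.0])
    return x + 0.1 * (goal - x) + 0.05 * (t / 50.0) * torch.randn_like(x)

def barrier_penalty(x, center=torch.tensor([1.0, 0.0]), radius=0.5):
    h = ((x - center) ** 2).sum(-1) - radius ** 2        # CBF-style constraint: h(x) >= 0 is safe
    return torch.nn.functional.softplus(-h).sum()        # smooth penalty on constraint violation

x = torch.randn(16, 2) * 2.0                             # initial noisy samples
guidance_scale = 0.5
for t in reversed(range(50)):
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(barrier_penalty(x), x)[0]
    with torch.no_grad():
        x = toy_denoiser(x, t) - guidance_scale * grad   # denoise, then apply safety guidance
print("min distance to obstacle:", ((x - torch.tensor([1.0, 0.0])) ** 2).sum(-1).sqrt().min().item())
```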
-
|
Coupled Gradient Flows for Strategic Non-Local Distribution Shift
(
Poster
)
link »
We propose a novel framework for analyzing the dynamics of distribution shift in real-world systems that captures the feedback loop between learning algorithms and the distributions on which they are deployed. Prior work largely models feedback-induced distribution shift as adversarial or via an overly simplistic distribution-shift structure. In contrast, we propose a coupled partial differential equation model that captures fine-grained changes in the distribution over time by accounting for complex dynamics that arise due to strategic responses to algorithmic decision-making, non-local endogenous population interactions, and other exogenous sources of distribution shift. We consider two common settings in machine learning: cooperative settings with information asymmetries, and competitive settings where a learner faces strategic users. For both of these settings, when the algorithm retrains via gradient descent, we prove asymptotic convergence of the retraining procedure to a steady-state, both in finite and in infinite dimensions, obtaining explicit rates in terms of the model parameters. To do so, we derive new results on the convergence of coupled PDEs that extend what is known about multi-species systems. Empirically, we show that our approach captures well-documented forms of distribution shifts like polarization and disparate impacts that simpler models cannot capture. |
Lauren Conger · Franca Hoffmann · Eric Mazumdar · Lillian Ratliff 🔗 |
-
|
Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations
(
Poster
)
link »
In real-world reinforcement learning (RL) systems, various forms of impaired observability can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds, of the form $\tilde{\mathcal{O}}(\sqrt{{\rm poly}(H) SAK})$, for RL in both the delayed and missing observation settings. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability.
|
Minshuo Chen · Yu Bai · H. Vincent Poor · Mengdi Wang 🔗 |
-
|
LEAD: Min-Max Optimization from a Physical Perspective
(
Poster
)
link »
Adversarial formulations have rekindled interest in two-player min-max games. A central obstacle in the optimization of such games is the rotational dynamics that hinder their convergence. In this paper, we show that game optimization shares dynamic properties with particle systems subject to multiple forces, and one can leverage tools from physics to improve optimization dynamics. Inspired by the physical framework, we propose LEAD, an optimizer for min-max games. Next, using Lyapunov stability theory from dynamical systems as well as spectral analysis, we study LEAD’s convergence properties in continuous and discrete time settings for a class of quadratic min-max games to demonstrate linear convergence to the Nash equilibrium. Finally, we empirically evaluate our method on synthetic setups and CIFAR-10 image generation to demonstrate improvements in GAN training. |
Reyhane Askari Hemmat · Amartya Mitra · Guillaume Lajoie · Ioannis Mitliagkas 🔗 |
-
|
Stochastic Linear Bandits with Unknown Safety Constraints and Local Feedback
(
Poster
)
link »
In many real-world decision-making tasks, e.g. clinical trials, the agents must satisfy a diverse set of unknown safety constraints at all times while getting feedback only on the safety constraints relevant to the chosen action, e.g. the ones close to violation. In this work, we study stochastic linear bandits with such unknown safety constraints and local safety feedback. The agent's goal is to maximize the cumulative reward while satisfying \textit{multiple unknown affine or nonlinear} safety constraints. At each time step, the agent receives noisy feedback on a particular safety constraint \textit{only if} the chosen action belongs to the associated constraint set, i.e. local safety feedback. For this setting, we design upper confidence bound and Thompson Sampling-based algorithms. In the design of these algorithms, we carefully prescribe an additional exploration incentive that guarantees the selection of high-reward actions that are also safe and ensures sufficient exploration in the relevant constraint sets to recover the optimal safe action. We show that for $M$ distinct constraints, both of these algorithms attain $\tilde{\mathcal{O}}(\sqrt{MT})$ regret after $T$ time steps without any safety violations. We empirically study the performance of the proposed algorithms under various safety constraints and with a real-world credit dataset. We show that both algorithms safely explore and quickly recover the optimal safe actions.
|
Nithin Varma · Sahin Lale · Anima Anandkumar 🔗 |
-
|
Distributional Distance Classifiers for Goal-Conditioned Reinforcement Learning
(
Poster
)
link »
What does it mean to find the shortest path in stochastic environments, where every strategy has a non-zero probability of failing? At the core of this question is a conflict between two seemingly-natural notions of planning: maximizing the probability of reaching a goal state, and minimizing the expected number of steps to reach that goal state. Reinforcement learning (RL) methods based on minimizing the steps to a goal make an implicit assumption: that the goal is always reached, at least within some finite horizon. This assumption is violated in practical settings and can lead to very suboptimal strategies. In this paper, we bridge the gap between these two notions of planning by estimating the probability of reaching the goal at different horizons. This is not the same as estimating the distance to the goal -- rather, probabilities convey uncertainty in ever reaching the goal at all. We then propose an algorithm for estimating these probabilities. The update rule resembles distributional RL but is used to solve (reward-free) goal-reaching tasks rather than (single) reward-maximization tasks. Taken together, we believe that our results provide a cogent framework for thinking about probabilities and distances in stochastic settings, along with a practical and effective algorithm for solving goal-reaching problems in many settings. |
Ravi Tej Akella · Benjamin Eysenbach · Jeff Schneider · Ruslan Salakhutdinov 🔗 |
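The sketch below computes, by plain Monte Carlo in a toy stochastic chain, the quantity the paper estimates with a learned classifier: the probability of having reached the goal by each horizon. The chain, its failure probability, and the horizon are illustrative assumptions.

```python
# Minimal sketch: Monte Carlo estimate of P(goal reached by horizon h) in a stochastic chain
# where every step can fail into an absorbing state, so the goal is never reached with certainty.
import numpy as np

rng = np.random.default_rng(0)
goal, fail_prob, horizon = 5, 0.05, 30

def rollout():
    s, reached = 0, np.zeros(horizon, dtype=bool)
    for t in range(horizon):
        if s == goal:
            reached[t:] = True                   # goal reached at step t and onwards
            break
        if s == -1:                              # absorbing failure state
            break
        s = -1 if rng.random() < fail_prob else s + 1
    return reached

reach_by_horizon = np.mean([rollout() for _ in range(5000)], axis=0)
print("P(reached goal by horizon h):")
for h in [5, 10, 20, 29]:
    print(f"  h = {h:2d}: {reach_by_horizon[h]:.3f}")
```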
-
|
Taylorformer: Probabilistic Modelling for Random Processes including Time Series
(
Poster
)
link »
We propose the Taylorformer for random processes such as time series. Its two key components are: 1) the LocalTaylor wrapper which adapts Taylor approximations (used in dynamical systems) for use in neural network-based probabilistic models, and 2) the MHA-X attention block which makes predictions in a way inspired by how Gaussian Processes' mean predictions are linear smoothings of contextual data. Taylorformer outperforms the state-of-the-art in terms of log-likelihood on 5/6 classic Neural Process tasks such as meta-learning 1D functions, and has at least a 14\% MSE improvement on forecasting tasks, including electricity, oil temperatures and exchange rates. Taylorformer approximates a consistent stochastic process and provides uncertainty-aware predictions. Our code is provided in the supplementary material. |
Omer Nivron · Raghul Parthipan · Damon Wischik 🔗 |
-
|
Policy Gradient Algorithms Implicitly Optimize by Continuation
(
Poster
)
link »
Direct policy optimization in reinforcement learning is usually solved with policy-gradient algorithms, which optimize policy parameters via stochastic gradient ascent. This paper provides a new theoretical interpretation and justification of these algorithms. First, we formulate direct policy optimization in the optimization by continuation framework. The latter is a framework for optimizing nonconvex functions where a sequence of surrogate objective functions, called continuations, are locally optimized. Second, we show that optimizing affine Gaussian policies and performing entropy regularization can be interpreted as implicitly optimizing deterministic policies by continuation. Based on these theoretical results, we argue that exploration in policy-gradient algorithms consists in computing a continuation of the return of the policy at hand, and that the variance of policies should be history-dependent functions adapted to avoid local extrema rather than to maximize the return of the policy. |
Adrien Bolland · Gilles Louppe · Damien Ernst 🔗 |
-
|
Randomly Coupled Oscillators for Time Series Processing
(
Poster
)
link »
We investigate a physically-inspired recurrent neural network derived from a continuous-time ODE modelling a network of coupled oscillators. Enthralled by the Reservoir Computing paradigm, we introduce the Randomly Coupled Oscillators (RCO) model, which leverages an untrained recurrent component with a smart random initialization. We analyse the architectural bias of RCO and its neural dynamics. We derive sufficient conditions for the model to have a unique asymptotically uniformly stable input-driven solution. We also derive necessary conditions for stability, that permit to push the system of oscillators slightly beyond the edge of stability. We empirically assess the effectiveness of RCO in terms of its stability and its long-term memory properties. We compare its performance against both fully-trained and randomized recurrent models in a number of time series processing tasks. We find that RCO provides an excellent trade-off between robust long-term memory properties and ability to predict the behavior of non-linear, chaotic systems. |
Andrea Ceni · Andrea Cossu · Jingyue Liu · Maximilian Stölzle · Cosimo Della Santina · Claudio Gallicchio · Davide Bacciu 🔗 |
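An illustrative sketch of the untrained-oscillator idea above: a randomly coupled, damped second-order network is driven by the input, its states are collected, and only a ridge-regression readout is fit. The discretization and the random initialization here are generic reservoir-computing choices, not the specific RCO initialization analyzed in the paper.

```python
# Minimal sketch of a coupled-oscillator reservoir with an untrained recurrent part
# and a closed-form ridge readout, on a one-step-ahead prediction task.
import numpy as np

rng = np.random.default_rng(0)
n_units, dt, gamma, eps = 100, 0.05, 1.0, 0.5
W = rng.normal(scale=1.0 / np.sqrt(n_units), size=(n_units, n_units))   # random coupling (untrained)
W_in = rng.normal(scale=1.0, size=(n_units,))

def reservoir_states(u):
    y = np.zeros(n_units); z = np.zeros(n_units); states = []
    for u_t in u:
        # y'' = tanh(W y + W_in u) - gamma * y - eps * y'  (semi-implicit Euler discretization)
        z = z + dt * (np.tanh(W @ y + W_in * u_t) - gamma * y - eps * z)
        y = y + dt * z
        states.append(y.copy())
    return np.array(states)

t = np.arange(0, 60, 0.1)
signal = np.sin(t) + 0.05 * rng.normal(size=len(t))      # noisy sine wave
X = reservoir_states(signal[:-1])
Y = signal[1:]
ridge = 1e-4
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_units), X.T @ Y)     # ridge readout
pred = X @ W_out
print("train MSE:", np.mean((pred - Y) ** 2))
```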
-
|
On a Connection between Differential Games, Optimal Control, and Energy-based Models for Multi-Agent Interactions
(
Poster
)
link »
Game theory offers an interpretable mathematical framework for modeling multi-agent interactions. However, its applicability in real-world robotics applications is hindered by several challenges, such as unknown agents' preferences and goals. To address these challenges, we establish a connection between differential games, optimal control, and energy-based models and demonstrate how existing approaches can be unified under our proposed Energy-based Potential Game formulation. Building upon this formulation, this work introduces a new end-to-end learning application that combines neural networks for game-parameter inference with a differentiable game-theoretic optimization layer, acting as an inductive bias. The experiments using simulated mobile robot pedestrian interactions and real-world automated driving data provide empirical evidence that the game-theoretic layer improves the predictive performance of various neural network backbones. |
Christopher Diehl · Tobias Klosek · Martin Krueger · Nils Murzyn · Torsten Bertram 🔗 |
-
|
Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning
(
Poster
)
link »
We propose A-Crab (Actor-Critic Regularized by Average Bellman error), a new algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage. Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm, where the critic returns evaluations of the actor (policy) that are pessimistic relative to the offline data and have a small average (importance-weighted) Bellman error. Compared to existing methods, our algorithm simultaneously offers a number of advantages: (1) It achieves the optimal statistical rate of $1/\sqrt{N}$---where $N$ is the size of the offline dataset---in converging to the best policy covered in the offline dataset, even when combined with general function approximators. (2) It relies on a weaker *average* notion of policy coverage (compared to the $\ell_\infty$ single-policy concentrability) that exploits the structure of policy visitations. (3) It outperforms the data-collection behavior policy over a wide range of specific hyperparameters.
|
Hanlin Zhu · Paria Rashidinejad · Jiantao Jiao 🔗 |
-
|
Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Markov Chains
(
Poster
)
link »
Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g. EBMs). The idea is to fit the score of the distribution (i.e. $\nabla_x \log p(x)$), rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there's a clear algorithmic benefit, the statistical "cost" can be steep: recent work by Koehler et al. '23 showed that for distributions that have poor isoperimetric properties (a large Poincaré or log-Sobolev constant), score matching is substantially statistically less efficient than maximum likelihood. However, many natural realistic distributions, e.g. multimodal distributions as simple as a mixture of two Gaussians---even in one dimension---have a poor Poincaré constant. In this paper, we show a close connection between the mixing time of an arbitrary Markov process with generator $\mathcal{L}$ and a generalized score matching loss that tries to fit $\frac{\mathcal{O} p}{p}$. We instantiate this framework with several examples. In the special case of $\mathcal{O} = \nabla_x$, and $\mathcal{L}$ being the generator of Langevin diffusion, this generalizes and recovers the results from Koehler et al. '23. If $\mathcal{L}$ corresponds to a Markov process corresponding to a continuous version of simulated tempering, we show the corresponding generalized score matching loss is a Gaussian-convolution annealed score matching loss, akin to the one proposed in Song-Ermon '19. Moreover, we show that if the distribution being learned is a mixture of $K$ Gaussians in $d$ dimensions, the sample complexity of annealed score matching is polynomial in $d$ and $K$ --- obviating the Poincaré constant-based lower bounds of the basic score matching loss shown in Koehler et al. This is the first result characterizing the benefits of annealing for score matching---a crucial component in more sophisticated score-based approaches like Song-Ermon '19.
|
Yilong Qin · Andrej Risteski 🔗 |
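A minimal sketch of the Gaussian-convolution-annealed score matching loss discussed above, in the style of Song-Ermon '19, on a 1D bimodal mixture: a small network conditioned on the noise level is trained with the standard denoising score matching identity. The noise schedule, architecture, and training length are illustrative assumptions.

```python
# Minimal sketch of annealed (denoising) score matching on a 1D two-component Gaussian mixture.
import torch, torch.nn as nn

torch.manual_seed(0)

def sample_mixture(n):                                   # bimodal target with a poor Poincare constant
    comp = torch.randint(0, 2, (n, 1)).float()
    return comp * 4.0 - 2.0 + 0.3 * torch.randn(n, 1)

sigmas = torch.tensor([1.0, 0.5, 0.1])                   # annealing noise levels
net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))   # inputs: (x, sigma)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = sample_mixture(256)
    sigma = sigmas[torch.randint(0, len(sigmas), (256, 1))]
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    score = net(torch.cat([x_noisy, sigma], dim=1))
    # Denoising score matching: the conditional score of x_noisy given x equals -noise / sigma.
    loss = ((sigma * score + noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("final loss:", loss.item())
```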
-
|
IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive Control
(
Poster
)
link »
Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that model-based RL agents struggle in these environments due to a lack of long-term planning capabilities, and that planning in a temporally abstract model of the environment can alleviate this issue. In this paper, we make two key contributions: 1) we introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL); 2) we propose to use IQL-TD-MPC as a Manager in a hierarchical setting with any off-the-shelf offline RL algorithm as a Worker. More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning. We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents' performance on some of the most challenging D4RL benchmark tasks. For instance, the offline RL algorithms AWAC, TD3-BC, DT, and CQL all get zero or near-zero normalized evaluation scores on the medium and large antmaze tasks, while our modification gives an average score over 40. |
Yingchen Xu · Rohan Chitnis · Bobak Hashemi · Lucas Lehnert · Urun Dogan · Zheqing Zhu · Olivier Delalleau 🔗 |
-
|
Parallel Sampling of Diffusion Models
(
Poster
)
link »
Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score. |
Andy Shih · Suneel Belkhale · Stefano Ermon · Dorsa Sadigh · Nima Anari 🔗 |
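A toy sketch of the Picard-iteration idea above: instead of stepping an ODE sequentially, a whole trajectory is refined sweep by sweep, with every drift evaluation in a sweep depending only on the previous iterate and therefore batchable. A simple linear ODE stands in for a diffusion model's sampler; none of the ParaDiGMS scheduling logic is reproduced.

```python
# Minimal sketch: Picard iterations refine an entire Euler-discretized trajectory at once,
# so all drift evaluations within a sweep can be computed in parallel.
import numpy as np

def drift(x, t):
    return -x                                  # toy drift; exact solution is x0 * exp(-t)

T, n_steps, x0 = 1.0, 100, 1.0
dt = T / n_steps
ts = np.linspace(0.0, T, n_steps + 1)

# Sequential Euler reference (n_steps dependent evaluations).
x_seq = x0
for t in ts[:-1]:
    x_seq = x_seq + dt * drift(x_seq, t)

# Picard iterations: each sweep evaluates the drift at every time step simultaneously.
traj = np.full(n_steps + 1, x0)
for k in range(30):
    f = drift(traj[:-1], ts[:-1])              # batched drift evaluations (parallelizable)
    traj = x0 + np.concatenate([[0.0], np.cumsum(dt * f)])

print("sequential:", x_seq, "picard:", traj[-1], "exact:", x0 * np.exp(-T))
```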
-
|
Fast Approximation of the Generalized Sliced-Wasserstein Distance
(
Poster
)
link »
Generalized sliced-Wasserstein distance is a variant of sliced-Wasserstein distance that exploits the power of non-linear projection through a given defining function to better capture the complex structures of probability distributions. Similar to the sliced-Wasserstein distance, generalized sliced-Wasserstein is defined as an expectation over random projections which can be approximated by the Monte Carlo method. However, the complexity of that approximation can be expensive in high-dimensional settings. To that end, we propose to form deterministic and fast approximations of the generalized sliced-Wasserstein distance by using the concentration of random projections when the defining functions are polynomial functions or neural network-type functions. Our approximations hinge upon an important result that one-dimensional projections of a high-dimensional random vector are approximately Gaussian. |
Dung Le · Huy Nguyen · Khai Nguyen · Nhat Ho 🔗 |
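For context, here is a sketch of the Monte Carlo estimator that the proposed deterministic approximations replace: samples are pushed through a nonlinear defining function (here, illustratively, degree-3 monomial features), projected onto random directions, and compared via sorted 1D Wasserstein distances.

```python
# Minimal sketch of a Monte Carlo generalized sliced-Wasserstein estimate with a
# polynomial-feature defining function (an illustrative choice of defining function).
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(0)

def poly_features(x, degree=3):
    feats = [x]
    for d in range(2, degree + 1):
        feats.append(np.stack([np.prod(x[:, idx], axis=1)
                               for idx in combinations_with_replacement(range(x.shape[1]), d)], axis=1))
    return np.concatenate(feats, axis=1)

def gsw_monte_carlo(x, y, n_proj=200, degree=3):
    fx, fy = poly_features(x, degree), poly_features(y, degree)
    thetas = rng.normal(size=(n_proj, fx.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    px, py = fx @ thetas.T, fy @ thetas.T                      # (n_samples, n_proj) 1D projections
    # 1D Wasserstein-2 between equal-size empirical distributions: sort and compare order statistics.
    w2 = np.mean((np.sort(px, axis=0) - np.sort(py, axis=0)) ** 2, axis=0)
    return np.sqrt(w2.mean())

x = rng.normal(size=(500, 3))
y = rng.normal(size=(500, 3)) + 0.5
print("GSW estimate:", gsw_monte_carlo(x, y))
```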
-
|
A Policy-Decoupled Method for High-Quality Data Augmentation in Offline Reinforcement Learning
(
Poster
)
link »
Offline reinforcement learning (ORL) has gained attention as a means of training reinforcement learning models using pre-collected static data. To address the issue of limited data and improve downstream ORL performance, recent work has attempted to expand the dataset's coverage through data augmentation. However, most of these methods are tied to a specific policy (policy-dependent), where the generated data are only guaranteed to support the current downstream ORL policy, limiting their use with other downstream policies. Moreover, the quality of synthetic data is often not well-controlled, which limits the potential for further improving the downstream policy. To tackle these issues, we propose HIgh-quality POlicy-DEcoupled (HIPODE), a novel data augmentation method for ORL. On the one hand, HIPODE generates high-quality synthetic data by selecting states near the dataset distribution with potentially high value among candidate states using the negative sampling technique. On the other hand, HIPODE is policy-decoupled and can thus be used as a common plug-in method for any downstream ORL process. We conduct experiments on the widely studied TD3BC and CQL algorithms, and the results show that HIPODE outperforms the state-of-the-art policy-decoupled data augmentation method and most prevalent model-based ORL methods on D4RL benchmarks. |
Shixi Lian · Yi Ma · Jinyi Liu · Jianye Hao · Yan Zheng · Zhaopeng Meng 🔗 |
-
|
On the Generalization Capacities of Neural Controlled Differential Equations
(
Poster
)
link »
We consider a supervised learning setup in which the goal is to predict an outcome from a sample of irregularly sampled time series using Neural Controlled Differential Equations (Kidger, Morrill, et al. 2020). In our framework, the time series is a discretization of an unobserved continuous path, and the outcome depends on this path through a controlled differential equation with unknown vector field. Learning with discrete data thus induces a discretization bias, which we precisely quantify. Using theoretical results on the continuity of the flow of controlled differential equations, we show that the approximation bias is directly related to the approximation error of a Lipschitz function defining the generative model by a shallow neural network. By combining these results with recent work linking the Lipschitz constant of neural networks to their generalization capacities, we upper bound the generalization gap between the expected loss attained by the empirical risk minimizer and the expected loss of the true predictor. |
Linus Bleistein · Agathe Guilloux 🔗 |
-
|
Factor Learning Portfolio Optimization Informed by Continuous-Time Finance Models
(
Poster
)
link »
We study financial portfolio optimization in the presence of unknown and uncontrolled system variables referred to as stochastic factors. Existing work falls into two distinct categories: (i) reinforcement learning employs end-to-end policy learning with flexible factor representation, but does not precisely model the dynamics of asset prices or factors; (ii) continuous-time finance methods, in contrast, take advantage of explicitly modeled dynamics but pre-specify, rather than learn, factor representation. We propose FaLPO (factor learning portfolio optimization), a framework that interpolates between these two approaches. Specifically, FaLPO hinges on deep policy gradient to learn a performant investment policy that takes advantage of flexible representation for stochastic factors. Meanwhile, FaLPO also incorporates continuous-time finance models when modeling the dynamics. It uses the optimal policy functional form derived from such models and optimizes an objective that combines policy learning and model calibration. We prove the convergence of FaLPO and provide performance guarantees via a finite-sample bound. On both synthetic and real-world portfolio optimization tasks, we observe that FaLPO outperforms five leading methods. Finally, we show that FaLPO can be extended to other decision-making problems with stochastic factors. |
Sinong Geng · Houssam Nassif · Zhaobin Kuang · A. Max Reppen · Ronnie Sircar 🔗 |
-
|
Modular Hierarchical Reinforcement Learning for Robotics: Improving Scalability and Generalizability
(
Poster
)
link »
We present a novel software architecture for reinforcement learning applied to robotics that emphasizes modularity and reusability. Our method treats each agent as a plug-and-play ROS node that can be easily integrated into a larger HRL system, similar to using software libraries in programming. This modular approach improves the scalability and generalizability of pre-trained reinforcement learning agents. We demonstrate the effectiveness of our method by solving the real-world task of stacking three objects with two different robots that were trained only in simulation. Our results show that the modular approach significantly reduces the training and setup time required compared to a vanilla reinforcement learning baseline. Overall, our work showcases the potential of using trained agents as modules to enable the development of more complex and adaptable robotics applications. |
Mihai Anca · Mark Hansen · Matthew Studley 🔗 |
-
|
Parameterized projected Bellman operator
(
Poster
)
link »
The Bellman operator is a cornerstone of reinforcement learning (RL), widely used from traditional value-based methods to modern actor-critic approaches. In problems with unknown models, the Bellman operator is estimated via transition samples that strongly determine its behavior, as uninformative samples can result in negligible updates or long detours before reaching the fixed point. In this paper, we introduce the novel idea of an operator that acts on the parameters of action-value function approximators. Our novel operator can obtain a sequence of action-value function parameters that progressively approaches those of the optimal action-value function. This means that we merge the traditional two-step procedure consisting of applying the Bellman operator and subsequently projecting onto the space of action-value functions. For this reason, we call our novel operator the projected Bellman operator (PBO). We formulate an optimization problem to learn PBOs for generic sequential decision-making problems, and we analyze the properties of PBO in two representative classes of RL problems. Furthermore, we study the use of PBO under the lens of the approximate value iteration framework, devising algorithmic implementations to learn PBOs in both offline and online settings by resorting to neural network regression. Finally, we empirically show how PBO can overcome the limitations of classical methods, opening up new research directions as a novel paradigm in RL. |
Théo Vincent · Alberto Maria Metelli · Jan Peters · Marcello Restelli · Carlo D'Eramo 🔗 |
-
|
Improving Offline-to-Online Reinforcement Learning with Q-Ensembles
(
Poster
)
link »
Offline reinforcement learning (RL) is a learning paradigm where an agent learns from a fixed dataset of experience. However, learning solely from a static dataset can limit performance due to the lack of exploration. To overcome this limitation, offline-to-online RL combines offline pre-training with online fine-tuning, which enables the agent to further refine its policy by interacting with the environment in real time. Despite its benefits, existing offline-to-online RL methods suffer from performance degradation and slow improvement during the online phase. To tackle these challenges, we propose a novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance. Moreover, to expedite online performance enhancement, we appropriately loosen the pessimism of Q-value estimation and incorporate ensemble-based exploration mechanisms into our framework. Experimental results demonstrate that E2O can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods. |
Kai Zhao · Yi Ma · Jinyi Liu · Jianye Hao · Yan Zheng · Zhaopeng Meng 🔗 |
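The ensemble idea in the abstract above is commonly implemented as a set of independently initialized Q-networks whose minimum (or a loosened variant such as a random subset or mean-minus-std) serves as the pessimistic value estimate. The sketch below shows this generic building block in PyTorch; class and parameter names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    """N independent Q-networks; the pessimistic value is their minimum.
    Loosening the pessimism (e.g., a mean-minus-std estimate or a random
    subset of networks) is one way to speed up online fine-tuning, as the
    abstract discusses."""
    def __init__(self, obs_dim, act_dim, n_nets=10, hidden=256):
        super().__init__()
        self.nets = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_nets)
        ])

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return torch.stack([net(x) for net in self.nets], dim=0)  # (N, B, 1)

    def pessimistic_q(self, obs, act):
        # Elementwise minimum over the ensemble dimension.
        return self.forward(obs, act).min(dim=0).values  # (B, 1)

# Usage: a pessimistic value estimate, e.g. for an offline Bellman target.
ens = QEnsemble(obs_dim=17, act_dim=6)
obs, act = torch.randn(32, 17), torch.randn(32, 6)
target = ens.pessimistic_q(obs, act)
```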
-
|
Model-based Policy Optimization under Approximate Bayesian Inference
(
Poster
)
link »
Model-based reinforcement learning (MBRL) algorithms present an exceptional potential to enhance sample efficiency within the realm of online reinforcement learning (RL). Nevertheless, a substantial proportion of prevalent MBRL algorithms fail to adequately address the dichotomy of exploration and exploitation. Posterior sampling reinforcement learning (PSRL) emerges as an innovative strategy adept at balancing exploration and exploitation, albeit its theoretical assurances are contingent upon exact inference. In this paper, we show that adopting the same methodology as in exact PSRL can be suboptimal under approximate inference. Motivated by the analysis, we propose an improved factorization for the posterior distribution of policies by removing the conditional independence between the policy and data given the model. By adopting such a posterior factorization, we further propose a general algorithmic framework for PSRL under approximate inference and a practical instantiation of it. Empirically, our algorithm can surpass baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DeepMind Control Suite, OpenAI Gym, and Meta-World benchmarks. |
Chaoqi Wang · Yuxin Chen · Kevin Murphy 🔗 |
-
|
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
(
Poster
)
link »
Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery, with 80-90% success and only 10 minutes' worth of demonstrations. |
Tony Zhao · Vikash Kumar · Sergey Levine · Chelsea Finn 🔗 |
-
|
On the Imitation of Non-Markovian Demonstrations: From Low-Level Stability to High-Level Planning
(
Poster
)
link »
We propose a theoretical framework for studying the imitation of stochastic, non-Markovian, potentially multi-modal expert demonstrations in nonlinear dynamical systems. Our framework invokes low-level controllers - either learned or implicit in position-command control - to stabilize imitation policies around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a stochastic continuity property of the learned policy we call ``total variation continuity'' (TVC), an imitator that accurately estimates actions on the demonstrator's state distribution closely matches the demonstrator's distribution over entire trajectories. We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations. |
Adam Block · Daniel Pfrommer · Max Simchowitz 🔗 |
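One concrete recommendation in the abstract above is to add the training-time augmentation noise to observations at execution time as well. The sketch below illustrates that trick for a generic gym-style environment; `policy`, `env`, and the noise scale are placeholders, and the paper's full pipeline (low-level stabilization, diffusion policies) is not reproduced.

```python
import numpy as np

def rollout_with_execution_noise(policy, env, sigma=0.05, horizon=200, seed=0):
    """Roll out an imitation policy while perturbing the *observed* state with
    the same Gaussian noise used for data augmentation at training time.
    `policy` and `env` are placeholders for a trained policy and a gym-style
    environment; this sketches the execution-time trick only."""
    rng = np.random.default_rng(seed)
    obs, _ = env.reset(seed=seed)
    states, actions = [], []
    for _ in range(horizon):
        noisy_obs = obs + sigma * rng.normal(size=np.shape(obs))
        act = policy(noisy_obs)          # act on the noise-augmented observation
        obs, _, terminated, truncated, _ = env.step(act)
        states.append(obs)
        actions.append(act)
        if terminated or truncated:
            break
    return states, actions
```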
-
|
On Convergence of Approximate Schrödinger Bridge with Bounded Cost
(
Poster
)
link »
The Schrödinger bridge has demonstrated promising applications in generative models. It is an entropy-regularized optimal-transport (EOT) approach that employs the iterative proportional fitting (IPF) algorithm to solve an alternating projection problem. However, due to the complexity of finding precise solutions for the projections, approximations are often required. In this work, we study the convergence of the IPF algorithm with approximated projections and a bounded cost function. Our results demonstrate approximate linear convergence with bounded perturbations. While the outcome is not unexpected, the rapid linear convergence towards smooth trajectories suggests the potential to examine the efficiency of the Schrödinger bridge compared to diffusion models. |
Wei Deng · Yu Chen · Tianjiao N Yang · Hengrong Du · Qi Feng · Ricky T. Q. Chen 🔗 |
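For readers unfamiliar with IPF, the sketch below shows its classical discrete-space instance (Sinkhorn iterations for entropy-regularized OT with a bounded cost matrix), which alternates exact projections onto the two marginal constraints. The paper's setting of approximate projections on path space is not reproduced here, and all names are illustrative.

```python
import numpy as np

def ipf_sinkhorn(mu, nu, C, eps=0.1, n_iters=200):
    """Discrete iterative proportional fitting (Sinkhorn) for entropic OT.
    mu, nu: source/target marginals (each summing to 1); C: bounded cost matrix.
    Alternately projects the coupling onto the two marginal constraints."""
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)               # project onto the row-marginal constraint
        v = nu / (K.T @ u)             # project onto the column-marginal constraint
    return u[:, None] * K * v[None, :]  # approximate entropic-OT coupling

# Example: two small discrete measures with a squared-distance cost.
x = np.linspace(0, 1, 50)
C = (x[:, None] - x[None, :]) ** 2
mu = np.ones(50) / 50
nu = np.exp(-(x - 0.7) ** 2 / 0.01); nu /= nu.sum()
pi = ipf_sinkhorn(mu, nu, C)
# Residual marginal errors shrink as the iterations converge.
print(np.abs(pi.sum(axis=1) - mu).max(), np.abs(pi.sum(axis=0) - nu).max())
```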
-
|
Online Control with Adversarial Disturbance for Continuous-time Linear Systems
(
Poster
)
link »
We study online control for continuous-time linear systems with finite sampling rates, where the objective is to design an online procedure that learns under non-stochastic noise and performs comparably to a fixed optimal linear controller. We present a novel two-level online algorithm that integrates a higher-level learning strategy and a lower-level feedback control strategy. This method offers a practical and robust solution for online control, and it achieves sublinear regret. Our work provides one of the first nonasymptotic results for controlling continuous-time linear systems with a finite number of interactions with the system. |
Jingwei Li · Jing Dong · Baoxiang Wang · Jingzhao Zhang 🔗 |
-
|
A Flexible Diffusion Model
(
Poster
)
link »
Denoising diffusion (score-based) generative models have been widely used for modeling various types of complex data, including images, audio, point clouds, and biomolecules. Recently, the deep connection between forward-backward stochastic differential equations (SDEs) and diffusion-based models has been revealed, and several new variants of SDEs have been proposed along this line (e.g., sub-VP, critically-damped Langevin). Despite the empirical success of several hand-crafted forward SDEs, a great number of potentially promising forward SDEs remain unexplored. In this work, we propose a general framework for parameterizing diffusion models, especially the spatial part of the forward SDEs. A systematic formalism is introduced with theoretical guarantees, and its connection with previous diffusion models is leveraged. Finally, we demonstrate the theoretical advantage of our method from the variational optimization perspective. Numerical experiments on synthetic datasets, MNIST, and CIFAR10 are presented to validate the effectiveness of our framework. |
weitao du · He Zhang · Tao Yang · Yuanqi Du 🔗 |
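For context on the hand-crafted forward SDEs that such a parameterized framework generalizes, the sketch below simulates the standard variance-preserving (VP) forward SDE with an Euler-Maruyama discretization; the schedule and step counts are illustrative, and this is not the parameterization proposed in the paper.

```python
import numpy as np

def vp_sde_forward(x0, n_steps=1000, beta_min=0.1, beta_max=20.0, seed=0):
    """Euler-Maruyama simulation of the standard VP forward SDE
        dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dW,   t in [0, 1],
    one of the hand-crafted forward SDEs that a parameterized framework
    would generalize. Returns the full noising path."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for i in range(n_steps):
        t = i * dt
        beta_t = beta_min + t * (beta_max - beta_min)   # linear beta schedule
        drift = -0.5 * beta_t * x
        x = x + drift * dt + np.sqrt(beta_t * dt) * rng.normal(size=x.shape)
        path.append(x.copy())
    return np.array(path)

# A data point is progressively noised toward an approximately standard Gaussian.
path = vp_sde_forward(x0=np.array([3.0, -2.0]))
print(path[0], path[-1])
```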
-
|
Synthetic Experience Replay
(
Poster
)
link »
A key theme in the past decade has been that when large neural networks and large datasets combine they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to flexibly upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings, in both proprioceptive and pixel-based environments. In offline settings, we observe drastic improvements when upsampling small offline datasets and see that additional synthetic data also allows us to effectively train larger networks. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a significant increase in sample efficiency, without any algorithmic changes. Finally, we open-source our code at https://anonymous.4open.science/r/synther-E717/. |
Cong Lu · Philip Ball · Yee-Whye Teh · Jack Parker-Holder 🔗 |
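A minimal way to use upsampled experience, assuming a generative model has already produced synthetic transitions, is to mix them with real ones when sampling training batches. The sketch below illustrates that pattern only; buffer names and the mixing ratio are illustrative, and the diffusion-based upsampling itself is not shown.

```python
import numpy as np

def sample_mixed_batch(real_buffer, synthetic_buffer, batch_size=256,
                       synth_ratio=0.5, seed=0):
    """Sample a training batch mixing real and generated transitions.
    `real_buffer` / `synthetic_buffer` are arrays of transitions; the synthetic
    ones are assumed to come from a generative (e.g. diffusion) model trained
    on the real data. The ratio is a free hyperparameter, not the paper's."""
    rng = np.random.default_rng(seed)
    n_synth = int(batch_size * synth_ratio)
    n_real = batch_size - n_synth
    real_idx = rng.integers(0, len(real_buffer), size=n_real)
    synth_idx = rng.integers(0, len(synthetic_buffer), size=n_synth)
    return np.concatenate([real_buffer[real_idx],
                           synthetic_buffer[synth_idx]], axis=0)
```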
-
|
Fairness In a Non-Stationary Environment From an Optimal Control Perspective
(
Poster
)
link »
The performance of state-of-the-art machine learning models is observed to degrade in scenarios involving under-represented demographic populations during training. This issue has been extensively studied within a supervised learning framework where the data distribution remains unchanged. Nonetheless, real-world use cases often encounter distribution shifts induced by the models in deployment. For example, performance bias against minority users can affect customer retention rates, thereby skewing the available data from active users due to the absence of minority user input. This feedback effect further exacerbates the discrepancy across demographic groups in subsequent time steps. To mitigate this problem, we introduce asymptotic fairness, a criterion that aims at preserving sustained model performance across all demographic populations. In addition, we construct a surrogate retention system, based on existing literature on evolutionary population dynamics, to approximate the dynamics of distribution shifts on active user counts. This system allows the aim of achieving asymptotic fairness to be formulated as an optimal control problem. To evaluate the effectiveness of the proposed method, we design a generic simulation environment that simulates the population dynamics of the feedback effect between user retention and model performance. When the models are deployed in this simulation environment, the optimal control solution, by considering long-term planning, outperforms existing baseline methods. |
Zhuotong Chen · Qianxiao Li · Zheng Zhang 🔗 |
-
|
Physics-informed Localized Learning for Advection-Diffusion-Reaction Systems
(
Poster
)
link »
The global push for new energy solutions, such as geothermal energy and carbon capture and sequestration initiatives, has thrust new demands upon current state-of-the-art subsurface fluid simulators. The requirement to simulate a large number of reservoir states simultaneously in a short period of time has opened the door of opportunity for the application of machine learning techniques to surrogate modelling. We propose a novel physics-informed and boundary conditions-aware Localized Learning method which extends the Embed-to-Control (E2C) and Embed-to-Control and Observed (E2CO) models to learn local representations of global state variables in an Advection-Diffusion-Reaction system. We show that our model, trained on reservoir simulation data, is able to predict future states of the system for a given set of controls with a high degree of accuracy using only a fraction of the available information. It hence reduces training times significantly compared to the original E2C and E2CO models, which benefits its application to optimal control problems. |
Surya Sathujoda · Soham Sheth 🔗 |
-
|
On the effectiveness of neural priors in modeling dynamical systems
(
Poster
)
link »
Modelling dynamical systems is an integral component of understanding the natural world. To this end, neural networks are becoming an increasingly popular candidate owing to their ability to learn complex functions from large amounts of data. Despite this recent progress, there has not been an adequate discussion of the architectural regularization that neural networks offer when learning such systems, hindering their efficient usage. In this paper, we initiate a discussion in this direction using coordinate networks as a test bed. We interpret dynamical systems and coordinate networks from a signal processing lens, and show that simple coordinate networks with few layers can be used to solve multiple problems in modelling dynamical systems, without any explicit regularizers. |
Sameera Ramasinghe · Hemanth Saratchandran · Violetta Shevchenko · Simon Lucey 🔗 |
-
|
Bridging Physics-Informed Neural Networks with Reinforcement Learning: Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO)
(
Poster
)
link »
This paper introduces the Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO) algorithm for reinforcement learning. The Hamilton-Jacobi-Bellman (HJB) equation is used in control theory to evaluate the optimality of the value function. Our work combines the HJB equation with reinforcement learning in continuous state and action spaces to improve the training of the value network. We treat the value network as a Physics-Informed Neural Network (PINN) to solve the HJB equation by computing its derivatives with respect to its inputs exactly. The Proximal Policy Optimization (PPO)-Clipped algorithm is improved with this implementation, as it uses a value network to compute the objective function for its policy network. The HJBPPO algorithm shows improved performance compared to PPO on the MuJoCo environments. |
Amartya Mukherjee · Jun Liu 🔗 |
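To make the PINN treatment of the value network concrete, the sketch below penalizes the squared residual of a generic discounted, policy-evaluation form of an HJB-type equation, r(x, pi(x)) + grad V(x) . f(x, pi(x)) - rho * V(x) = 0, using exact autograd derivatives. The specific equation, dynamics, and reward used by HJBPPO are not reproduced; all callables and constants here are placeholders.

```python
import torch
import torch.nn as nn

def hjb_residual_loss(value_net, policy, dynamics, reward, states, rho=0.1):
    """Mean squared residual of a discounted HJB-type equation, evaluated with
    exact autograd derivatives of the value network (PINN-style). `policy`,
    `dynamics`, and `reward` are placeholder callables."""
    states = states.clone().requires_grad_(True)
    V = value_net(states)                                             # (B, 1)
    grad_V = torch.autograd.grad(V.sum(), states, create_graph=True)[0]  # (B, d)
    a = policy(states)
    f = dynamics(states, a)                                           # drift (B, d)
    r = reward(states, a)                                             # (B, 1)
    residual = r + (grad_V * f).sum(dim=-1, keepdim=True) - rho * V
    return (residual ** 2).mean()

# Minimal usage with toy linear dynamics and a quadratic reward.
d, B = 4, 64
value_net = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 1))
policy = lambda x: -0.5 * x                                # placeholder policy
dynamics = lambda x, a: -x + a                             # placeholder drift
reward = lambda x, a: -(x ** 2).sum(-1, keepdim=True) - (a ** 2).sum(-1, keepdim=True)
loss = hjb_residual_loss(value_net, policy, dynamics, reward, torch.randn(B, d))
loss.backward()  # gradients flow into the value network's parameters
```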
-
|
Actor-Critic Methods using Physics-Informed Neural Networks: Control of a 1D PDE Model for Fluid-Cooled Battery Packs
(
Poster
)
link »
This paper proposes an actor-critic algorithm for controlling the temperature of a battery pack using a cooling fluid. The system is modeled by a coupled 1D partial differential equation (PDE) with a controlled advection term that determines the speed of the cooling fluid. The Hamilton-Jacobi-Bellman (HJB) equation is a PDE that evaluates the optimality of the value function and determines an optimal controller. We propose an algorithm that treats the value network as a Physics-Informed Neural Network (PINN) to solve the continuous-time HJB equation rather than a discrete-time Bellman optimality equation, and we derive from it an optimal controller for the environment. Our experiments show that a hybrid-policy method that updates the value network using the HJB equation and updates the policy network identically to PPO achieves the best results in the control of this PDE system. |
Amartya Mukherjee · Jun Liu 🔗 |
-
|
Optimization or Architecture: What Matters in Non-Linear Filtering?
(
Poster
)
link »
In non-linear filtering, it is traditional to compare non-linear architectures such as neural networks to the standard linear Kalman Filter (KF). We observe that this methodology mixes the evaluation of two separate components: the non-linear architecture, and the numeric optimization method. In particular, the non-linear model is often optimized, whereas the reference KF model is not. We argue that both should be optimized similarly. We suggest the Optimized KF (OKF), which adjusts numeric optimization to the positive-definite KF parameters. We demonstrate how a significant advantage of a neural network over the KF may entirely vanish once the KF is optimized using OKF. This implies that experimental conclusions of certain previous studies were derived from a flawed process. The benefits of OKF over the non-optimized KF are further studied theoretically and empirically, where OKF demonstrates consistently improved accuracy in a variety of problems. |
Ido Greenberg · Netanel Yannay · Shie Mannor 🔗 |
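The abstract above argues that the KF's noise parameters should be optimized like any other model. As a rough illustration of what that involves, the sketch below parameterizes the process and measurement noise covariances through Cholesky factors (so they remain positive-definite) and makes the filter differentiable end to end; the exact OKF objective and optimization procedure are not reproduced, and all names are illustrative.

```python
import torch
import torch.nn as nn

class OptimizableKF(nn.Module):
    """Linear Kalman filter whose noise covariances Q, R are learned.
    They are parameterized through Cholesky factors so that they stay
    positive-definite during gradient-based optimization. This sketches the
    general idea of optimizing KF parameters, not the paper's exact OKF."""
    def __init__(self, F, H, dim_x, dim_z):
        super().__init__()
        self.F, self.H = F, H                            # known dynamics / observation
        self.Lq = nn.Parameter(torch.eye(dim_x) * 0.1)   # Cholesky factor of Q
        self.Lr = nn.Parameter(torch.eye(dim_z) * 0.1)   # Cholesky factor of R

    def forward(self, zs, x0, P0):
        Q = self.Lq @ self.Lq.T
        R = self.Lr @ self.Lr.T
        x, P, xs = x0, P0, []
        for z in zs:                                     # iterate over measurements
            # Predict step.
            x = self.F @ x
            P = self.F @ P @ self.F.T + Q
            # Update step.
            S = self.H @ P @ self.H.T + R
            K = P @ self.H.T @ torch.linalg.inv(S)
            x = x + K @ (z - self.H @ x)
            P = (torch.eye(P.shape[0]) - K @ self.H) @ P
            xs.append(x)
        return torch.stack(xs)

# Fitting Q and R would then be standard gradient descent on a filtered-state
# error, e.g. loss = ((kf(zs, x0, P0) - true_states) ** 2).mean(); loss.backward().
```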
-
|
Gradient-free training of neural ODEs for system identification and control using ensemble Kalman inversion
(
Poster
)
link »
Ensemble Kalman inversion (EKI) is a sequential Monte Carlo method used to solve inverse problems within a Bayesian framework. Unlike backpropagation, EKI is a gradient-free optimization method that only necessitates the evaluation of artificial neural networks in forward passes. In this study, we examine the effectiveness of EKI in training neural ordinary differential equations (neural ODEs) for system identification and control tasks. To apply EKI to optimal control problems, we formulate inverse problems that incorporate a Tikhonov-type regularization term. Our numerical results demonstrate that EKI is an efficient method for training neural ODEs in system identification and optimal control problems, with runtime and quality of solutions that are competitive with commonly used gradient-based optimizers. |
Lucas Böttcher 🔗 |
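The sketch below shows one generic ensemble Kalman inversion update for a parameter vector fitting observations y = G(theta), using only forward evaluations of G (gradient-free). The Tikhonov regularization term and the neural-ODE forward map from the paper are omitted, and all names are illustrative.

```python
import numpy as np

def eki_step(thetas, G, y, Gamma, rng):
    """One ensemble Kalman inversion (EKI) update.
    thetas: (J, p) ensemble of parameter vectors; G: forward map theta -> (m,);
    y: (m,) observations; Gamma: (m, m) observation-noise covariance."""
    J = thetas.shape[0]
    g = np.stack([G(t) for t in thetas])            # (J, m) forward evaluations
    dtheta = thetas - thetas.mean(0)
    dg = g - g.mean(0)
    C_tg = dtheta.T @ dg / J                        # cross-covariance (p, m)
    C_gg = dg.T @ dg / J                            # output covariance (m, m)
    K = C_tg @ np.linalg.inv(C_gg + Gamma)          # Kalman-type gain (p, m)
    # Perturbed observations keep the ensemble spread consistent with Gamma.
    noise = rng.multivariate_normal(np.zeros(len(y)), Gamma, size=J)
    return thetas + (y + noise - g) @ K.T

# Toy usage: recover the parameters of a linear forward map.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = A @ theta_true
G = lambda t: A @ t
thetas = rng.normal(size=(100, 3))
Gamma = 1e-4 * np.eye(5)
for _ in range(20):
    thetas = eki_step(thetas, G, y, Gamma, rng)
print(thetas.mean(0))  # approaches theta_true
```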
-
|
What is the Solution for State-Adversarial Multi-Agent Reinforcement Learning?
(
Poster
)
link »
Various methods for Multi-Agent Reinforcement Learning (MARL) have been developed with the assumption that agents' policies are based on accurate state information. However, policies learned through Deep Reinforcement Learning (DRL) are susceptible to adversarial state perturbation attacks. In this work, we propose a State-Adversarial Markov Game (SAMG) and make the first attempt to investigate the fundamental properties of MARL under state uncertainties. Our analysis shows that the commonly used solution concepts of optimal agent policy and robust Nash equilibrium do not always exist in SAMGs. To circumvent this difficulty, we consider a new solution concept called robust agent policy, where agents aim to maximize the worst-case expected state value. We prove the existence of robust agent policy for finite state and finite action SAMGs. Additionally, we propose a Robust Multi-Agent Adversarial Actor-Critic (RMA3C) algorithm to learn robust policies for MARL agents under state uncertainties. Our experiments demonstrate that our algorithm outperforms existing methods when faced with state perturbations and greatly improves the robustness of MARL policies. |
Songyang Han · Sanbao Su · Sihong He · Shuo Han · Haizhao Yang · Fei Miao 🔗 |
Author Information
Valentin De Bortoli (CNRS, ENS Ulm (projet NORIA))
Charlotte Bunne (ETH Zurich)
Guan-Horng Liu (Georgia Institute of Technology)
Tianrong Chen (Georgia Institute of Technology)
Maxim Raginsky
Pratik Chaudhari (UPenn, AWS)
Melanie Zeilinger (ETH Zurich)
Animashree Anandkumar (Caltech and NVIDIA)
More from the Same Authors
-
2021 : Continuous Doubly Constrained Batch Reinforcement Learning »
Rasool Fakoor · Jonas Mueller · Kavosh Asadi · Pratik Chaudhari · Alex Smola -
2022 : Recovering Stochastic Dynamics via Gaussian Schrödinger Bridges »
Ya-Ping Hsieh · Charlotte Bunne · Marco Cuturi · Andreas Krause -
2022 : Physics-Informed Neural Operator for Learning Partial Differential Equations »
Zongyi Li · Hongkai Zheng · Nikola Kovachki · David Jin · Haoxuan Chen · Burigede Liu · Kamyar Azizzadenesheli · Animashree Anandkumar -
2022 : Riemannian Diffusion Schrödinger Bridge »
James Thornton · Valentin De Bortoli · Michael Hutchinson · Emile Mathieu · Yee Whye Teh · Arnaud Doucet -
2022 : Recovering Stochastic Dynamics via Gaussian Schrödinger Bridges »
Charlotte Bunne · Ya-Ping Hsieh · Marco Cuturi · Andreas Krause -
2023 : The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold »
Jialin Mao · Han Kheng Teoh · Itay Griniasty · Rahul Ramesh · Rubing Yang · Mark Transtrum · James Sethna · Pratik Chaudhari -
2023 : Unbalanced Diffusion Schrödinger Bridge »
Matteo Pariset · Ya-Ping Hsieh · Charlotte Bunne · Andreas Krause · Valentin De Bortoli -
2023 : Aligned Diffusion Schrödinger Bridges »
Vignesh Ram Somnath · Matteo Pariset · Ya-Ping Hsieh · Maria Rodriguez Martinez · Andreas Krause · Charlotte Bunne -
2023 : Improved sampling via learned diffusions »
Julius Berner · Lorenz Richter · Guan-Horng Liu -
2023 : Game Theoretic Neural ODE Optimizer »
Panagiotis Theodoropoulos · Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou -
2023 : Budgeting Counterfactual for Offline RL »
Yao Liu · Pratik Chaudhari · Rasool Fakoor -
2023 : LeanDojo: Theorem Proving with Retrieval-Augmented Language Models »
Kaiyu Yang · Aidan Swope · Alexander Gu · Rahul Chalamala · Shixing Yu · Saad Godil · Ryan Prenger · Animashree Anandkumar -
2023 : On The Ability of Transformers To Learn Recursive Patterns »
Dylan Zhang · Curt Tigges · Talia Ringer · Stella Biderman · Maxim Raginsky -
2023 : Panel Discussion »
Chenlin Meng · Yang Song · Yilun Xu · Ricky T. Q. Chen · Charlotte Bunne · Arash Vahdat -
2023 Poster: The Value of Out-of-Distribution Data »
Ashwin De Silva · Rahul Ramesh · Carey Priebe · Pratik Chaudhari · Joshua Vogelstein -
2023 Poster: I²SB: Image-to-Image Schrödinger Bridge »
Guan-Horng Liu · Arash Vahdat · De-An Huang · Evangelos Theodorou · Weili Nie · Anima Anandkumar -
2023 Poster: A Picture of the Space of Typical Learnable Tasks »
Rahul Ramesh · Jialin Mao · Itay Griniasty · Rubing Yang · Han Kheng Teoh · Mark Transtrum · James Sethna · Pratik Chaudhari -
2023 Poster: SE(3) diffusion model with application to protein backbone generation »
Jason Yim · Brian Trippe · Valentin De Bortoli · Emile Mathieu · Arnaud Doucet · Regina Barzilay · Tommi Jaakkola -
2023 Tutorial: Optimal Transport in Learning, Control, and Dynamical Systems »
Charlotte Bunne · Marco Cuturi -
2022 : Q/A: Melanie Zeilinger »
Melanie Zeilinger -
2022 : Invited Talk: Melanie Zeilinger »
Melanie Zeilinger -
2022 Poster: Diffusion Models for Adversarial Purification »
Weili Nie · Brandon Guo · Yujia Huang · Chaowei Xiao · Arash Vahdat · Animashree Anandkumar -
2022 Poster: Does the Data Induce Capacity Control in Deep Learning? »
Rubing Yang · Jialin Mao · Pratik Chaudhari -
2022 Spotlight: Diffusion Models for Adversarial Purification »
Weili Nie · Brandon Guo · Yujia Huang · Chaowei Xiao · Arash Vahdat · Animashree Anandkumar -
2022 Spotlight: Does the Data Induce Capacity Control in Deep Learning? »
Rubing Yang · Jialin Mao · Pratik Chaudhari -
2022 Poster: Deep Reference Priors: What is the best way to pretrain a model? »
Yansong Gao · Rahul Ramesh · Pratik Chaudhari -
2022 Spotlight: Deep Reference Priors: What is the best way to pretrain a model? »
Yansong Gao · Rahul Ramesh · Pratik Chaudhari -
2022 Poster: Langevin Monte Carlo for Contextual Bandits »
Pan Xu · Hongkai Zheng · Eric Mazumdar · Kamyar Azizzadenesheli · Animashree Anandkumar -
2022 Poster: Understanding The Robustness in Vision Transformers »
Zhou Daquan · Zhiding Yu · Enze Xie · Chaowei Xiao · Animashree Anandkumar · Jiashi Feng · Jose M. Alvarez -
2022 Spotlight: Understanding The Robustness in Vision Transformers »
Zhou Daquan · Zhiding Yu · Enze Xie · Chaowei Xiao · Animashree Anandkumar · Jiashi Feng · Jose M. Alvarez -
2022 Spotlight: Langevin Monte Carlo for Contextual Bandits »
Pan Xu · Hongkai Zheng · Eric Mazumdar · Kamyar Azizzadenesheli · Animashree Anandkumar -
2021 : Spotlight Set 2-3 | Multi-Scale Representation Learning on Proteins »
Workshop CompBio · Charlotte Bunne -
2021 : Morning Poster Session: JKOnet: Proximal Optimal Transport Modeling of Population Dynamics »
Charlotte Bunne -
2021 : Invited Speaker: Animashree Anandkumar: Stability-aware reinforcement learning in dynamical systems »
Animashree Anandkumar -
2021 : Contributed Talk: JKOnet: Proximal Optimal Transport Modeling of Population Dynamics »
Charlotte Bunne -
2021 : Invited Talk: Maxim Raginsky »
Maxim Raginsky -
2021 Workshop: Workshop on Socially Responsible Machine Learning »
Chaowei Xiao · Animashree Anandkumar · Mingyan Liu · Dawn Song · Raquel Urtasun · Jieyu Zhao · Xueru Zhang · Cihang Xie · Xinyun Chen · Bo Li -
2021 Poster: An Information-Geometric Distance on the Space of Tasks »
Yansong Gao · Pratik Chaudhari -
2021 Spotlight: An Information-Geometric Distance on the Space of Tasks »
Yansong Gao · Pratik Chaudhari -
2021 Poster: Dynamic Game Theoretic Neural Optimizer »
Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou -
2021 Oral: Dynamic Game Theoretic Neural Optimizer »
Guan-Horng Liu · Tianrong Chen · Evangelos Theodorou -
2020 : Q&A: Anima Anandakumar »
Animashree Anandkumar · Jessica Forde -
2020 : Invited Talks: Anima Anandakumar »
Animashree Anandkumar -
2020 Poster: A Free-Energy Principle for Representation Learning »
Yansong Gao · Pratik Chaudhari -
2020 : Mentoring Panel: Doina Precup, Deborah Raji, Anima Anandkumar, Angjoo Kanazawa and Sinead Williamson (moderator). »
Doina Precup · Inioluwa Raji · Angjoo Kanazawa · Sinead A Williamson · Animashree Anandkumar -
2019 Poster: Learning Generative Models across Incomparable Spaces »
Charlotte Bunne · David Alvarez-Melis · Andreas Krause · Stefanie Jegelka -
2019 Oral: Learning Generative Models across Incomparable Spaces »
Charlotte Bunne · David Alvarez-Melis · Andreas Krause · Stefanie Jegelka -
2018 Poster: StrassenNets: Deep Learning with a Multiplication Budget »
Michael Tschannen · Aran Khanna · Animashree Anandkumar -
2018 Oral: StrassenNets: Deep Learning with a Multiplication Budget »
Michael Tschannen · Aran Khanna · Animashree Anandkumar