In reinforcement learning (RL), a reward function is typically assumed to be fixed at the outset of policy optimization. Learning under such a fixed-reward paradigm can neglect important policy optimization considerations, such as state-space coverage and safety, and can fail to encompass broader impacts on social welfare, sustainability, or market stability, potentially leading to undesirable emergent behavior and misaligned policies. To mathematically encapsulate the problem of aligning RL policy optimization with such externalities, we consider a bilevel optimization problem and connect it to a principal-agent framework: the principal specifies the broader goals and constraints of the system at the upper level, while the agent solves a Markov Decision Process (MDP) at the lower level. The upper level learns a reward parametrization consistent with the broader goals, and the lower level learns the agent's policy under that reward. We propose Principal-driven Policy Alignment via Bilevel RL (PPA-BRL), which efficiently aligns the agent's policy with the principal's goals. We explicitly analyze the dependence of the principal's trajectory on the lower-level policy and prove the convergence of PPA-BRL to a stationary point of the problem. We illustrate the merits of this framework with several alignment examples spanning energy-efficient manipulation tasks, social welfare-based tax design, and cost-effective robotic navigation.
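The bilevel structure can be made concrete with a small sketch. The Python code below is illustrative only and is not the paper's implementation: the function names (lower_level_policy, principal_step) are hypothetical, the lower level is reduced to a softmax policy-gradient solver on a linearly parametrized reward r_nu(s, a) = phi(s, a)^T nu, and the upper level uses a finite-difference gradient estimate in place of the analytic hypergradient that PPA-BRL's analysis relies on.

import numpy as np

def lower_level_policy(nu, phi, n_iters=100, lr=0.5):
    # Agent (lower level): softmax policy gradient on the parametrized reward
    # r_nu(s, a) = phi(s, a) . nu. A one-step (bandit-style) MDP is assumed
    # so the sketch stays self-contained.
    rewards = phi @ nu                                    # (n_states, n_actions)
    logits = np.zeros_like(rewards)
    for _ in range(n_iters):
        z = np.exp(logits - logits.max(axis=1, keepdims=True))
        policy = z / z.sum(axis=1, keepdims=True)
        advantage = rewards - (policy * rewards).sum(axis=1, keepdims=True)
        logits += lr * policy * advantage                 # exact policy-gradient step
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def principal_step(nu, phi, principal_utility, lr=0.1, eps=1e-2):
    # Principal (upper level): ascend its utility evaluated at the agent's
    # best response to the current reward parameters. A finite-difference
    # estimate stands in for the analytic hypergradient.
    base = principal_utility(lower_level_policy(nu, phi))
    grad = np.zeros_like(nu)
    for i in range(nu.size):
        nu_pert = nu.copy()
        nu_pert[i] += eps
        grad[i] = (principal_utility(lower_level_policy(nu_pert, phi)) - base) / eps
    return nu + lr * grad

# Toy usage: 3 states, 2 actions, 4 reward features; the (hypothetical)
# principal prefers policies that pick action 0, e.g. a low-energy action.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 2, 4))
nu = np.zeros(4)
for _ in range(25):
    nu = principal_step(nu, phi, lambda policy: policy[:, 0].mean())

Note that the agent never observes the principal's utility directly; alignment is mediated entirely through the learned reward parameters nu, mirroring the separation between the two levels described above.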
Author Information
Souradip Chakraborty (University of Maryland, College Park)
Amrit Bedi (University of Maryland, College Park)
Alec Koppel (JP Morgan Chase AI Research)
Bio: Alec Koppel has been a Team Lead/VP at JP Morgan Chase AI Research since June 2022. Previously, he was a Research Scientist within Supply Chain Optimization Technologies (SCOT) at Amazon from 2021 to 2022, and before that a Research Scientist at the U.S. Army Research Laboratory in the Computational and Information Sciences Directorate from 2017 to 2021. He completed his Master's degree in Statistics and Doctorate in Electrical and Systems Engineering, both at the University of Pennsylvania (Penn), in August 2017. Before coming to Penn, he completed his Master's degree in Systems Science and Mathematics and Bachelor's degree in Mathematics, both at Washington University in St. Louis (WashU), Missouri. He is a recipient of the 2016 UPenn ESE Dept. Award for Exceptional Service, an awardee of the Science, Mathematics, and Research for Transformation (SMART) Scholarship, a co-author of a Best Paper finalist at the 2017 IEEE Asilomar Conference on Signals, Systems, and Computers, a finalist for the ARL Honorable Scientist Award in 2019, an awardee of the 2020 ARL Director's Research Award Translational Research Challenge (DIRA-TRC), a recipient of a 2020 Honorable Mention from IEEE Robotics and Automation Letters, and mentor to the 2021 ARL Summer Symposium Best Project Awardee. His research interests are in optimization and machine learning. His academic work focuses on approximate Bayesian inference, reinforcement learning, and decentralized optimization. Applications include robotics and autonomy, sourcing and vendor selection, and financial markets.
Furong Huang (University of Maryland)

Furong Huang is an Assistant Professor in the Department of Computer Science at the University of Maryland. She works on statistical and trustworthy machine learning, reinforcement learning, graph neural networks, deep learning theory, and federated learning, specializing in domain adaptation, algorithmic robustness, and fairness. Furong is a recipient of the MIT Technology Review Innovators Under 35 Asia Pacific Award, the MLconf Industry Impact Research Award, the NSF CRII Award, the Adobe Faculty Research Award, and three JP Morgan Faculty Research Awards, and a finalist for AI Researcher of the Year (AI in Research) at the Women in AI Awards North America. She received her Ph.D. in electrical engineering and computer science from UC Irvine in 2016, after which she spent one year as a postdoctoral researcher at Microsoft Research NYC.
Mengdi Wang (Princeton University)
More from the Same Authors
- 2022: Everyone Matters: Customizing the Dynamics of Decision Boundary for Adversarial Robustness
  Yuancheng Xu · Yanchao Sun · Furong Huang
- 2022: Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy
  Xiyao Wang · Wichayaporn Wongkamjan · Furong Huang
- 2022: Certifiably Robust Multi-Agent Reinforcement Learning against Adversarial Communication
  Yanchao Sun · Ruijie Zheng · Parisa Hassanzadeh · Yongyuan Liang · Soheil Feizi · Sumitra Ganesh · Furong Huang
- 2022: Efficient Adversarial Training without Attacking: Worst-Case-Aware Robust Reinforcement Learning
  Yongyuan Liang · Yanchao Sun · Ruijie Zheng · Furong Huang
- 2023: Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
  Yongyuan Liang · Yanchao Sun · Ruijie Zheng · Xiangyu Liu · Tuomas Sandholm · Furong Huang · Stephen McAleer
- 2023: Equal Long-term Benefit Rate: Adapting Static Fairness Notions to Sequential Decision Making
  Yuancheng Xu · Chenghao Deng · Yanchao Sun · Ruijie Zheng · Xiyao Wang · Jieyu Zhao · Furong Huang
- 2023: Reviving Shift Equivariance in Vision Transformers
  Peijian Ding · Davit Soselia · Thomas Armstrong · Jiahao Su · Furong Huang
- 2023: C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder
  Xiaoyu Liu · Jiaxin Yuan · Bang An · Yuancheng Xu · Yifan Yang · Furong Huang
- 2023: Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations
  Minshuo Chen · Yu Bai · H. Vincent Poor · Mengdi Wang
- 2023: Mental Calibration: Discovering and Adjusting for Latent Factors Improves Zero-Shot Inference of CLIP
  Bang An · Sicheng Zhu · Michael-Andrei Panaitescu-Liess · Chaithanya Kumar Mummadi · Furong Huang
- 2023: Scaling In-Context Demonstrations with Structured Attention
  Tianle Cai · Kaixuan Huang · Jason Lee · Mengdi Wang · Danqi Chen
- 2023: Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight
  Jiacheng Guo · Minshuo Chen · Huan Wang · Caiming Xiong · Mengdi Wang · Yu Bai
- 2023: Visual Adversarial Examples Jailbreak Aligned Large Language Models
  Xiangyu Qi · Kaixuan Huang · Ashwinee Panda · Mengdi Wang · Prateek Mittal
- 2023 Poster: Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic
  Wesley A. Suttle · Amrit Bedi · Bhrij Patel · Brian Sadler · Alec Koppel · Dinesh Manocha
- 2023 Poster: Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data
  Minshuo Chen · Kaixuan Huang · Tuo Zhao · Mengdi Wang
- 2023 Poster: STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning
  Souradip Chakraborty · Amrit Bedi · Alec Koppel · Mengdi Wang · Furong Huang · Dinesh Manocha
- 2023 Poster: Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy
  Xiyao Wang · Wichayaporn Wongkamjan · Ruonan Jia · Furong Huang
- 2023 Poster: Provably Efficient Representation Learning with Tractable Planning in Low-Rank POMDP
  Jiacheng Guo · Zihao Li · Huazheng Wang · Mengdi Wang · Zhuoran Yang · Xuezhou Zhang
- 2023 Poster: Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories
  Zixuan Zhang · Minshuo Chen · Mengdi Wang · Wenjing Liao · Tuo Zhao
- 2023 Poster: Learning Unforeseen Robustness from Out-of-distribution Data Using Equivariant Domain Translator
  Sicheng Zhu · Bang An · Furong Huang · Sanghyun Hong
- 2022: Policy Gradient: Theory for Making Best Use of It
  Mengdi Wang
- 2022 Poster: On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
  Amrit Singh Bedi · Souradip Chakraborty · Anjaly Parayil · Brian Sadler · Pratap Tokekar · Alec Koppel
- 2022 Poster: Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning approach
  Xuezhou Zhang · Yuda Song · Masatoshi Uehara · Mengdi Wang · Alekh Agarwal · Wen Sun
- 2022 Poster: Optimal Estimation of Policy Gradient via Double Fitted Iteration
  Chengzhuo Ni · Ruiqi Zhang · Xiang Ji · Xuezhou Zhang · Mengdi Wang
- 2022 Poster: Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory
  Ruiqi Zhang · Xuezhou Zhang · Chengzhuo Ni · Mengdi Wang
- 2022 Spotlight: Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning approach
  Xuezhou Zhang · Yuda Song · Masatoshi Uehara · Mengdi Wang · Alekh Agarwal · Wen Sun
- 2022 Spotlight: On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
  Amrit Singh Bedi · Souradip Chakraborty · Anjaly Parayil · Brian Sadler · Pratap Tokekar · Alec Koppel
- 2022 Spotlight: Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory
  Ruiqi Zhang · Xuezhou Zhang · Chengzhuo Ni · Mengdi Wang
- 2022 Spotlight: Optimal Estimation of Policy Gradient via Double Fitted Iteration
  Chengzhuo Ni · Ruiqi Zhang · Xiang Ji · Xuezhou Zhang · Mengdi Wang
- 2022 Poster: Scaling-up Diverse Orthogonal Convolutional Networks by a Paraunitary Framework
  Jiahao Su · Wonmin Byeon · Furong Huang
- 2022 Spotlight: Scaling-up Diverse Orthogonal Convolutional Networks by a Paraunitary Framework
  Jiahao Su · Wonmin Byeon · Furong Huang
- 2022 Poster: Sharpened Quasi-Newton Methods: Faster Superlinear Rate and Larger Local Convergence Neighborhood
  Qiujiang Jin · Alec Koppel · Ketan Rajawat · Aryan Mokhtari
- 2022 Spotlight: Sharpened Quasi-Newton Methods: Faster Superlinear Rate and Larger Local Convergence Neighborhood
  Qiujiang Jin · Alec Koppel · Ketan Rajawat · Aryan Mokhtari
- 2021 Poster: Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient
  Botao Hao · Yaqi Duan · Tor Lattimore · Csaba Szepesvari · Mengdi Wang
- 2021 Spotlight: Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient
  Botao Hao · Yaqi Duan · Tor Lattimore · Csaba Szepesvari · Mengdi Wang
- 2021 Poster: Bootstrapping Fitted Q-Evaluation for Off-Policy Inference
  Botao Hao · Xiang Ji · Yaqi Duan · Hao Lu · Csaba Szepesvari · Mengdi Wang
- 2021 Spotlight: Bootstrapping Fitted Q-Evaluation for Off-Policy Inference
  Botao Hao · Xiang Ji · Yaqi Duan · Hao Lu · Csaba Szepesvari · Mengdi Wang
- 2020: QA for Invited Talk 7 (Wang)
  Mengdi Wang
- 2020: Invited Talk 7 (Wang)
  Mengdi Wang
- 2020 Workshop: Theoretical Foundations of Reinforcement Learning
  Emma Brunskill · Thodoris Lykouris · Max Simchowitz · Wen Sun · Mengdi Wang
- 2020 Poster: Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound
  Lin Yang · Mengdi Wang
- 2020 Poster: Model-Based Reinforcement Learning with Value-Targeted Regression
  Alex Ayoub · Zeyu Jia · Csaba Szepesvari · Mengdi Wang · Lin Yang
- 2020 Poster: Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
  Yaqi Duan · Zeyu Jia · Mengdi Wang
- 2019 Poster: Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
  Lin Yang · Mengdi Wang
- 2019 Oral: Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
  Lin Yang · Mengdi Wang
- 2018 Poster: Estimation of Markov Chain via Rank-constrained Likelihood
  Xudong Li · Mengdi Wang · Anru Zhang
- 2018 Oral: Estimation of Markov Chain via Rank-constrained Likelihood
  Xudong Li · Mengdi Wang · Anru Zhang
- 2018 Poster: Scalable Bilinear π Learning Using State and Action Features
  Yichen Chen · Lihong Li · Mengdi Wang
- 2018 Oral: Scalable Bilinear π Learning Using State and Action Features
  Yichen Chen · Lihong Li · Mengdi Wang
- 2017 Poster: Strong NP-Hardness for Sparse Optimization with Concave Penalty Functions
  Yichen Chen · Dongdong Ge · Mengdi Wang · Zizhuo Wang · Yinyu Ye · Hao Yin
- 2017 Talk: Strong NP-Hardness for Sparse Optimization with Concave Penalty Functions
  Yichen Chen · Dongdong Ge · Mengdi Wang · Zizhuo Wang · Yinyu Ye · Hao Yin