Workshop
Models of Human Feedback for AI Alignment
Thomas Kleine Buening · Harshit Sikchi · Christos Dimitrakakis · Scott Niekum · Constantin Rothkopf · Aadirupa Saha · Lirong Xia
Schubert 4 - 6
Fri 26 Jul, midnight PDT
Aligning AI agents with human intentions and values is one of the main challenges standing in the way of the safe and ethical application of AI systems in the real world. Current approaches mostly rely on highly questionable assumptions about the meaning of observed human feedback or interactions: assumptions of rationality in decision-making and belief formation, homogeneity of the population, and other restrictive conditions on the feedback itself. Yet the role of such modeling assumptions has largely been neglected in the literature on AI alignment. In this workshop, we aim to bring together perspectives from disciplines beyond ML, including computational social choice, behavioral psychology, and economics, to share experiences and viewpoints on models of human feedback and their importance for human-AI alignment and collaboration.
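To make the kind of feedback assumption at issue concrete, the sketch below (our illustration, not part of the workshop materials) shows the Bradley-Terry / Boltzmann-rationality model that standard RLHF-style reward learning typically assumes: a single rationality parameter governs how faithfully a reward difference shows up in observed pairwise preferences. Function and parameter names here are illustrative only.

```python
# Minimal sketch of the Bradley-Terry / Boltzmann-rational preference model
# commonly assumed in RLHF reward learning. The annotator is treated as one
# homogeneous, noisily rational agent whose choice probability depends only
# on the reward difference scaled by a rationality parameter `beta`.
import math

def preference_probability(reward_a: float, reward_b: float, beta: float = 1.0) -> float:
    """P(A preferred to B) under the Bradley-Terry / Boltzmann-rational assumption."""
    return 1.0 / (1.0 + math.exp(-beta * (reward_a - reward_b)))

if __name__ == "__main__":
    # As beta grows the modeled annotator becomes perfectly rational;
    # with small beta the same feedback is treated as nearly pure noise.
    for beta in (0.1, 1.0, 10.0):
        print(beta, preference_probability(reward_a=1.0, reward_b=0.5, beta=beta))
```

Questioning exactly these ingredients, such as the single rationality parameter and the assumed homogeneity of annotators, is the focus of several of the talks and papers listed below.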
Schedule
Fri 12:00 a.m. - 12:05 a.m. | Opening Remarks
Fri 12:05 a.m. - 12:50 a.m. | Invited Talk: Dylan Hadfield-Menell (Talk)
Fri 12:50 a.m. - 1:00 a.m. | AI Alignment with Changing and Influenceable Reward Functions (Oral) | Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan
Fri 1:00 a.m. - 1:10 a.m. | RLHF and IIA: Perverse Incentives (Oral) | Wanqiao Xu · Shi Dong · Xiuyuan Lu · Grace Lam · Zheng Wen · Benjamin Van Roy
Fri 1:15 a.m. - 2:00 a.m. | Invited Talk: Ariel Procaccia (Talk)
Fri 2:00 a.m. - 2:10 a.m. | MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences (Oral) | Souradip Chakraborty · Jiahao Qiu · Hui Yuan · Alec Koppel · Furong Huang · Dinesh Manocha · Amrit Singh Bedi · Mengdi Wang
Fri 2:10 a.m. - 2:20 a.m. | Modeling the Plurality of Human Preferences via Ideal Points (Oral) | Daiwei Chen · Yi Chen · Aniket Rege · Ramya Vinayak
Fri 2:20 a.m. - 2:30 a.m. | Prompt Optimization with Human Feedback (Oral) | Xiaoqiang Lin · Zhongxiang Dai · Arun Verma · See-Kiong Ng · Patrick Jaillet · Bryan Kian Hsiang Low
Fri 2:30 a.m. - 4:00 a.m. | Poster Session 1 & Lunch Break (Poster Session)
Fri 4:00 a.m. - 4:45 a.m. | Invited Talk: Tracy Liu (Talk)
Fri 4:45 a.m. - 4:55 a.m. | Preference Learning Algorithms Do Not Learn Preference Rankings (Oral) | Angelica Chen · Sadhika Malladi · Lily Zhang · Xinyi Chen · Richard Zhang · Rajesh Ranganath · Kyunghyun Cho
Fri 4:55 a.m. - 5:05 a.m. | Scalable Oversight by Accounting for Unreliable Feedback (Oral) | Shivam Singhal · Cassidy Laidlaw · Anca Dragan
Fri 5:05 a.m. - 5:50 a.m. | Invited Talk: David Lindner (Talk)
Fri 5:50 a.m. - 6:30 a.m. | Panel Discussion (Panel)
Fri 6:30 a.m. - 8:00 a.m. | Poster Session 2 & Coffee & Snacks (Poster Session)
Fri 8:00 a.m. - 8:00 a.m. | Learning the eye of the beholder: Statistical modeling and estimation for personalized color perception (Poster) | Xuanzhou Chen · Austin Xu · Jingyan Wang · Ashwin Pananjady
Fri 8:00 a.m. - 8:00 a.m. | Scalably Solving Assistance Games (Poster) | Cassidy Laidlaw · Eli Bronstein · Timothy Guo · Dylan Feng · Lukas Berglund · Justin Svegliato · Stuart Russell · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Off-Policy Evaluation from Logged Human Feedback (Poster) | Aniruddha Bhargava · Lalit Jain · Branislav Kveton · Ge Liu · Subhojyoti Mukherjee
Fri 8:00 a.m. - 8:00 a.m. | Preference Elicitation for Offline Reinforcement Learning (Poster) | Alizée Pace · Bernhard Schölkopf · Gunnar Ratsch · Giorgia Ramponi
Fri 8:00 a.m. - 8:00 a.m. | Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation (Poster) | Katie Collins · Najoung Kim · Yonatan Bitton · Verena Rieser · Shayegan Omidshafiei · Yushi Hu · Sherol Chen · Senjuti Dutta · Minsuk Chang · Kimin Lee · Youwei Liang · Georgina Evans · Sahil Singla · Gang Li · Adrian Weller · Junfeng He · Deepak Ramachandran · Krishnamurthy Dvijotham (18 presenters)
Fri 8:00 a.m. - 8:00 a.m. | AI Alignment with Changing and Influenceable Reward Functions (Poster) | Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels (Poster) | Zhuorui Ye · Stephanie Milani · Fei Fang · Geoff Gordon
Fri 8:00 a.m. - 8:00 a.m. | Learning to Assist Humans without Inferring Rewards (Poster) | Vivek Myers · Evan Ellis · Benjamin Eysenbach · Sergey Levine · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback (Poster) | Sheng Xu · Bo Yue · Hongyuan Zha · Guiliang Liu
Fri 8:00 a.m. - 8:00 a.m. | Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input (Poster) | Belen Martin Urcelay · Andreas Krause · Giorgia Ramponi
Fri 8:00 a.m. - 8:00 a.m. | Language Alignment via Nash-learning and Adaptive feedback (Poster) | Ari Azarafrooz · Farshid Faal
Fri 8:00 a.m. - 8:00 a.m. | Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping (Poster) | Haoyu Wang · Guozheng Ma · Ziqiao Meng · Zeyu Qin · Li Shen · Zhong Zhang · Bingzhe Wu · Liu Liu · Yatao Bian · Tingyang Xu · Xueqian Wang · Peilin Zhao (12 presenters)
Fri 8:00 a.m. - 8:00 a.m. | Multi-Agent Imitation Learning: Value is Easy, Regret is Hard (Poster) | Jingwu Tang · Gokul Swamy · Fei Fang · Steven Wu
Fri 8:00 a.m. - 8:00 a.m. | Efficient Inverse Reinforcement Learning without Compounding Errors (Poster) | Nicolas Espinosa Dice · Gokul Swamy · Sanjiban Choudhury · Wen Sun
Fri 8:00 a.m. - 8:00 a.m. | Revisiting Successor Features for Inverse Reinforcement Learning (Poster) | Arnav Kumar Jain · Harley Wiltzer · Jesse Farebrother · Irina Rish · Glen Berseth · Sanjiban Choudhury
Fri 8:00 a.m. - 8:00 a.m. | DPO Meets PPO: Reinforced Token Optimization for RLHF (Poster) | Han Zhong · Guhao Feng · Wei Xiong · Xinle Cheng · Li Zhao · Di He · Jiang Bian · Liwei Wang
Fri 8:00 a.m. - 8:00 a.m. | Models That Prove Their Own Correctness (Poster) | Noga Amit · Shafi Goldwasser · Orr Paradise · Guy Rothblum
Fri 8:00 a.m. - 8:00 a.m. | Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment (Poster) | Yuu Jinnai · Tetsuro Morimura · Kaito Ariu · Kenshi Abe
Fri 8:00 a.m. - 8:00 a.m. | PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling (Poster) | Utsav Singh · Wesley A. Suttle · Brian Sadler · Vinay Namboodiri · Amrit Singh Bedi
Fri 8:00 a.m. - 8:00 a.m. | AMBER: An Entropy Maximizing Environment Design Algorithm for Inverse Reinforcement Learning (Poster) | Paul Nitschke · Lars L. Ankile · Eura Nofshin · Siddharth Swaroop · Finale Doshi-Velez · Weiwei Pan
Fri 8:00 a.m. - 8:00 a.m. | Modeling the Plurality of Human Preferences via Ideal Points (Poster) | Daiwei Chen · Yi Chen · Aniket Rege · Ramya Vinayak
Fri 8:00 a.m. - 8:00 a.m. | Stochastic Concept Bottleneck Models (Poster) | Moritz Vandenhirtz · Sonia Laguna · Ričards Marcinkevičs · Julia Vogt
Fri 8:00 a.m. - 8:00 a.m. | Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms (Poster) | Rafael Rafailov · Yaswanth Chittepu · Ryan Park · Harshit Sikchi · Joey Hejna · William Knox · Chelsea Finn · Scott Niekum
Fri 8:00 a.m. - 8:00 a.m. | Informed Meta-Learning (Poster) | Katarzyna Kobalczyk · M van der Schaar
Fri 8:00 a.m. - 8:00 a.m. | DPM: Dual Preferences-based Multi-Agent Reinforcement Learning (Poster) | Sehyeok Kang · Yongsik Lee · Se-Young Yun
Fri 8:00 a.m. - 8:00 a.m. | Hummer: Towards Limited Competitive Preference Dataset (Poster) | Li Jiang · Yusen Wu · Junwu Xiong · Jingqing Ruan · Yichuan Ding · Qingpei Guo · zujie wen · JUN ZHOU · Xiaotie Deng
Fri 8:00 a.m. - 8:00 a.m. | Comparing Few to Rank Many: Optimal Design for Learning Preferences (Poster) | Kiran Thekumparampil · Gaurush Hiranandani · Kousha Kalantari · Shoham Sabach · Branislav Kveton
Fri 8:00 a.m. - 8:00 a.m. | MultiScale Policy Learning for Alignment with Long Term Objectives (Poster) | Richa Rastogi · Yuta Saito · Thorsten Joachims
Fri 8:00 a.m. - 8:00 a.m. | Is poisoning a real threat to LLM alignment? Maybe more so than you think (Poster) | Pankayaraj Pathmanathan · Souradip Chakraborty · Xiangyu Liu · Yongyuan Liang · Furong Huang
Fri 8:00 a.m. - 8:00 a.m. | Towards Aligning Language Models with Textual Feedback (Poster) | Saüc Abadal · Shehzaad Dhuliawala · Keerthiram Murugesan · Mrinmaya Sachan
Fri 8:00 a.m. - 8:00 a.m. | Bootstrapping Language Models with DPO Implicit Rewards (Poster) | Changyu Chen · Zichen Liu · Chao Du · Tianyu Pang · Qian Liu · Arunesh Sinha · Pradeep Varakantham · Min Lin
Fri 8:00 a.m. - 8:00 a.m. | Distributional Preference Alignment of LLMs via Optimal Transport (Poster) | Igor Melnyk · Youssef Mroueh · Brian Belgodere · Mattia Rigotti · Apoorva Nitsure · Mikhail Yurochkin · Kristjan Greenewald · Jiri Navratil · Jarret Ross
Fri 8:00 a.m. - 8:00 a.m. | Scalable Oversight by Accounting for Unreliable Feedback (Poster) | Shivam Singhal · Cassidy Laidlaw · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy (Poster) | Yangfan He · Yuxuan Bai · TIANYU SHI
Fri 8:00 a.m. - 8:00 a.m. | MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences (Poster) | Souradip Chakraborty · Jiahao Qiu · Hui Yuan · Alec Koppel · Furong Huang · Dinesh Manocha · Amrit Singh Bedi · Mengdi Wang
Fri 8:00 a.m. - 8:00 a.m. | Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences (Poster) | Taku Yamagata · Tobias Oberkofler · Timo Kaufmann · Viktor Bengs · Eyke Hüllermeier · Raul Santos-Rodriguez
Fri 8:00 a.m. - 8:00 a.m. | Filtered Direct Preference Optimization (Poster) | Tetsuro Morimura · Mitsuki Sakamoto · Yuu Jinnai · Kenshi Abe · Kaito Ariu
Fri 8:00 a.m. - 8:00 a.m. | Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment (Poster) | Zhaofeng Wu · Ananth Balashankar · Yoon Kim · Jacob Eisenstein · Ahmad Beirami
Fri 8:00 a.m. - 8:00 a.m. | Optimal Design for Human Feedback (Poster) | Subhojyoti Mukherjee · Anusha Lalitha · Kousha Kalantari · Aniket Anand Deshmukh · Ge Liu · Yifei Ma · Branislav Kveton
Fri 8:00 a.m. - 8:00 a.m. | Aligning Crowd Feedback via Distributional Preference Reward Modeling (Poster) | Dexun Li · Cong Zhang · Kuicai Dong · Derrick Goh Xin Deik · Ruiming Tang · Yong Liu
Fri 8:00 a.m. - 8:00 a.m. | Preference Learning Algorithms Do Not Learn Preference Rankings (Poster) | Angelica Chen · Sadhika Malladi · Lily Zhang · Xinyi Chen · Richard Zhang · Rajesh Ranganath · Kyunghyun Cho
Fri 8:00 a.m. - 8:00 a.m. | New Desiderata for Direct Preference Optimization (Poster) | Xiangkun Hu · Tong He · David Wipf
Fri 8:00 a.m. - 8:00 a.m. | Accelerating Best-of-N via Speculative Rejection (Poster) | Ruiqi Zhang · Momin Haider · Ming Yin · Jiahao Qiu · Mengdi Wang · Peter Bartlett · Andrea Zanette
Fri 8:00 a.m. - 8:00 a.m. | A Theoretical Framework for Partially Observed Reward-States in RLHF (Poster) | Chinmaya Kausik · Mirco Mutti · Aldo Pacchiano · Ambuj Tewari
Fri 8:00 a.m. - 8:00 a.m. | Weak-to-Strong Extrapolation Expedites Alignment (Poster) | Chujie Zheng · Ziqi Wang · Heng Ji · Minlie Huang · Nanyun Peng
Fri 8:00 a.m. - 8:00 a.m. | Inverse Reinforcement Learning from Demonstrations for LLM Alignment (Poster) | Hao Sun · M van der Schaar
Fri 8:00 a.m. - 8:00 a.m. | Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback (Poster) | Zhirui Chen · Vincent Tan
Fri 8:00 a.m. - 8:00 a.m. | Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization (Poster) | Hritik Bansal · Ashima Suvarna · Gantavya Bhatt · Nanyun Peng · Kai-Wei Chang · Aditya Grover
Fri 8:00 a.m. - 8:00 a.m. | RLHF and IIA: Perverse Incentives (Poster) | Wanqiao Xu · Shi Dong · Xiuyuan Lu · Grace Lam · Zheng Wen · Benjamin Van Roy
Fri 8:00 a.m. - 8:00 a.m. | Aligning Large Language Models with Representation Editing: A Control Perspective (Poster) | Lingkai Kong · Haorui Wang · Wenhao Mu · Yuanqi Du · Yuchen Zhuang · Yifei Zhou · Yue Song · Rongzhi Zhang · Kai Wang · Chao Zhang
Fri 8:00 a.m. - 8:00 a.m. | Cross-Domain Knowledge Transfer for RL via Preference Consistency (Poster) | Ting-Hsuan Huang · Ping-Chun Hsieh
Fri 8:00 a.m. - 8:00 a.m. | Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment (Poster) | Amin Memarian · Touraj Laleh · Irina Rish · Ardavan S. Nobandegani
Fri 8:00 a.m. - 8:00 a.m. | Generalizing Offline Alignment Theoretical Paradigm with Diverse Divergence Constraints (Poster) | Haoyuan Sun · Yuxin Zheng · Yifei Zhao · Yongzhe Chang · Xueqian Wang
Fri 8:00 a.m. - 8:00 a.m. | Adversarial Multi-dueling Bandits (Poster) | Pratik Gajane
Fri 8:00 a.m. - 8:00 a.m. | Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries (Poster) | Xuening Feng · Zhaohui Jiang · Timo Kaufmann · Eyke Hüllermeier · Paul Weng · Yifei Zhu
Fri 8:00 a.m. - 8:00 a.m. | REBEL: Reinforcement Learning via Regressing Relative Rewards (Poster) | Zhaolin Gao · Jonathan Chang · Wenhao Zhan · Owen Oertell · Gokul Swamy · Kianté Brantley · Thorsten Joachims · Drew Bagnell · Jason Lee · Wen Sun
Fri 8:00 a.m. - 8:00 a.m. | Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents (Poster) | David Hyland · Tomáš Gavenčiak · Lancelot Da Costa · Conor Heins · Vojtech Kovarik · Julian Gutierrez · Michael Wooldridge · Jan Kulveit
Fri 8:00 a.m. - 8:00 a.m. | Towards Safe Large Language Models for Medicine (Poster) | Tessa Han · Aounon Kumar · Chirag Agarwal · Himabindu Lakkaraju
Fri 8:00 a.m. - 8:00 a.m. | Query Design for Crowdsourced Clustering: Effect of Cognitive Overload and Contextual Bias (Poster) | Yi Chen · Ramya Vinayak
Fri 8:00 a.m. - 8:00 a.m. | "You just can't go around killing people": Explaining Agent Behavior to a Human Terminator (Poster) | Uri Menkes · Ofra Amir · Assaf Hallak
Fri 8:00 a.m. - 8:00 a.m. | Prompt Optimization with Human Feedback (Poster) | Arunesh Sinha · See-Kiong Ng · Patrick Jaillet · Bryan Kian Hsiang Low · Xiaoqiang Lin · Zhongxiang Dai
- | Hummer: Towards Limited Competitive Preference Dataset (Oral) | Li Jiang · Yusen Wu · Junwu Xiong · Jingqing Ruan · Yichuan Ding · Qingpei Guo · zujie wen · JUN ZHOU · Xiaotie Deng