Workshop
Models of Human Feedback for AI Alignment
Thomas Kleine Buening · Harshit Sikchi · Christos Dimitrakakis · Scott Niekum · Constantin Rothkopf · Aadirupa Saha · Lirong Xia
Schubert 4 - 6
Fri 26 Jul, midnight PDT
Aligning AI agents with human intentions and values is one of the central challenges for the safe and ethical deployment of AI systems in the real world. Current approaches mostly rely on highly questionable assumptions about what observed human feedback or interactions actually mean, including assumptions about rationality in decision-making and belief formation, homogeneity of the population, and other restrictive assumptions about the feedback itself. Yet the role of such modeling assumptions has largely been neglected in the literature on AI alignment. This workshop brings together perspectives from disciplines beyond machine learning, including computational social choice, behavioral psychology, and economics, to exchange experiences and viewpoints on models of human feedback and their importance for human-AI alignment and collaboration.
Schedule
Fri 12:00 a.m. - 12:05 a.m. | Opening Remarks
Fri 12:05 a.m. - 12:50 a.m. | Invited Talk: Dylan Hadfield-Menell (Talk)
Fri 12:50 a.m. - 1:00 a.m. | AI Alignment with Changing and Influenceable Reward Functions (Oral) | Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan
Fri 1:00 a.m. - 1:10 a.m. | RLHF and IIA: Perverse Incentives (Oral) | Wanqiao Xu · Shi Dong · Xiuyuan Lu · Grace Lam · Zheng Wen · Benjamin Van Roy
Fri 1:15 a.m. - 2:00 a.m. | Invited Talk: Ariel Procaccia (Talk)
Fri 2:00 a.m. - 2:10 a.m. | MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences (Oral) | Souradip Chakraborty · Jiahao Qiu · Hui Yuan · Alec Koppel · Furong Huang · Dinesh Manocha · Amrit Singh Bedi · Mengdi Wang
Fri 2:10 a.m. - 2:20 a.m. | Modeling the Plurality of Human Preferences via Ideal Points (Oral) | Daiwei Chen · Yi Chen · Aniket Rege · Ramya Vinayak
Fri 2:20 a.m. - 2:30 a.m. | Prompt Optimization with Human Feedback (Oral) | Xiaoqiang Lin · Zhongxiang Dai · Arun Verma · See-Kiong Ng · Patrick Jaillet · Bryan Kian Hsiang Low
Fri 2:30 a.m. - 4:00 a.m. | Poster Session 1 & Lunch Break (Poster Session)
Fri 4:00 a.m. - 4:45 a.m. | Invited Talk: Tracy Liu (Talk)
Fri 4:45 a.m. - 4:55 a.m. | Preference Learning Algorithms Do Not Learn Preference Rankings (Oral) | Angelica Chen · Sadhika Malladi · Lily Zhang · Xinyi Chen · Richard Zhang · Rajesh Ranganath · Kyunghyun Cho
Fri 4:55 a.m. - 5:05 a.m. | Scalable Oversight by Accounting for Unreliable Feedback (Oral) | Shivam Singhal · Cassidy Laidlaw · Anca Dragan
Fri 5:05 a.m. - 5:50 a.m. | Invited Talk: David Lindner (Talk)
Fri 5:50 a.m. - 6:30 a.m. | Panel Discussion (Panel)
Fri 6:30 a.m. - 8:00 a.m. | Poster Session 2 & Coffee & Snacks (Poster Session)
Fri 8:00 a.m. - 8:00 a.m. | Learning the eye of the beholder: Statistical modeling and estimation for personalized color perception (Poster) | Xuanzhou Chen · Austin Xu · Jingyan Wang · Ashwin Pananjady
Fri 8:00 a.m. - 8:00 a.m. | Scalably Solving Assistance Games (Poster) | Cassidy Laidlaw · Eli Bronstein · Timothy Guo · Dylan Feng · Lukas Berglund · Justin Svegliato · Stuart Russell · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Off-Policy Evaluation from Logged Human Feedback (Poster) | Aniruddha Bhargava · Lalit Jain · Branislav Kveton · Ge Liu · Subhojyoti Mukherjee
Fri 8:00 a.m. - 8:00 a.m. | Preference Elicitation for Offline Reinforcement Learning (Poster) | Alizée Pace · Bernhard Schölkopf · Gunnar Ratsch · Giorgia Ramponi
Fri 8:00 a.m. - 8:00 a.m. | Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation (Poster) | Katie Collins · Najoung Kim · Yonatan Bitton · Verena Rieser · Shayegan Omidshafiei · Yushi Hu · Sherol Chen · Senjuti Dutta · Minsuk Chang · Kimin Lee · Youwei Liang · Georgina Evans · Sahil Singla · Gang Li · Adrian Weller · Junfeng He · Deepak Ramachandran · Krishnamurthy Dvijotham
Fri 8:00 a.m. - 8:00 a.m. | AI Alignment with Changing and Influenceable Reward Functions (Poster) | Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Concept-Based Interpretable Reinforcement Learning with Limited to No Human Labels (Poster) | Zhuorui Ye · Stephanie Milani · Fei Fang · Geoff Gordon
Fri 8:00 a.m. - 8:00 a.m. | Learning to Assist Humans without Inferring Rewards (Poster) | Vivek Myers · Evan Ellis · Benjamin Eysenbach · Sergey Levine · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback (Poster) | Sheng Xu · Bo Yue · Hongyuan Zha · Guiliang Liu
Fri 8:00 a.m. - 8:00 a.m. | Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input (Poster) | Belen Martin Urcelay · Andreas Krause · Giorgia Ramponi
Fri 8:00 a.m. - 8:00 a.m. | Language Alignment via Nash-learning and Adaptive feedback (Poster) | Ari Azarafrooz · Farshid Faal
Fri 8:00 a.m. - 8:00 a.m. | Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping (Poster) | Haoyu Wang · Guozheng Ma · Ziqiao Meng · Zeyu Qin · Li Shen · Zhong Zhang · Bingzhe Wu · Liu Liu · Yatao Bian · Tingyang Xu · Xueqian Wang · Peilin Zhao
Fri 8:00 a.m. - 8:00 a.m. | Multi-Agent Imitation Learning: Value is Easy, Regret is Hard (Poster) | Jingwu Tang · Gokul Swamy · Fei Fang · Steven Wu
Fri 8:00 a.m. - 8:00 a.m. | Efficient Inverse Reinforcement Learning without Compounding Errors (Poster) | Nicolas Espinosa Dice · Gokul Swamy · Sanjiban Choudhury · Wen Sun
Fri 8:00 a.m. - 8:00 a.m. | Revisiting Successor Features for Inverse Reinforcement Learning (Poster) | Arnav Kumar Jain · Harley Wiltzer · Jesse Farebrother · Irina Rish · Glen Berseth · Sanjiban Choudhury
Fri 8:00 a.m. - 8:00 a.m. | DPO Meets PPO: Reinforced Token Optimization for RLHF (Poster) | Han Zhong · Guhao Feng · Wei Xiong · Xinle Cheng · Li Zhao · Di He · Jiang Bian · Liwei Wang
Fri 8:00 a.m. - 8:00 a.m. | Models That Prove Their Own Correctness (Poster) | Noga Amit · Shafi Goldwasser · Orr Paradise · Guy Rothblum
Fri 8:00 a.m. - 8:00 a.m. | Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment (Poster) | Yuu Jinnai · Tetsuro Morimura · Kaito Ariu · Kenshi Abe
Fri 8:00 a.m. - 8:00 a.m. | PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling (Poster) | Utsav Singh · Wesley A. Suttle · Brian Sadler · Vinay Namboodiri · Amrit Singh Bedi
Fri 8:00 a.m. - 8:00 a.m. | AMBER: An Entropy Maximizing Environment Design Algorithm for Inverse Reinforcement Learning (Poster) | Paul Nitschke · Lars L. Ankile · Eura Nofshin · Siddharth Swaroop · Finale Doshi-Velez · Weiwei Pan
Fri 8:00 a.m. - 8:00 a.m. | Modeling the Plurality of Human Preferences via Ideal Points (Poster) | Daiwei Chen · Yi Chen · Aniket Rege · Ramya Vinayak
Fri 8:00 a.m. - 8:00 a.m. | Stochastic Concept Bottleneck Models (Poster) | Moritz Vandenhirtz · Sonia Laguna · Ričards Marcinkevičs · Julia Vogt
Fri 8:00 a.m. - 8:00 a.m. | Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms (Poster) | Rafael Rafailov · Yaswanth Chittepu · Ryan Park · Harshit Sikchi · Joey Hejna · William Knox · Chelsea Finn · Scott Niekum
Fri 8:00 a.m. - 8:00 a.m. | Informed Meta-Learning (Poster) | Katarzyna Kobalczyk · M van der Schaar
Fri 8:00 a.m. - 8:00 a.m. | DPM: Dual Preferences-based Multi-Agent Reinforcement Learning (Poster) | Sehyeok Kang · Yongsik Lee · Se-Young Yun
Fri 8:00 a.m. - 8:00 a.m. | Hummer: Towards Limited Competitive Preference Dataset (Poster) | Li Jiang · Yusen Wu · Junwu Xiong · Jingqing Ruan · Yichuan Ding · Qingpei Guo · zujie wen · JUN ZHOU · Xiaotie Deng
Fri 8:00 a.m. - 8:00 a.m. | Comparing Few to Rank Many: Optimal Design for Learning Preferences (Poster) | Kiran Thekumparampil · Gaurush Hiranandani · Kousha Kalantari · Shoham Sabach · Branislav Kveton
Fri 8:00 a.m. - 8:00 a.m. | MultiScale Policy Learning for Alignment with Long Term Objectives (Poster) | Richa Rastogi · Yuta Saito · Thorsten Joachims
Fri 8:00 a.m. - 8:00 a.m. | Is poisoning a real threat to LLM alignment? Maybe more so than you think (Poster) | Pankayaraj Pathmanathan · Souradip Chakraborty · Xiangyu Liu · Yongyuan Liang · Furong Huang
Fri 8:00 a.m. - 8:00 a.m. | Towards Aligning Language Models with Textual Feedback (Poster) | Saüc Abadal · Shehzaad Dhuliawala · Keerthiram Murugesan · Mrinmaya Sachan
Fri 8:00 a.m. - 8:00 a.m. | Bootstrapping Language Models with DPO Implicit Rewards (Poster) | Changyu Chen · Zichen Liu · Chao Du · Tianyu Pang · Qian Liu · Arunesh Sinha · Pradeep Varakantham · Min Lin
Fri 8:00 a.m. - 8:00 a.m. | Distributional Preference Alignment of LLMs via Optimal Transport (Poster) | Igor Melnyk · Youssef Mroueh · Brian Belgodere · Mattia Rigotti · Apoorva Nitsure · Mikhail Yurochkin · Kristjan Greenewald · Jiri Navratil · Jarret Ross
Fri 8:00 a.m. - 8:00 a.m. | Scalable Oversight by Accounting for Unreliable Feedback (Poster) | Shivam Singhal · Cassidy Laidlaw · Anca Dragan
Fri 8:00 a.m. - 8:00 a.m. | Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy (Poster) | Yangfan He · Yuxuan Bai · TIANYU SHI
Fri 8:00 a.m. - 8:00 a.m. | MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences (Poster) | Souradip Chakraborty · Jiahao Qiu · Hui Yuan · Alec Koppel · Furong Huang · Dinesh Manocha · Amrit Singh Bedi · Mengdi Wang
Fri 8:00 a.m. - 8:00 a.m. | Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences (Poster) | Taku Yamagata · Tobias Oberkofler · Timo Kaufmann · Viktor Bengs · Eyke Hüllermeier · Raul Santos-Rodriguez
Fri 8:00 a.m. - 8:00 a.m. | Filtered Direct Preference Optimization (Poster) | Tetsuro Morimura · Mitsuki Sakamoto · Yuu Jinnai · Kenshi Abe · Kaito Ariu
Fri 8:00 a.m. - 8:00 a.m. | Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment (Poster) | Zhaofeng Wu · Ananth Balashankar · Yoon Kim · Jacob Eisenstein · Ahmad Beirami
Fri 8:00 a.m. - 8:00 a.m. | Optimal Design for Human Feedback (Poster) | Subhojyoti Mukherjee · Anusha Lalitha · Kousha Kalantari · Aniket Anand Deshmukh · Ge Liu · Yifei Ma · Branislav Kveton
Fri 8:00 a.m. - 8:00 a.m. | Aligning Crowd Feedback via Distributional Preference Reward Modeling (Poster) | Dexun Li · Cong Zhang · Kuicai Dong · Derrick Goh Xin Deik · Ruiming Tang · Yong Liu
Fri 8:00 a.m. - 8:00 a.m. | Preference Learning Algorithms Do Not Learn Preference Rankings (Poster) | Angelica Chen · Sadhika Malladi · Lily Zhang · Xinyi Chen · Richard Zhang · Rajesh Ranganath · Kyunghyun Cho
Fri 8:00 a.m. - 8:00 a.m. | New Desiderata for Direct Preference Optimization (Poster) | Xiangkun Hu · Tong He · David Wipf
Fri 8:00 a.m. - 8:00 a.m. | Accelerating Best-of-N via Speculative Rejection (Poster) | Ruiqi Zhang · Momin Haider · Ming Yin · Jiahao Qiu · Mengdi Wang · Peter Bartlett · Andrea Zanette
Fri 8:00 a.m. - 8:00 a.m. | A Theoretical Framework for Partially Observed Reward-States in RLHF (Poster) | Chinmaya Kausik · Mirco Mutti · Aldo Pacchiano · Ambuj Tewari
Fri 8:00 a.m. - 8:00 a.m. | Weak-to-Strong Extrapolation Expedites Alignment (Poster) | Chujie Zheng · Ziqi Wang · Heng Ji · Minlie Huang · Nanyun Peng
Fri 8:00 a.m. - 8:00 a.m. | Inverse Reinforcement Learning from Demonstrations for LLM Alignment (Poster) | Hao Sun · M van der Schaar
Fri 8:00 a.m. - 8:00 a.m. | Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback (Poster) | Zhirui Chen · Vincent Tan
Fri 8:00 a.m. - 8:00 a.m. | Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization (Poster) | Hritik Bansal · Ashima Suvarna · Gantavya Bhatt · Nanyun Peng · Kai-Wei Chang · Aditya Grover
Fri 8:00 a.m. - 8:00 a.m. | RLHF and IIA: Perverse Incentives (Poster) | Wanqiao Xu · Shi Dong · Xiuyuan Lu · Grace Lam · Zheng Wen · Benjamin Van Roy
Fri 8:00 a.m. - 8:00 a.m. | Aligning Large Language Models with Representation Editing: A Control Perspective (Poster) | Lingkai Kong · Haorui Wang · Wenhao Mu · Yuanqi Du · Yuchen Zhuang · Yifei Zhou · Yue Song · Rongzhi Zhang · Kai Wang · Chao Zhang
Fri 8:00 a.m. - 8:00 a.m. | Cross-Domain Knowledge Transfer for RL via Preference Consistency (Poster) | Ting-Hsuan Huang · Ping-Chun Hsieh
Fri 8:00 a.m. - 8:00 a.m. | Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment (Poster) | Amin Memarian · Touraj Laleh · Irina Rish · Ardavan S. Nobandegani
Fri 8:00 a.m. - 8:00 a.m. | Generalizing Offline Alignment Theoretical Paradigm with Diverse Divergence Constraints (Poster) | Haoyuan Sun · Yuxin Zheng · Yifei Zhao · Yongzhe Chang · Xueqian Wang
Fri 8:00 a.m. - 8:00 a.m. | Adversarial Multi-dueling Bandits (Poster) | Pratik Gajane
Fri 8:00 a.m. - 8:00 a.m. | Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries (Poster) | Xuening Feng · Zhaohui Jiang · Timo Kaufmann · Eyke Hüllermeier · Paul Weng · Yifei Zhu
Fri 8:00 a.m. - 8:00 a.m. | REBEL: Reinforcement Learning via Regressing Relative Rewards (Poster) | Zhaolin Gao · Jonathan Chang · Wenhao Zhan · Owen Oertell · Gokul Swamy · Kianté Brantley · Thorsten Joachims · Drew Bagnell · Jason Lee · Wen Sun
Fri 8:00 a.m. - 8:00 a.m. | Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents (Poster) | David Hyland · Tomáš Gavenčiak · Lancelot Da Costa · Conor Heins · Vojtech Kovarik · Julian Gutierrez · Michael Wooldridge · Jan Kulveit
Fri 8:00 a.m. - 8:00 a.m. | Towards Safe Large Language Models for Medicine (Poster) | Tessa Han · Aounon Kumar · Chirag Agarwal · Himabindu Lakkaraju
Fri 8:00 a.m. - 8:00 a.m. | Query Design for Crowdsourced Clustering: Effect of Cognitive Overload and Contextual Bias (Poster) | Yi Chen · Ramya Vinayak
Fri 8:00 a.m. - 8:00 a.m. | "You just can’t go around killing people" Explaining Agent Behavior to a Human Terminator (Poster) | Uri Menkes · Ofra Amir · Assaf Hallak
Fri 8:00 a.m. - 8:00 a.m. | Prompt Optimization with Human Feedback (Poster) | Arunesh Sinha · See-Kiong Ng · Patrick Jaillet · Bryan Kian Hsiang Low · Xiaoqiang Lin · Zhongxiang Dai
- | Hummer: Towards Limited Competitive Preference Dataset (Oral) | Li Jiang · Yusen Wu · Junwu Xiong · Jingqing Ruan · Yichuan Ding · Qingpei Guo · zujie wen · JUN ZHOU · Xiaotie Deng