Invited Talk in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

Personalization and pluralistic alignment of LLMs via reinforcement learning fine-tuning

Natasha Jaques

2025

Abstract

A truly effective large language model (LLM) must hold conversations that are not only coherent and contextually appropriate but also adapted to the unique preferences and style of each user. However, existing approaches for training LLMs with Reinforcement Learning from Human Feedback (RLHF) assume a unimodal distribution of population preferences. Recent work has shown that users have diverse preferences and that, when existing RLHF techniques average over them, the result can be poor performance for all users. In this talk, I will describe three techniques for effectively personalizing RL fine-tuning to different users. First, Variational Preference Learning (VPL) embeds a user's past conversation history into a vector representation that conditions the reward model, enabling personalization. However, it can be challenging to train an LLM to compress a large amount of text into an embedding vector without losing performance. The second technique instead uses reinforcement learning to learn how to compress user histories into a compact text-based summary that conditions the reward model, enhancing both personalization and interpretability. Finally, we propose a new multi-turn RL curiosity objective for learning how to learn about the user during the conversation. Our model learns to ask questions that improve the accuracy of its model of the user, enabling it to generalize effectively to new users when vanilla multi-turn RLHF cannot. Together, these works make progress towards adaptive, personalized LLMs that work for all users, not just the average user.
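
The averaging problem mentioned in the abstract can be illustrated with a toy calculation. The sketch below is not taken from the talk: it assumes a Bradley-Terry preference model (a common choice for RLHF reward models), a single pair of responses A and B, and a hypothetical population split evenly between two groups with opposite preferences. All names and numbers are illustrative assumptions.

```python
import math


def bt_reward_gap(p_prefers_a: float) -> float:
    """Maximum-likelihood reward gap r(A) - r(B) under a Bradley-Terry model.

    Bradley-Terry sets P(A preferred over B) = sigmoid(r(A) - r(B)), so for a
    single pair the best-fitting gap is the logit of the observed preference rate.
    """
    return math.log(p_prefers_a / (1.0 - p_prefers_a))


# Hypothetical population (an assumption for illustration only):
# group name -> (share of users, probability that the group prefers response A)
groups = {
    "concise": (0.5, 0.9),
    "detailed": (0.5, 0.1),
}

# A single (unimodal) reward model only sees the pooled preference rate.
pooled_rate = sum(share * p for share, p in groups.values())
pooled_gap = bt_reward_gap(pooled_rate)

# A reward model conditioned on the user group can fit each group separately.
per_group_gap = {name: bt_reward_gap(p) for name, (_, p) in groups.items()}

print(f"pooled reward gap:    {pooled_gap:+.3f}")
for name, gap in per_group_gap.items():
    print(f"{name:>8} reward gap: {gap:+.3f}")
```

In this toy setting the pooled reward gap is approximately zero, so a single reward model is indifferent between the two responses, while the group-conditioned fits recover a strong preference in opposite directions. Conditioning the reward model on a user representation, whether an embedding or a text summary, is one way to recover that lost signal.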
