

Poster in Workshop: Automated Reinforcement Learning: Exploring Meta-Learning, AutoML, and LLMs

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Shenao Zhang · Donghan Yu · Hiteshi Sharma · Ziyi Yang · Shuohang Wang · Hany Hassan Awadalla · Zhaoran Wang

Sat 27 Jul 1 a.m. PDT — 2 a.m. PDT

Abstract:

Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective that is optimistically biased towards potentially high-reward responses, so that the model actively explores out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate reward model (RM) and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces the indiscriminate favoring of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as on various standard academic benchmarks in different settings.
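
To make the comparison with DPO concrete, here is a minimal Python sketch for a single preference pair. The DPO loss and its implicit (reparameterized) reward, beta * (log pi(y|x) - log pi_ref(y|x)), follow the standard DPO formulation. The abstract does not state the exact form of SELM's optimism term, so `optimistic_loss`, its `alpha` weight, and the choice to apply the bonus to the chosen response are illustrative assumptions rather than the authors' objective.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO's reparameterized reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    return -log_sigmoid(margin)


def optimistic_loss(logp_w: float, logp_l: float,
                    ref_logp_w: float, ref_logp_l: float,
                    beta: float = 0.1, alpha: float = 0.01) -> float:
    """Hypothetical optimism-biased variant (an assumption, not SELM's exact form).

    Subtracts alpha times the implicit reward of the chosen response, so that
    minimizing the loss also pushes probability mass towards responses the
    implicit reward currently rates highly.
    """
    bonus = implicit_reward(logp_w, ref_logp_w, beta)
    return dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta) - alpha * bonus


if __name__ == "__main__":
    # Toy sequence log-probabilities; the values are made up for illustration.
    args = dict(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
    print("DPO loss:       ", dpo_loss(**args))
    print("Optimistic loss:", optimistic_loss(**args, alpha=0.5))
```

With `alpha = 0` the sketch reduces to plain DPO, so `alpha` is the only knob controlling how strongly the update is biased towards exploration.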
