Poster in Workshop on Theoretical Foundations of Foundation Models (TF2M)
SAIL: Self-improving Efficient Online Alignment of Large Language Models
Mucong Ding · Souradip Chakraborty · Vibhu Agrawal · Zora Che · Alec Koppel · Mengdi Wang · Amrit Singh Bedi · Furong Huang
Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. Offline RLHF methods rely on fixed preference datasets, which can lead to suboptimal performance, while existing online RLHF methods lack a unified conceptual formulation and suffer from distribution shift. We establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (via the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment. Alignment thus proceeds in an online, self-improving manner that recovers prior online RLHF methods as special cases. On open-source datasets, our method significantly improves alignment performance with minimal computational overhead.
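To make the reduction concrete, the following is a minimal sketch of the kind of bilevel-to-single-level reduction the abstract describes, assuming the standard KL-regularized RLHF objective with a Bradley-Terry preference model; the notation is illustrative and the paper's exact formulation may differ. The outer level fits a reward to preference data, and the inner level returns the KL-regularized optimal policy for that reward:

\[
\max_{r}\; \mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\big(r(x,y_w)-r(x,y_l)\big)\right]
\quad\text{s.t.}\quad
\pi_r^{*} = \arg\max_{\pi}\; \mathbb{E}_{x,\;y\sim\pi(\cdot\mid x)}\!\left[r(x,y)\right] - \beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right).
\]

The inner problem has the closed form \(\pi_r^{*}(y\mid x)\propto\pi_{\mathrm{ref}}(y\mid x)\exp\!\big(r(x,y)/\beta\big)\), which inverts to the reward-policy equivalence
\[
r(x,y)=\beta\log\frac{\pi_r^{*}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}+\beta\log Z(x).
\]
Substituting this into the outer objective cancels the log-partition term and yields a single-level, first-order-friendly objective in the policy alone:
\[
\max_{\pi}\; \mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log\sigma\!\Big(\beta\log\frac{\pi(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\right].
\]

In the online, self-improving setting described above, the responses \(y_w\) and \(y_l\) would be generated by the current policy and preference-labeled on the fly rather than drawn from a fixed dataset, which is what distinguishes this loop from offline DPO-style training.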