

Poster in Workshop: Workshop on Theoretical Foundations of Foundation Models (TF2M)

Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment

Jiaxiang Li · Siliang Zeng · Hoi To Wai · Chenliang Li · Alfredo Garcia · Mingyi Hong


Abstract:

Aligning with human preferences and values is important for contemporary foundation models. State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) consist of two stages: 1) supervised fine-tuning (SFT), where the model is fine-tuned to imitate human demonstration data; 2) preference learning, where preference data is used to learn a reward model, which is then used by a reinforcement learning (RL) step to fine-tune the model. In this work, we argue that the SFT stage benefits from learning a reward model as well. Instead of using the human demonstration data directly via supervised learning, we propose to leverage an Inverse RL (IRL) technique to build a reward model while learning the policy model. This approach leads to new SFT algorithms that are not only efficient to implement but also improve the model's ability to distinguish between preferred and non-preferred continuations. Our results indicate that it is beneficial to explicitly or implicitly leverage reward learning throughout the entire alignment process.
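
As a rough illustration of how reward learning can enter the SFT stage, consider the following maximum-entropy IRL sketch; the notation (reward $r_\phi$, reference policy $\pi_{\mathrm{ref}}$) and the specific objective are illustrative assumptions rather than the paper's exact formulation. Standard SFT maximizes the demonstration log-likelihood,
\[
\max_\theta\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\log \pi_\theta(y \mid x)\big],
\]
whereas an IRL-style SFT learns a reward $r_\phi$ whose induced policy $\pi_\phi(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\exp r_\phi(x,y)$ best explains the demonstrations,
\[
\max_\phi\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[r_\phi(x,y) - \log \mathbb{E}_{y'\sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[\exp r_\phi(x,y')\big]\Big],
\]
so that a reward model and a policy are obtained jointly from the same demonstration data.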
