

Poster in Workshop: Models of Human Feedback for AI Alignment

Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input

Belen Martin Urcelay · Andreas Krause · Giorgia Ramponi

Fri 26 Jul, 8 a.m. PDT

Abstract:

We explore the use of human-generated text to model rewards in Reinforcement Learning from Human Feedback (RLHF). Human text carries rich and nuanced information, yet most previous work relies on preference feedback or restricts the structure of the text. We propose using Large Language Models (LLMs) to harness the information in natural text and train a reward model efficiently. Our empirical evaluations demonstrate the advantages of this approach in both tabular and continuous reinforcement learning tasks. Even with minimal human interaction, integrating text feedback through an LLM allows our method to approximate the reward function accurately, leading to significant performance improvements.
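To illustrate the general idea described in the abstract, the sketch below shows one way text feedback could be turned into reward labels and used to fit a reward model. This is not the authors' implementation: the keyword-based `llm_text_to_reward` function is a hypothetical stand-in for an actual LLM query, and the toy data, `RewardModel` architecture, and training loop are illustrative assumptions only.

```python
# Minimal sketch: convert free-form human text feedback into scalar reward
# labels (here via a keyword-based stand-in for an LLM call) and regress a
# small reward model onto those labels. All names and data are illustrative.

import numpy as np
import torch
import torch.nn as nn


def llm_text_to_reward(feedback: str) -> float:
    """Stand-in for an LLM query that maps natural-language feedback to a
    scalar. In practice, an LLM would be prompted to rate how favourably the
    text describes the agent's behaviour."""
    positive = ("good", "great", "correct", "well done")
    negative = ("bad", "wrong", "avoid", "poor")
    score = sum(w in feedback.lower() for w in positive)
    score -= sum(w in feedback.lower() for w in negative)
    return float(np.clip(score, -1.0, 1.0))


class RewardModel(nn.Module):
    """Small MLP mapping state-action features to a predicted reward."""

    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy dataset: random state-action features with text feedback attached.
    feats = rng.normal(size=(64, 4)).astype(np.float32)
    texts = [
        "good move, well done" if f[0] > 0 else "bad choice, avoid this"
        for f in feats
    ]
    labels = torch.tensor([llm_text_to_reward(t) for t in texts])

    model = RewardModel(in_dim=4)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    x = torch.from_numpy(feats)
    for _ in range(200):  # regress predicted rewards onto the LLM-derived labels
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), labels)
        loss.backward()
        opt.step()
    print(f"final MSE: {loss.item():.4f}")
```

In an RLHF pipeline, the fitted reward model would then replace the unknown environment reward when optimizing the policy; the sketch stops at reward-model fitting, which is the step the abstract focuses on.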
