Poster in Workshop: Models of Human Feedback for AI Alignment
Reinforcement Learning from Human Text Feedback: Learning a Reward Model from Human Text Input
Belen Martin Urcelay · Andreas Krause · Giorgia Ramponi
We explore the use of human-generated text inputs to model rewards in Reinforcement Learning from Human Feedback (RLHF). Human text contains rich and nuanced information, yet most previous work relies on preference feedback or restricts the structure of the text. We propose using Large Language Models (LLMs) to harness the information in natural text and train a reward model efficiently. Our empirical evaluations demonstrate the advantages of this approach in both tabular and continuous reinforcement learning tasks. The results show that, even with minimal human interaction, integrating text feedback through LLMs enables our method to approximate the reward function accurately, leading to significant performance improvements.
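As a rough illustration of the general recipe the abstract describes, the following is a minimal sketch, not the authors' implementation: free-form text feedback is mapped to scalar labels by an LLM, and a small reward model is then fit on those labels. The `query_llm` placeholder, the feature dimension, and the toy data are assumptions made purely for illustration.

```python
# Minimal sketch (assumptions, not the authors' code): use an LLM to turn
# free-form human text comments into scalar reward labels, then fit a small
# reward model on (features, label) pairs.

import numpy as np
import torch
import torch.nn as nn

def query_llm(text_feedback: str) -> float:
    """Hypothetical LLM call: prompt a language model to map the human's
    free-form comment about a trajectory to a scalar reward in [-1, 1].
    Here it is only a stub returning a dummy value."""
    return 0.0

# Toy data: each human interaction yields (state-action features, text comment).
feedback_data = [
    (np.random.randn(8), "The agent wasted time circling the goal."),
    (np.random.randn(8), "Great, it reached the target directly."),
]

# 1) Convert each text comment into a scalar reward label via the LLM stub.
features = torch.tensor(np.stack([f for f, _ in feedback_data]), dtype=torch.float32)
labels = torch.tensor(
    [[query_llm(comment)] for _, comment in feedback_data], dtype=torch.float32
)

# 2) Fit a small reward model on the LLM-derived labels.
reward_model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for _ in range(200):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(reward_model(features), labels)
    loss.backward()
    optimizer.step()

# The learned reward model can then supply rewards to a standard RL algorithm.
```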