

Poster in Workshop: Models of Human Feedback for AI Alignment

Towards Aligning Language Models with Textual Feedback

Saüc Abadal · Shehzaad Dhuliawala · Keerthiram Murugesan · Mrinmaya Sachan

Fri 26 Jul 8 a.m. PDT — 8 a.m. PDT

Abstract:

We present ALT (ALignment with Textual feedback), an approach that aligns models toward user preferences expressed in text. We posit that text offers users a richer interface for providing feedback than comparative preferences. In our work, we explore the efficacy and efficiency of textual feedback across several tasks. For the task of reducing model toxicity, we show that even rule-based feedback can reduce model toxicity 62% more than PPO in-domain and 52% more out-of-domain. For the task of summarization, we show that ALT can match the performance of PPO with only 20% of the training samples, both in- and out-of-domain. Finally, for the task of aligning dialog to be harmless and helpful, we find that ALT can effectively use textual feedback provided by a Large Language Model without the need for a reward model.
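To make the idea of rule-based textual feedback concrete, the sketch below shows one way a scalar toxicity score could be turned into a short feedback sentence and prepended to a prompt for feedback-conditioned fine-tuning. The scoring thresholds, feedback phrases, and prompt template are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not the ALT implementation): map a rule-based
# toxicity score to textual feedback and build a feedback-conditioned example.

from dataclasses import dataclass


@dataclass
class FeedbackExample:
    """A (feedback + prompt, completion) pair for conditional fine-tuning."""
    conditioned_prompt: str
    completion: str


def toxicity_to_feedback(toxicity: float) -> str:
    """Map a toxicity score in [0, 1] to a short textual feedback string (thresholds assumed)."""
    if toxicity < 0.1:
        return "The response is polite and non-toxic."
    if toxicity < 0.5:
        return "The response is mildly toxic; reduce offensive language."
    return "The response is highly toxic; rewrite it to be respectful."


def build_example(prompt: str, completion: str, toxicity: float) -> FeedbackExample:
    """Prepend the textual feedback to the prompt so the model learns to condition on it."""
    feedback = toxicity_to_feedback(toxicity)
    conditioned = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse:"
    return FeedbackExample(conditioned_prompt=conditioned, completion=" " + completion)


if __name__ == "__main__":
    example = build_example(
        prompt="Describe your coworker.",
        completion="They are thoughtful and always willing to help.",
        toxicity=0.03,
    )
    print(example.conditioned_prompt)
    print(example.completion)
```

The same pattern extends to the dialog setting described above: the feedback string would come from an LLM critique of the response rather than a rule-based scorer, with no separate reward model required.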
