Poster in Workshop: Models of Human Feedback for AI Alignment
Towards Aligning Language Models with Textual Feedback
Saüc Abadal · Shehzaad Dhuliawala · Keerthiram Murugesan · Mrinmaya Sachan
We present ALT (ALignment with Textual feedback), an approach that aligns language models with user preferences expressed in text. We posit that text offers users a richer interface for providing feedback than comparative preferences. In our work, we explore the efficacy and efficiency of textual feedback across several tasks. For the task of reducing model toxicity, we show that even rule-based feedback can reduce model toxicity 62% more than PPO in-domain and 52% more out-of-domain. For the task of summarization, we show that ALT can match the performance of PPO with only 20% of the training samples, both in- and out-of-domain. Finally, for the task of aligning dialogue to be harmless and helpful, we find that ALT can effectively use textual feedback provided by a Large Language Model without the need for a reward model.