Reward Collapse in Aligning Large Language Models: A Prompt-Aware Approach to Preference Rankings
Ziang Song · Tianle Cai · Jason Lee · Weijie Su
Event URL: https://openreview.net/forum?id=dpWxK6aqIK
The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences represented as rankings of responses to prompts. In this paper, we document the phenomenon of $\textit{reward collapse}$, an empirical observation where the prevailing ranking-based approach results in an $\textit{identical}$ reward distribution for diverse prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like ``write a short story about your best friend'' should yield a continuous range of rewards for their completions, while specific prompts like ``what is the capital city of New Zealand'' should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic setting. To overcome reward collapse, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
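To make the mechanism concrete, below is a minimal sketch (not taken from the paper) of a standard pairwise ranking objective and a hypothetical prompt-aware variant. The function names, the log-sigmoid utility, and the per-prompt sharpness parameter `gamma` are illustrative assumptions; the paper's actual prompt-aware utility functions may take a different form.

```python
import torch
import torch.nn.functional as F

def ranking_loss(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: shape (K,), reward-model scores for K responses to one prompt,
    # ordered from most preferred (index 0) to least preferred (index K-1).
    # Standard pairwise objective: sum of -log sigmoid(r_i - r_j) over i < j.
    # Because the loss depends only on reward *differences*, its minimizer is
    # the same reward distribution for every prompt -- the collapse above.
    K = rewards.shape[0]
    i, j = torch.triu_indices(K, K, offset=1)
    return -F.logsigmoid(rewards[i] - rewards[j]).sum()

def prompt_aware_ranking_loss(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    # Hypothetical prompt-aware variant: a per-prompt sharpness `gamma`
    # reshapes the utility, so the optimal reward distribution can differ
    # across prompts rather than collapsing to a single solution.
    K = rewards.shape[0]
    i, j = torch.triu_indices(K, K, offset=1)
    return -F.logsigmoid(gamma * (rewards[i] - rewards[j])).sum()

# Example: eight ranked completions for a single prompt; hypothetically,
# gamma could be chosen small for open-ended prompts and large for prompts
# with a single correct answer.
scores = torch.randn(8, requires_grad=True)
loss_open_ended = prompt_aware_ranking_loss(scores, gamma=0.5)
loss_closed_form = prompt_aware_ranking_loss(scores, gamma=5.0)
```

In the paper's framework, prompt-awareness enters through the choice of utility function itself rather than through a single scalar knob; the `gamma` parameter above is only the simplest stand-in for that idea.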
Author Information
Ziang Song (Stanford University)
Tianle Cai (Princeton University)
Jason Lee (Princeton University)
Weijie Su (University of Pennsylvania)
More from the Same Authors
- 2021: On the Convergence of Deep Learning with Differential Privacy
  Zhiqi Bu · Hua Wang · Qi Long · Weijie Su
- 2023: Teaching Arithmetic to Small Transformers
  Nayoung Lee · Kartik Sreenivasan · Jason Lee · Kangwook Lee · Dimitris Papailiopoulos
- 2023: Scaling In-Context Demonstrations with Structured Attention
  Tianle Cai · Kaixuan Huang · Jason Lee · Mengdi Wang · Danqi Chen
- 2023: Fine-Tuning Language Models with Just Forward Passes
  Sadhika Malladi · Tianyu Gao · Eshaan Nichani · Jason Lee · Danqi Chen · Sanjeev Arora
- 2023: Provable Offline Reinforcement Learning with Human Feedback
  Wenhao Zhan · Masatoshi Uehara · Nathan Kallus · Jason Lee · Wen Sun
- 2023: How to Query Human Feedback Efficiently in RL?
  Wenhao Zhan · Masatoshi Uehara · Wen Sun · Jason Lee
- 2023: 🎤 Fine-Tuning Language Models with Just Forward Passes
  Sadhika Malladi · Tianyu Gao · Eshaan Nichani · Alex Damian · Jason Lee · Danqi Chen · Sanjeev Arora
- 2023 Poster: Efficient displacement convex optimization with particle gradient descent
  Hadi Daneshmand · Jason Lee · Chi Jin
- 2023 Poster: Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning
  Yulai Zhao · Zhuoran Yang · Zhaoran Wang · Jason Lee
- 2023 Poster: Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings
  Masatoshi Uehara · Ayush Sekhari · Jason Lee · Nathan Kallus · Wen Sun
- 2023 Poster: Looped Transformers as Programmable Computers
  Angeliki Giannou · Shashank Rajput · Jy-yong Sohn · Kangwook Lee · Jason Lee · Dimitris Papailiopoulos
- 2023 Poster: Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing
  Jikai Jin · Zhiyuan Li · Kaifeng Lyu · Simon Du · Jason Lee
- 2023 Poster: The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent
  Lei Wu · Weijie Su
- 2022 Poster: ROCK: Causal Inference Principles for Reasoning about Commonsense Causality
  Jiayao Zhang · Hongming Zhang · Weijie Su · Dan Roth
- 2022 Spotlight: ROCK: Causal Inference Principles for Reasoning about Commonsense Causality
  Jiayao Zhang · Hongming Zhang · Weijie Su · Dan Roth
- 2021 Poster: Oneshot Differentially Private Top-k Selection
  Gang Qiao · Weijie Su · Li Zhang
- 2021 Spotlight: Oneshot Differentially Private Top-k Selection
  Gang Qiao · Weijie Su · Li Zhang
- 2021 Poster: Toward Better Generalization Bounds with Locally Elastic Stability
  Zhun Deng · Hangfeng He · Weijie Su
- 2021 Spotlight: Toward Better Generalization Bounds with Locally Elastic Stability
  Zhun Deng · Hangfeng He · Weijie Su
- 2020 Poster: Sharp Composition Bounds for Gaussian Differential Privacy via Edgeworth Expansion
  Qinqing Zheng · Jinshuo Dong · Qi Long · Weijie Su
- 2020 Poster: Towards Understanding the Dynamics of the First-Order Adversaries
  Zhun Deng · Hangfeng He · Jiaoyang Huang · Weijie Su