Poster in Workshop: Models of Human Feedback for AI Alignment
Generalizing Offline Alignment Theoretical Paradigm with Diverse Divergence Constraints
Haoyuan Sun · Yuxin Zheng · Yifei Zhao · Yongzhe Chang · Xueqian Wang
Abstract:
The growing capabilities of large language models (LLMs) make effective AI alignment increasingly important. Learning from preference-based feedback has recently emerged as a promising approach to aligning LLMs with human preferences. Although the resulting aligned models perform impressively across a range of tasks, these methods still lack a unified theoretical framework and a deeper theoretical understanding. In this work, we propose a unified theoretical paradigm for human preference-based optimization, Unified Preference Optimization (UPO), which we prove to be a generalization of $\Psi$PO. Because UPO subsumes existing practical algorithms, studying it yields a deeper theoretical understanding of them. We further examine the special case of UPO obtained by setting the mapping to the identity, which yields a novel practical algorithm, Identity Unified Preference Optimization (IUPO); we show that IUPO generalizes IPO under diverse divergence constraints. Experiments on fine-tuning GPT-2 show that IUPO with a Jensen-Shannon divergence constraint (JS-IUPO) outperforms IPO.
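For context, the $\Psi$PO objective (Azar et al., 2023) maximizes a mapping $\Psi$ of the preference probability subject to a KL constraint toward a reference policy. The sketch below, in which the KL term is replaced by a generic divergence $D$ (e.g., the Jensen-Shannon divergence used in the experiments), is an assumption inferred from the title and abstract rather than the paper's exact formulation:

$$
\max_{\pi}\ \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}\Big[\Psi\big(p^{*}(y \succ y' \mid x)\big)\Big] \;-\; \tau\, D\big(\pi \,\big\|\, \pi_{\mathrm{ref}}\big)
$$

Under this reading, $D = D_{\mathrm{KL}}$ recovers $\Psi$PO, taking $\Psi$ to be the identity together with $D = D_{\mathrm{KL}}$ recovers IPO, and the identity mapping combined with an arbitrary divergence $D$ (such as JS-divergence) would correspond to the IUPO setting described in the abstract.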