Poster in Workshop on Theoretical Foundations of Foundation Models (TF2M)

Unavoidable Learning Constraints Alter the Foundations of Direct Preference Optimization

David Wipf


Abstract:

Large language models have typically relied on some form of reinforcement learning from human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, fine-tuning directly for human preferences is achieved by minimizing a single closed-form training objective, a process originally referred to as direct preference optimization (DPO). Although effective in certain real-world settings, we detail how the foundational DPO reparameterization no longer holds once inevitable learning constraints are introduced during model training. This motivates alternative derivations and analyses of DPO that remain intact even in the presence of such constraints. As initial steps in this direction, we re-derive DPO from a simple Gaussian estimation perspective, with strong ties to classical constrained optimization problems involving noise-adaptive, concave regularization.
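For context, the single closed-form training objective mentioned in the abstract is the standard DPO loss of Rafailov et al. (2023); the sketch below restates it in the usual notation, where $\pi_\theta$ is the policy being fine-tuned, $\pi_{\mathrm{ref}}$ is a fixed reference model, $(x, y_w, y_l)$ is a prompt with preferred and dispreferred responses, $\beta$ is a temperature parameter, and $\sigma$ is the logistic function. None of these symbols appear in the abstract itself.

% Standard DPO objective (Rafailov et al., 2023), restated for reference only.
% The underlying reparameterization r(x, y) = \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
% is what lets DPO sidestep a separately learned reward model.
\[
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
\]

This reparameterization presumes an unconstrained policy class; the abstract's point is that once practical learning constraints restrict $\pi_\theta$ during training, the equivalence with the original RLHF objective need no longer hold, which motivates the alternative Gaussian-estimation derivation described above.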
