Convex Optimization for Alignment and Preference Learning on a Single GPU
Miria Feng ⋅ Mert Pilanci
Abstract
Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but suffers from inconsistent ranking accuracy, heavy dependence on expensive GPU resources, and sensitivity to hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex reformulation of neural networks, COALA eliminates the need for a reference model and achieves significant reductions in both training time and VRAM consumption, enabling efficient training on a single GPU. Experiments across three datasets (including a 23,228-sample synthetic Educational Feedback dataset) and five models (including LLaMA-8B) demonstrate COALA's competitive performance and efficiency, using as little as ${\approx}17.6\%$ of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly less time than traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.
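For context, the reference-model dependence that COALA removes is visible in the standard DPO objective (Rafailov et al., 2023), reproduced below as background rather than as COALA's own formulation. Here $\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ the frozen reference policy whose storage and forward passes COALA avoids, $\sigma$ the logistic function, $\beta$ a temperature hyperparameter, and $(x, y_w, y_l) \sim \mathcal{D}$ a prompt with its preferred and dispreferred responses:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right].
$$

Maintaining $\pi_{\mathrm{ref}}$ roughly doubles the memory footprint and adds a forward pass per example, which is one source of the VRAM and TFLOPs savings claimed above.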