Poster Thu, Jul 9, 2026 • 1:00 AM – 2:45 AM PDT HALL A #4601

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

Truong Nguyen ⋅ Tien-Phat Nguyen ⋅ Linh Van ⋅ Duy Nguyen ⋅ Khoa Doan ⋅ Trung Le

Abstract

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley–Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley–Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
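For context, the sequence-level baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the standard DPO logistic loss, not the TBPO objective, which this listing does not spell out. The function names, the plain-list inputs, and the default `beta` are assumptions made for the example. The point it shows is that per-token log-ratios are summed into a single sequence-level margin before the loss is applied, so no individual prefix receives its own preference signal.

```python
import math


def sequence_log_ratio(policy_token_logps, ref_token_logps):
    """Sum over tokens of log pi(y_t | x, y_<t) - log pi_ref(y_t | x, y_<t)."""
    if len(policy_token_logps) != len(ref_token_logps):
        raise ValueError("policy and reference must score the same tokens")
    return sum(p - r for p, r in zip(policy_token_logps, ref_token_logps))


def dpo_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref, beta=0.1):
    """Standard sequence-level DPO loss for one preference pair.

    Each argument is a list of per-token log-probabilities (hypothetical
    input format chosen for this sketch). beta=0.1 is an illustrative value.
    """
    margin = beta * (
        sequence_log_ratio(chosen_policy, chosen_ref)
        - sequence_log_ratio(rejected_policy, rejected_ref)
    )
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference on both responses, the margin is zero and the loss equals log 2. Raising the policy's log-probability on any token of the chosen response lowers the loss by the same amount regardless of where in the sequence that token sits, which is the sequence-level behavior that token-level methods aim to refine.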

Lay Summary

Modern AI assistants are often improved by showing them two possible answers and teaching them to prefer the better one. Most current methods judge each answer as a whole, even though the model actually writes one small piece of text at a time, so it can be hard to know which early choices helped or hurt the final answer. We developed TokenRatio, a training method that uses the same kind of “better answer vs. worse answer” data but turns it into guidance for each writing step. Instead of simply rewarding a whole response, our method compares how the better and worse answers are developing as they are written, helping the model make stronger local choices without using costly reinforcement-learning procedures. We also include safeguards so the method does not blame one small decision for differences that are really caused by the surrounding context. In experiments on instruction following, helpful and harmless dialogue, summarization, and reasoning benchmarks, TokenRatio improved alignment quality and training stability while preserving more diverse responses than strong baselines. This matters because better step-by-step training can make AI assistants more reliable, concise, and useful without adding much complexity.