Discrete Tilt Matching
Abstract
Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) algorithms have been adapted for fine-tuning dLLMs, they rely on evaluating policy objectives through the marginal likelihood, which is intractable to compute. To overcome this, we exploit a dynamical relation between the unmasking posterior of the base model and the posterior that targets the reward-tilted distribution to derive Discrete Tilt Matching (DTM), an algorithm that avoids intractable likelihood evaluation entirely. DTM can be phrased as a cross-entropy loss that requires only forward evaluation of rewards and whose variance can be adaptively controlled, improving training stability. We motivate DTM on maze planning tasks, and show that fine-tuning LLaDA-8B-Instruct with DTM achieves higher accuracy at lower compute cost than prior RL-based fine-tuning methods across the Sudoku, Countdown, and MATH500 benchmarks.
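As a rough sketch of the target referred to above (the abstract does not state DTM's exact formulation, so this standard form from KL-regularized RL fine-tuning is an assumption): the reward-tilted distribution is typically the base model reweighted by the exponentiated reward,
\[
  \pi^{*}(x) \;\propto\; \pi_{0}(x)\,\exp\!\bigl(r(x)/\beta\bigr),
\]
where \(\pi_{0}\) denotes the base model, \(r\) the reward, and \(\beta\) a temperature; the normalizing constant involves a sum over all sequences, which is why methods requiring (marginal) likelihoods of such targets become intractable for dLLMs.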