The Viscosity of Logic: Phase Transitions and Hysteresis in DPO Alignment
Marco Pollanen
Abstract
Direct Preference Optimization (DPO) is often tuned as if increasing alignment pressure (controlled by $\beta$) yielded progressively “better” behavior. We instead treat $\beta$ as a control parameter and sweep it densely for three 7B open-weight model families under a fixed DPO recipe. In Mistral, capability is sharply non-monotonic in $\beta$: aggregated logic-probe margins become positive only in a narrow band near $\beta \approx 10^{-2}$ and revert outside it, with boundary points that are seed-sensitive. Across architectures under the same sweep, we observe qualitatively different response modes: sharp reorganization in Mistral, selective changes in Llama, and smooth trade-offs in Qwen. Critically, the DPO preference margin can anticorrelate with reasoning capability (Pearson $r = -0.91$ for Llama on logic probes), so margin-based selection can favor capability-impaired models. The training path also matters: exposure to high $\beta$ induces capability losses that persist even after $\beta$ is reduced (hysteresis). These findings motivate capability-resolved evaluation across the $\beta$ landscape rather than reliance on preference margins or aggregate benchmarks.
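For reference, the $\beta$ and “preference margin” above follow the standard DPO formulation; a minimal restatement, assuming the usual notation with a preferred/rejected completion pair $(y_w, y_l)$, policy $\pi_\theta$, and frozen reference policy $\pi_{\mathrm{ref}}$:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(m_\theta(x, y_w, y_l)\big)\Big],
\qquad
m_\theta(x, y_w, y_l) \;=\; \beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right).
$$

Here $m_\theta$ is taken to be the preference margin discussed above (an assumption; some reports quote the margin without the leading $\beta$). Larger $\beta$ scales up the implicit reward difference inside the sigmoid, which appears to be the sense in which this paper treats it as alignment pressure.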