Poster Mon, Jul 6, 2026 • 6:30 PM – 8:15 PM PDT HALL A #4401

Reward Shaping for (Inference-Time) Alignment: A Stackelberg Game Perspective

Haichuan Wang ⋅ Tao Lin ⋅ Lingkai Kong ⋅ Ce Li ⋅ Hezi Jiang ⋅ Milind Tambe

Abstract

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing the user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win–tie rates exceeding 66% against all baselines, averaged across evaluation settings.

Lay Summary

Reward-based alignment uses rewards to incentivize desirable behavior from an LLM and penalize undesirable behavior. In existing pipelines, a reward model provider, such as an alignment team, learns a user’s preferences and then directly uses those preferences to align the model’s behavior. However, we show that this approach is generally suboptimal because it does not account for how much the user’s preferences differ from the LLM’s original tendencies. Intuitively, when this disagreement is large, the reward signal may need to be strengthened to provide sufficient incentive for the model to change its behavior. We develop a game-theoretic model to study what reward model should be used for alignment, and propose modifying the reward during alignment through a procedure called reward shaping.
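
The intuition above can be illustrated with a toy sketch. It relies on the standard closed form for KL-regularized reward maximization over a finite set of outputs, where the optimal policy is proportional to the base policy times exp(reward / beta). The two-output setup, the numbers, and the uniform scaling factor of 4.0 are all hypothetical choices made for illustration; uniform scaling is only a stand-in and is not claimed to be the shaping scheme proposed in the paper.

```python
import math


def kl_regularized_policy(base_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_base) over a finite output set.

    The closed form is pi(y) proportional to pi_base(y) * exp(r(y) / beta).
    """
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy setup (not from the paper): two candidate outputs.
base_probs = [0.95, 0.05]  # the base policy strongly favors output 0
rewards = [0.0, 1.0]       # the user prefers output 1
beta = 1.0                 # KL regularization strength

# Using the learned reward as-is: the base policy's bias dominates.
plain = kl_regularized_policy(base_probs, rewards, beta)

# Strengthening the reward (arbitrary factor 4.0, for illustration only).
scaled = kl_regularized_policy(base_probs, [4.0 * r for r in rewards], beta)

print(f"P(preferred output), original reward: {plain[1]:.2f}")
print(f"P(preferred output), scaled reward:   {scaled[1]:.2f}")
```

In this toy case the preferred output receives a probability of about 0.13 under the original reward and about 0.74 under the scaled one, so the aligned policy follows the user's preference only once the reward is strong enough to outweigh the base policy's bias. As the abstract notes, amplification is not free: with an imperfect reward model, a larger scale also raises the risk of reward hacking.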