Inference-time Alignment with Rewards in Besov Spaces: Provable Advantages of Feature Learning and Multi-Step Policy Updates
Abstract
Inference-time alignment, the approach of adapting pre-trained models to reward feedback during inference, has proven empirically effective at improving language-model performance. Despite this success, its theoretical foundations remain underdeveloped, especially in practical settings where neural networks serve as reward models. In this paper, we study the advantages of neural networks and how to train them effectively for inference-time alignment. Assuming that the true reward function lies in a Besov space, which captures non-uniform smoothness, we compare neural networks with linear estimators and show that the feature-learning capability of neural networks is crucial for improving performance. We further analyze algorithms for training neural-network reward estimators. Specifically, we consider a multi-step algorithm that alternates between sampling from the current policy and refitting the reward estimator, and we prove that it improves the regret, especially when the true reward exhibits local structure.
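As a concrete illustration of the multi-step scheme sketched above, the following toy Python example alternates between sampling from the current policy and refitting a small neural reward estimator. It is a minimal sketch under illustrative assumptions, not the paper's algorithm: the one-dimensional reward `r_star` with spiky local structure, the helper names `fit_reward` and the exponential-tilting policy update are all hypothetical choices made for this example.

```python
# A minimal sketch of the multi-step loop described above, NOT the paper's exact
# algorithm: r_star, fit_reward, and the tilting update are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

def r_star(x):
    # Toy ground-truth reward with local (spiky) structure, as in the Besov setting.
    return np.exp(-200 * (x - 0.3) ** 2) + 0.5 * np.exp(-50 * (x - 0.7) ** 2)

def fit_reward(xs, ys, epochs=200):
    """Refit a small MLP reward estimator on freshly sampled data."""
    net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    X = torch.tensor(xs, dtype=torch.float32).unsqueeze(1)
    Y = torch.tensor(ys, dtype=torch.float32).unsqueeze(1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(X), Y)
        loss.backward()
        opt.step()
    return net

grid = np.linspace(0.0, 1.0, 512)
log_policy = np.zeros_like(grid)          # start from the uniform (pre-trained) policy
rng = np.random.default_rng(0)

for step in range(5):                     # multi-step policy updates
    p = np.exp(log_policy - log_policy.max())
    p /= p.sum()
    xs = rng.choice(grid, size=256, p=p)  # sample from the *current* policy
    ys = r_star(xs) + 0.05 * rng.standard_normal(xs.shape)  # noisy reward feedback
    net = fit_reward(xs, ys)              # refit the neural reward estimator
    with torch.no_grad():
        r_hat = net(torch.tensor(grid, dtype=torch.float32).unsqueeze(1)).squeeze(1).numpy()
    log_policy = log_policy + 4.0 * r_hat # KL-regularized exponential-tilting update

print("policy mode:", grid[np.argmax(log_policy)])  # drifts toward the reward peak near 0.3
```

Because each round resamples from the updated policy, later rounds concentrate data near high-reward regions, which is where the abstract's claim about local structure becomes relevant: the estimator can spend its capacity resolving the spiky part of the reward.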