Maximizing mutual information between prompt and response improves LLM performance with no additional data
Abstract
While post-training has successfully improved large language models across a variety of domains, from open-ended text generation to mathematics, these gains rely heavily on human-labeled data or external verifiers. Existing data has already been exploited, and new high-quality data is expensive to collect. More fundamentally, true intelligence goes far beyond tasks that are easily verified. There is therefore a need for self-improvement frameworks that allow models to improve without external oversight. We propose Mutual Information-based Preference Optimization (MIPO), a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioned on the correct prompt and a negative response conditioned on a random or incomplete prompt, and then trains with Direct Preference Optimization (DPO). We show that this procedure connects to maximizing the pointwise mutual information between prompts and model responses under the base policy. Empirical results with the Llama (1B, 3B) and Qwen (1.5B, 3B, 7B) Instruct models show that MIPO achieves 4-38\% improvements on personalization tasks from real-user datasets (PRISM, Community Alignment). Surprisingly, MIPO can also be applied more generally to a suite of benchmark tasks (e.g., math and multiple-choice question answering), yielding 3\% and 18\% improvements for the smaller 1B models, without any additional data or labels.
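As a minimal sketch of the claimed connection (our notation, not necessarily the paper's), the pointwise mutual information between a prompt $x$ and a response $y$ under the base policy $\pi_0$ can be written as
$$\mathrm{PMI}(x; y) \;=\; \log \frac{\pi_0(y \mid x)}{\pi_0(y)} \;\approx\; \log \frac{\pi_0(y \mid x)}{\pi_0(y \mid \tilde{x})},$$
where $\tilde{x}$ is a random or incomplete prompt and $\pi_0(y \mid \tilde{x})$ acts as a single-sample proxy for the marginal $\pi_0(y)$. Under this reading, preferring the response generated from the correct prompt over the one generated from $\tilde{x}$ pushes the trained policy toward responses with higher pointwise mutual information with their prompts.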