Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
Abstract
Large language models (LLMs) are increasingly deployed as agents for decision-making (DM) in interactive and dynamic environments. However, LLMs are not originally designed for DM, and recent studies show that they struggle even in basic online DM settings. We introduce ITERATIVE REGRET-MINIMIZATION FINE-TUNING (ITERATIVE RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. Unlike prior methods that rely on distilling known algorithms or enforcing manually designed reasoning formats, our approach leverages regret as the training signal to elicit improved decision-making behavior while incorporating the model's own natural-language reasoning. Empirically, ITERATIVE RMFT improves DM performance across models, including numerical Transformers, lightweight open-weight LLMs, and the closed-weight model GPT-4o mini, and it generalizes across varying horizons, action spaces, reward processes, and natural-language-described DM scenarios. Overall, we position our approach as an initial exploration and call for more principled post-training paradigms for LLMs on DM tasks.
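To make the loop described above concrete, here is a minimal sketch of one plausible instantiation on multi-armed bandit tasks: roll out the current model, rank trajectories by regret, and fine-tune on the lowest-regret ones. The regret computed below is the standard external regret, the cumulative reward of the best fixed arm minus the agent's cumulative reward; everything else (`sample_task`, `model.rollout`, `model.finetune`, the `keep_frac` selection rule) is an illustrative assumption, not the paper's actual implementation.

```python
def regret(rewards_by_arm, chosen_arms):
    """Standard external regret: best fixed arm's cumulative reward
    minus the agent's realized cumulative reward.

    rewards_by_arm[t][a] is the realized reward of arm a at round t
    (for stochastic bandits one could use expected rewards instead).
    """
    T = len(chosen_arms)
    n_arms = len(rewards_by_arm[0])
    best_fixed = max(
        sum(rewards_by_arm[t][a] for t in range(T)) for a in range(n_arms)
    )
    earned = sum(rewards_by_arm[t][chosen_arms[t]] for t in range(T))
    return best_fixed - earned


def iterative_rmft(model, sample_task, n_iters=5, n_rollouts=64, keep_frac=0.1):
    """Hypothetical sketch of the iterative loop: roll out, rank trajectories
    by regret, and distill the low-regret ones (including their
    natural-language reasoning) back into the model."""
    for _ in range(n_iters):
        rollouts = []
        for _ in range(n_rollouts):
            task = sample_task()            # e.g., a fresh bandit instance
            traj = model.rollout(task)      # hypothetical: actions + reasoning text
            rollouts.append((regret(task.rewards, traj.actions), traj))
        rollouts.sort(key=lambda pair: pair[0])   # lowest regret first
        k = max(1, int(keep_frac * n_rollouts))
        elite = [traj for _, traj in rollouts[:k]]
        model = model.finetune(elite)       # hypothetical: supervised distillation
    return model
```

Under this reading, the selection step acts like rejection sampling with regret as the acceptance criterion, so no reference algorithm or hand-crafted reasoning template is ever supplied to the model.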