GRPO-Trained Memory Policies for Few-Shot Function Calling
Omar Florez
Abstract
Modern language-model agents must support a growing catalog of tools, yet every current approach to adding a new tool incurs either a large training cost (supervised fine-tuning) or a large per-query inference cost (in-context documentation). We propose a framework that trains memory read and write policies with Group Relative Policy Optimization (GRPO) over a frozen language model, disentangling representation learning from memory management. The controller learns when to extract tool descriptions from a handful of demonstrations, what compact structure to write into an external memory store, and which entries to retrieve at inference. The base model receives no gradients during any phase. The framework generalizes retrieval-augmented generation by replacing similarity-based reads and passive writes with learned policies trained from verifiable tool-use reward, and adds an action that records failure modes alongside descriptions. On three few-shot function-calling benchmarks (BFCL-v3, $\tau$-bench, NexusBench) under a strict held-out tool protocol, a 200M-parameter controller over a frozen Llama 3.1 8B model exceeds full supervised fine-tuning of the same backbone at $K{=}5$ demonstrations while using 40 times fewer trainable parameters and 10 times fewer inference tokens. An ablation shows that updating the base model's weights during training improves in-distribution accuracy but degrades few-shot generalization, providing direct empirical justification for the frozen-backbone commitment.
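To make the training signal concrete, the following minimal sketch illustrates the group-relative advantage that gives GRPO its name: several rollouts are sampled for the same query, each receives a verifiable reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the binary reward values are assumptions introduced for illustration; the abstract does not specify the reward design or any implementation detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar per rollout sampled for the same query.
    `eps` (an assumed constant) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts: 1.0 if the emitted tool call
# passed verification, 0.0 otherwise (illustrative values only).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient.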