Improving LLM-Based Recommenders with Conservative Generative Flow Networks
Abstract
Generative Flow Networks (GFlowNets) have recently been used to improve diversity and mitigate popularity bias in LLM-based recommender systems, yet most training objectives are developed under online-style assumptions. In offline LLM-based recommendation, learning is constrained to a fixed logged dataset, which provides only partial support over token transitions on the dataset-induced token-prefix DAG. Naively applying Sub-Trajectory Balance (SubTB) in this setting is non-identifiable and can arbitrarily allocate probability mass to unsupported regions. We formalize this failure and identify three sources of non-identifiability that induce distributional shift between the dataset-implied policy and the learned policy: (i) flow overestimation, (ii) forward mass leakage, and (iii) backward compensation. To address them, we propose CFlower, which introduces a conservative SubTB objective that explicitly penalizes unsupported forward flow mass and combines it with dataset-constrained policy learning via on-policy sampling on the dataset-induced DAG, enabling efficient training under offline constraints. Experiments on three Amazon recommendation datasets show that CFlower improves distributional matching and delivers a stronger accuracy--exposure trade-off than prior GFlowNet and SFT baselines, while serving as a more reliable reference policy for downstream RL fine-tuning.
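To make the conservative objective concrete, the following is a minimal sketch of a SubTB residual augmented with a penalty on forward mass assigned to unsupported transitions. All names, the penalty form, and the inputs are our assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def conservative_subtb_loss(log_F, log_pf, log_pb, pf_all, support, lam=1.0):
    """Hypothetical conservative SubTB loss for one sub-trajectory s_m..s_n.

    log_F:   log state flows at the visited states, shape (T+1,)
    log_pf:  forward log-probs of the taken transitions, shape (T,)
    log_pb:  backward log-probs of the taken transitions, shape (T,)
    pf_all:  full forward action distribution at each visited state,
             shape (T, A), rows summing to 1
    support: boolean mask of dataset-supported actions, shape (T, A)
    lam:     weight of the conservative penalty (assumed hyperparameter)
    """
    # Standard SubTB residual:
    # log F(s_m) + sum log P_F - log F(s_n) - sum log P_B
    residual = log_F[0] + log_pf.sum() - log_F[-1] - log_pb.sum()
    subtb = residual ** 2
    # Conservative term: total forward probability mass placed on actions
    # that never appear in the logged dataset (unsupported transitions).
    leaked = (pf_all * (~support)).sum(axis=1)
    return subtb + lam * leaked.sum()

# Toy example: one transition, half the forward mass is unsupported.
log_F = np.array([0.0, 0.0])
log_pf = np.array([np.log(0.5)])
log_pb = np.array([0.0])
pf_all = np.array([[0.5, 0.5]])
support = np.array([[True, False]])
loss = conservative_subtb_loss(log_F, log_pf, log_pb, pf_all, support)
```

Under this (assumed) formulation, driving the penalty to zero forces the forward policy to concentrate on dataset-supported children of each prefix, which is the mechanism the abstract attributes to CFlower's conservative SubTB objective.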