Beyond the Mean: Three-Axis Fidelity for Aligning LLM-Based Survey Simulators from Small Pilot Data
Abstract
Large language models (LLMs) are increasingly used to simulate social survey responses, yet their outputs exhibit systematic biases: marginal distributions are skewed, response variance is poorly calibrated, and predictor--outcome relationships are attenuated. We ask a simple question: given a small pilot sample of human responses, can an LLM-based simulator recover the response patterns of the broader population? Using a COVID-19 misinformation survey, we benchmark three families of approaches: prompting, prediction-powered inference (PPI) rectification, and parameter-efficient fine-tuning (PEFT). We decompose recovery along three axes: marginal fidelity, defined as cross-respondent distributional similarity; structural fidelity, defined as alignment in predictor--outcome relationships; and individual fidelity, defined as agreement on per-respondent summaries. PEFT with LoRA adapters and an MLP classifier head performs best on nearly all axes. These findings suggest that fine-tuning on small pilot samples offers a balanced approach to achieving multiple forms of fidelity.