PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration
Arnav Raj
Abstract
Reward models for Reinforcement Learning from Human Feedback (RLHF) pool preferences across thousands of annotators and fit a single global affine calibrator. This collapses raters with systematically different rating-scale offsets and slopes into one average rater who is nobody in particular. PEBS is a per-rater empirical-Bayes shrinkage estimator: it fits a per-rater affine calibrator on a held-out slice of each annotator's ratings, then applies Morris-James-Stein empirical-Bayes shrinkage toward the population mean, in closed form and without retraining the reward model. On PRISM, PEBS reduces within-user held-out RMSE by $\mathbf{8.58\%}$ relative to the production pop-slope baseline. The procedure replicates on PluriHarms harm ratings with a $\mathbf{+9.66\%}$ in-family gain. Because PEBS is post hoc, the reward base model is left unchanged; only the rater-level affine map applied at inference time to new ratings is estimated.
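As a rough illustration of the kind of procedure the abstract describes, the sketch below fits an ordinary-least-squares affine map per rater and shrinks each coefficient toward the across-rater mean with a simple method-of-moments empirical-Bayes weight. This is not the paper's exact estimator: the function names (`fit_affine`, `eb_shrink`, `pebs_calibrators`), the input layout, the independent shrinkage of intercept and slope, and the moment-based estimate of the between-rater variance are all assumptions made for illustration.

```python
import numpy as np

def fit_affine(x, y):
    """OLS fit y ~ a + b*x for one rater; returns (a, b, var_a, var_b)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(x.size), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / max(x.size - 2, 1)   # residual variance
    cov = sigma2 * np.linalg.pinv(X.T @ X)        # sampling covariance of (a, b)
    return coef[0], coef[1], cov[0, 0], cov[1, 1]

def eb_shrink(est, var):
    """Shrink per-rater estimates toward their mean (assumes >= 2 raters).

    Assumed moment estimator: between-rater variance tau2 is the sample
    variance of the estimates minus the mean sampling variance, floored at 0.
    """
    est = np.asarray(est, dtype=float)
    var = np.asarray(var, dtype=float)
    mu = est.mean()
    tau2 = max(est.var(ddof=1) - var.mean(), 0.0)
    # B = 1 means full shrinkage to mu, B = 0 means keep the raw estimate.
    B = var / np.maximum(var + tau2, 1e-12)
    return B * mu + (1.0 - B) * est

def pebs_calibrators(held_out):
    """held_out: dict rater_id -> (reward_scores, human_ratings)."""
    ids = list(held_out)
    fits = np.array([fit_affine(*held_out[r]) for r in ids])
    a = eb_shrink(fits[:, 0], fits[:, 2])
    b = eb_shrink(fits[:, 1], fits[:, 3])
    return {r: (a[i], b[i]) for i, r in enumerate(ids)}
```

At inference time, a new reward score `s` for rater `r` would be mapped to `a_r + b_r * s` using the shrunk pair returned for that rater. Raters with few or noisy held-out ratings have large sampling variance and are pulled strongly toward the population values, while well-measured raters keep most of their individual offset and slope.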