Directly Optimizing Natural Language Explanations for Behavioral Faithfulness: Simulatability and Recoverability
Abstract
Natural-language explanations are widely used to interpret machine learning models, yet many prioritize human plausibility over accurately reflecting or predicting model behavior. Prior approaches often rely on human-written rationales, producing post-hoc explanations that neither align with the model’s decision function nor generalize to unseen inputs. We introduce OPEX, a natural-language explanation model that directly optimizes for behavioral faithfulness: the ability of an explanation to reflect and predict a model’s observable input–output behavior. OPEX is trained with reinforcement learning using Group Relative Policy Optimization (GRPO), jointly optimizing two complementary metrics: recoverability, which measures whether an explanation recovers the model’s predictions on seen examples, and simulatability, which measures how well it predicts model behavior on unseen inputs. Across structured and text-based tasks, OPEX achieves high simulatability (∼0.85) and recoverability (∼0.99), outperforming GPT-4o, LLaMA-3.3-70B, and human-written explanations despite using an 8B-parameter backbone. Human user studies show a 15% improvement in classification accuracy over strong baselines.