Position: Agentic Safety Is an Epistemic Property, Not a Behavioral One
Abstract
Contemporary AI safety is increasingly a full-stack discipline. It spans pretraining interventions, post-training alignment (instruction tuning, RLHF, and preference-optimization variants), and deployment-time controls (guardrails, monitoring, and red-teaming). This paper argues that, for self-improving agents, these efforts optimize the wrong primary target: behavioral compliance today rather than teachability tomorrow. Building on the utility-learning tension formalized by Wang et al., we contend that utility-driven self-modification can erode learnability itself, yielding structural incorrigibility as an emergent consequence of optimization. We therefore call for a shift in priorities from behavioral alignment to enforceable learnability floors that preserve long-run corrigibility under bounded intervention.