Normative Alignment for Agentic Integrity - Moving from Behavioral Guardrails to Principled Agency
Abstract
How can agents make safe autonomous decisions in complex, dynamic environments? While significant progress has been made in establishing post-training guardrails to enforce conversational compliance in generative models, these rigid constraints often prove brittle in open-world environments. I argue that achieving generalizable agentic safety requires Normative Alignment: a new paradigm that moves beyond passive harm avoidance to equip autonomous systems with Agentic Integrity. This approach provides agents with the structural capability to interpret, reason through, and dynamically apply abstract principles when literal instructions fail. Realizing this paradigm presents a triple challenge of capability, measurement, and governance. First, it requires shifting model capability from generic reward maximization toward normative competence: the contextual reasoning needed to adjudicate complex trade-offs in non-verifiable domains. Second, it demands new metrics that move optimization targets beyond immediate preference satisfaction toward long-term human well-being. Third, it requires deliberative governance that avoids top-down paternalism by grounding alignment targets in pluralistic, representative societal input.
Speaker