PsychoAlign: Interpretable AI Alignment through Negotiated Multi-Agent Architecture
Abstract
Current AI alignment approaches such as RLHF, Constitutional AI, and DPO embed safety as a monolithic constraint in a single model, offering limited interpretability when alignment decisions are genuinely contested. We propose PsychoAlign, a tripartite multi-agent architecture that decomposes alignment into three asymmetric roles: a Drive agent that maximizes utility without ethical constraints, an Evaluator that enforces compliance against swappable moral frameworks, and a Mediator that arbitrates between them via the A2A protocol. This decomposition enables a capability that monolithic systems structurally cannot provide: alignment boundary detection. The inter-agent tension T(r) automatically maps alignment difficulty: competing-values dilemmas generate a mean T = 0.360, the highest across domains, with test-retest reliability ICC = 0.891, indicating that tension is a stable property of request content. On quality benchmarks, PsychoAlign achieves an MT-Bench score of 0.852, exceeding Constitutional AI (0.792) and outperforming a role-ablated no-asymmetry baseline by 0.110, which indicates that asymmetric decomposition adds measurable value beyond generic multi-model pipelines. We discuss limitations, including a capability-safety tradeoff on adversarial benchmarks and a defense-mechanism selection gap that motivates future work. All code, prompts, and negotiation traces are publicly released.
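
As an illustration of how a test-retest reliability figure such as the ICC above could be computed from repeated tension measurements, the following is a minimal sketch. It assumes the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1); the abstract does not state which ICC form was used. The function name `icc_2_1` and the toy tension values are hypothetical and are not taken from the paper's data or released code.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table of n requests (rows) by k repeated runs (columns).

    Assumption: two-way random effects, absolute agreement, single measurement.
    The ICC form actually used in the paper is not stated in the abstract.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: requests (rows), runs (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-request mean square
    msc = ss_cols / (k - 1)              # between-run mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical tension scores T(r) for four requests, each measured in two runs.
toy_tension = [
    [0.10, 0.12],
    [0.30, 0.27],
    [0.50, 0.53],
    [0.36, 0.35],
]
print(f"ICC(2,1) on toy data: {icc_2_1(toy_tension):.3f}")
```

A value near 1 means most of the variance in T(r) comes from differences between requests rather than from run-to-run noise, which is the sense in which tension would be a stable property of request content.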