Towards Complete Multi-Agent Coordination Policy Learning via Denoising Maximum Entropy Optimization
Abstract
Parameter sharing is a widely used technique in Multi-Agent Reinforcement Learning (MARL) that improves sample efficiency by equipping all agents with a single policy network. While effective in homogeneous settings, it often struggles in heterogeneous environments where agents possess diverse capabilities. Conversely, learning a customized policy for each agent resolves the knowledge conflicts that arise under sharing, but it severely hinders knowledge transfer across agents and thereby reduces learning efficiency. Existing approaches attempt to balance this trade-off with clustering or agent-specific masks, but they typically rely on strong environment-specific priors and struggle when the team's behavior is multi-modal. To address these limitations, we propose Dspic, an efficient shared-policy algorithm grounded in the maximum entropy framework. Specifically, Dspic employs self-supervised learning to extract a discriminative role embedding for each agent. These embeddings induce a complete division of the observation space, which provides a theoretical guarantee for the optimality of parameter sharing. Furthermore, to handle the increased complexity and diversity of observations that this division introduces, Dspic incorporates a diffusion policy, strengthening its capacity to model complex action distributions while keeping learning efficient. Extensive experiments on MaMuJoCo, SMAC, SMACv2, and LBF demonstrate that Dspic achieves superior sample efficiency while preserving asymptotic optimality.
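To make the mechanism concrete, below is a minimal PyTorch sketch of a diffusion (denoising) policy head conditioned on a per-agent role embedding, in the spirit of the abstract. This is an illustrative assumption, not the paper's implementation: the class name DiffusionPolicy, the role_dim conditioning, the linear noise schedule, and the step count n_steps are all placeholders introduced here.

import torch
import torch.nn as nn

class DiffusionPolicy(nn.Module):
    # Toy denoising policy head (NOT the paper's architecture): predicts the
    # noise added to an action, conditioned on an observation and a
    # hypothetical role embedding.
    def __init__(self, obs_dim, act_dim, role_dim, hidden=128, n_steps=10):
        super().__init__()
        self.n_steps = n_steps
        # Linear noise schedule; cumulative products stored as buffers.
        betas = torch.linspace(1e-4, 0.02, n_steps)
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))
        self.net = nn.Sequential(
            nn.Linear(obs_dim + role_dim + act_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, role, noisy_act, t):
        # Normalize the diffusion step to [0, 1] and concatenate all inputs.
        t_feat = (t.float() / self.n_steps).unsqueeze(-1)
        return self.net(torch.cat([obs, role, noisy_act, t_feat], dim=-1))

    def loss(self, obs, role, act):
        # Standard denoising objective: predict the noise injected at a
        # uniformly sampled diffusion step.
        t = torch.randint(0, self.n_steps, (obs.shape[0],), device=obs.device)
        eps = torch.randn_like(act)
        a_bar = self.alphas_bar[t].unsqueeze(-1)
        noisy = a_bar.sqrt() * act + (1 - a_bar).sqrt() * eps
        return ((self.forward(obs, role, noisy, t) - eps) ** 2).mean()

    @torch.no_grad()
    def sample(self, obs, role):
        # Reverse diffusion: start from Gaussian noise and iteratively denoise
        # to produce an action, allowing multi-modal action distributions.
        act = torch.randn(obs.shape[0], self.net[-1].out_features,
                          device=obs.device)
        for t in reversed(range(self.n_steps)):
            t_batch = torch.full((obs.shape[0],), t, device=obs.device)
            eps = self.forward(obs, role, act, t_batch)
            alpha, a_bar = 1.0 - self.betas[t], self.alphas_bar[t]
            act = (act - (1 - alpha) / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
            if t > 0:
                act = act + self.betas[t].sqrt() * torch.randn_like(act)
        return act

Under this reading, training minimizes the denoising objective in loss() and acting calls sample(), which runs the reverse diffusion chain; the role embedding supplied to both would come from the self-supervised module the abstract describes.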