Trajectory Diversity for Zero-Shot Coordination
Andrei Lupu · Brandon Cui · Hengyuan Hu · Jakob Foerster

Tue Jul 20 05:40 PM -- 05:45 PM (PDT)

We study the problem of zero-shot coordination (ZSC), where agents must independently produce strategies for a collaborative game that are compatible with novel partners not seen during training. Our first contribution is to consider the need for diversity in generating such agents. Because self-play (SP) agents control their own trajectory distribution during training, each policy typically performs well only on this exact distribution. As a result, SP agents achieve low scores in ZSC, since playing with another agent is likely to put them in situations they have not encountered during training. To address this issue, we train a common best response (BR) to a population of agents, which we regularize to be diverse. To this end, we introduce Trajectory Diversity (TrajeDi), a differentiable objective for generating diverse reinforcement learning policies. We derive TrajeDi as a generalization of the Jensen-Shannon divergence between policies and motivate it experimentally in two simple settings. We then focus on the collaborative card game Hanabi, demonstrating the scalability of our method and improving upon the cross-play scores of both independently trained SP agents and BRs to unregularized populations.
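
To make the starting point of that derivation concrete, the sketch below computes the standard uniform-weight generalized Jensen-Shannon divergence between the trajectory distributions of a small population of policies: the entropy of the population's mixture distribution minus the average entropy of each member. This is not TrajeDi itself, which the abstract describes as a generalization of this quantity; it is only a minimal illustration under stated assumptions. The toy environment (two steps, the state is just the step index, actions "L" or "R", deterministic transitions) and every name here (`trajectory_prob`, `generalized_jsd`, `population_jsd`, `TRAJECTORIES`) are hypothetical.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

Policy = Dict[int, Dict[str, float]]  # hypothetical: state -> {action: probability}
Trajectory = List[Tuple[int, str]]    # hypothetical: sequence of (state, action) pairs


def trajectory_prob(policy: Policy, trajectory: Trajectory) -> float:
    """Probability that the policy produces this trajectory.

    Assumes deterministic transitions, so the trajectory probability is just
    the product of the policy's action probabilities along the trajectory.
    """
    prob = 1.0
    for state, action in trajectory:
        prob *= policy[state].get(action, 0.0)
    return prob


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def generalized_jsd(dists: Sequence[Sequence[float]]) -> float:
    """Uniform-weight JSD of n distributions over the same outcomes:
    H(mixture) - mean_i H(p_i)."""
    n = len(dists)
    mixture = [sum(column) / n for column in zip(*dists)]
    jsd = entropy(mixture) - sum(entropy(d) for d in dists) / n
    # JSD is non-negative; clamp tiny negative floating-point error (and -0.0).
    return max(0.0, jsd)


# Hypothetical toy setting: two steps, state = step index, actions "L" or "R".
TRAJECTORIES: List[Trajectory] = [
    [(0, a0), (1, a1)] for a0, a1 in itertools.product("LR", repeat=2)
]


def population_jsd(population: Sequence[Policy]) -> float:
    """Diversity of a population, measured over its full trajectory distributions."""
    dists = [[trajectory_prob(pi, tau) for tau in TRAJECTORIES] for pi in population]
    return generalized_jsd(dists)


always_left: Policy = {0: {"L": 1.0}, 1: {"L": 1.0}}
always_right: Policy = {0: {"R": 1.0}, 1: {"R": 1.0}}
uniform: Policy = {0: {"L": 0.5, "R": 0.5}, 1: {"L": 0.5, "R": 0.5}}

print(population_jsd([always_left, always_left]))   # identical policies: 0.0
print(population_jsd([always_left, always_right]))  # disjoint trajectories: ~0.693 (= log 2)
```

A population of identical policies scores zero, while two policies whose trajectories never overlap reach the maximum of log 2 for two members. Because every term is a smooth function of the action probabilities wherever they are non-zero, a quantity of this form can be differentiated with respect to policy parameters and used as a diversity regularizer, which is the role the abstract assigns to TrajeDi.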

Author Information

Andrei Lupu (Mila, McGill University)
Brandon Cui (Facebook AI Research)
Hengyuan Hu (Facebook AI Research)
Jakob Foerster (Facebook AI Research)