

Session: Multi-Agent Learning 1


Thu 12 July 2:00 - 2:20 PDT

Learning Policy Representations in Multiagent Systems

Aditya Grover · Maruan Al-Shedivat · Jayesh K. Gupta · Yura Burda · Harrison Edwards

Modeling agent behavior is central to understanding the emergence of complex phenomena in multiagent systems. Prior work in agent modeling has largely been task-specific and driven by hand-engineering domain-specific prior knowledge. We propose a general learning framework for modeling agent behavior in any multiagent system using only a small amount of interaction data. Our framework casts agent modeling as a representation learning problem. Consequently, we construct a novel objective inspired by imitation learning and agent identification and design an algorithm for unsupervised learning of representations of agent policies. We empirically demonstrate the utility of the proposed framework in (i) a challenging high-dimensional competitive environment for continuous control and (ii) a cooperative environment for communication, on supervised predictive tasks, unsupervised clustering, and policy optimization using deep reinforcement learning.
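The abstract describes an objective that combines imitation learning with agent identification to learn policy embeddings without supervision. As a rough illustration only, the PyTorch sketch below pairs a behavioral-cloning loss with a triplet loss over episode embeddings; the module names, network sizes, and exact form of the losses are assumptions, not the authors' implementation.

```python
# Hypothetical sketch: imitation + agent-identification objective for
# learning policy embeddings. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodeEncoder(nn.Module):
    """Maps a sequence of (obs, action) pairs to a fixed-size policy embedding."""
    def __init__(self, obs_dim, act_dim, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, emb_dim, batch_first=True)

    def forward(self, obs_act_seq):            # (B, T, obs_dim + act_dim)
        _, h = self.rnn(obs_act_seq)
        return h.squeeze(0)                    # (B, emb_dim)

class ConditionedPolicy(nn.Module):
    """Predicts an agent's action from its observation and its embedding."""
    def __init__(self, obs_dim, emb_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + emb_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, obs, emb):
        return self.net(torch.cat([obs, emb], dim=-1))  # action logits

def representation_loss(encoder, policy, anchor_ep, positive_ep, negative_ep,
                        obs, actions):
    """Imitation loss on the anchor episode plus a triplet loss that pulls
    together embeddings of episodes from the same agent and pushes apart
    embeddings of episodes from a different agent."""
    z_a = encoder(anchor_ep)      # anchor and positive come from the same agent
    z_p = encoder(positive_ep)
    z_n = encoder(negative_ep)    # negative comes from a different agent
    imitation = F.cross_entropy(policy(obs, z_a), actions)
    identification = F.triplet_margin_loss(z_a, z_p, z_n)
    return imitation + identification
```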

Thu 12 July 2:20 - 2:30 PDT

Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems

Eugenio Bargiacchi · Timothy Verstraeten · Diederik Roijers · Ann Nowé · Hado van Hasselt

Learning to coordinate between multiple agents is an important problem in many reinforcement learning settings. Key to learning to coordinate is exploiting loose couplings, i.e., conditional independences between agents. In this paper we study learning in repeated fully cooperative games, multi-agent multi-armed bandits (MAMABs), in which the expected rewards can be expressed as a coordination graph. We propose multi-agent upper confidence exploration (MAUCE), a new algorithm for MAMABs that exploits loose couplings, which enables us to prove a regret bound that is logarithmic in the number of arm pulls and only linear in the number of agents. We empirically compare MAUCE to sparse cooperative Q-learning and a state-of-the-art combinatorial bandit approach, and show that it performs much better in a variety of settings, including learning control policies for wind farms.
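To make the idea of upper-confidence exploration over a coordination graph concrete, here is an illustrative Python sketch: each local reward function keeps statistics per local joint action, and the joint arm is chosen by maximizing the sum of local upper-confidence terms. The UCB1-style bonus and the brute-force maximization are simplifications; MAUCE's actual confidence bound and its graph-based maximization (e.g., variable elimination) differ.

```python
# Illustrative sketch of upper-confidence exploration over a factored
# (coordination-graph) bandit. Not the exact MAUCE bound or maximisation.
import itertools
import math
import numpy as np

class FactoredUCB:
    def __init__(self, n_agents, n_actions, groups):
        # groups: list of agent-index tuples, one per local reward function
        self.n_agents, self.n_actions, self.groups = n_agents, n_actions, groups
        self.counts = [np.zeros((n_actions,) * len(g)) for g in groups]
        self.means = [np.zeros((n_actions,) * len(g)) for g in groups]
        self.t = 0

    def select(self):
        self.t += 1
        best, best_val = None, -math.inf
        # Brute force over joint actions; variable elimination over the
        # coordination graph would exploit the loose couplings instead.
        for joint in itertools.product(range(self.n_actions), repeat=self.n_agents):
            val = 0.0
            for g, counts, means in zip(self.groups, self.counts, self.means):
                local = tuple(joint[i] for i in g)
                n = counts[local]
                bonus = math.inf if n == 0 else math.sqrt(2 * math.log(self.t) / n)
                val += means[local] + bonus
            if val > best_val:
                best, best_val = joint, val
        return best

    def update(self, joint, local_rewards):
        # local_rewards: one observed reward per group for the pulled joint arm
        for g, counts, means, r in zip(self.groups, self.counts,
                                       self.means, local_rewards):
            local = tuple(joint[i] for i in g)
            counts[local] += 1
            means[local] += (r - means[local]) / counts[local]
```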

Thu 12 July 2:30 - 2:40 PDT

Learning to Act in Decentralized Partially Observable MDPs

Jilles Dibangoye · Olivier Buffet

We address a long-standing open problem of reinforcement learning in decentralized partially observable Markov decision processes. Previous attempts focussed on different forms of generalized policy iteration, which at best led to local optima. In this paper, we restrict attention to plans, which are simpler to store and update than policies. We derive, under certain conditions, the first near-optimal cooperative multi-agent reinforcement learning algorithm. To achieve significant scalability gains, we replace the greedy maximization by mixed-integer linear programming. Experiments show our approach can learn to act near-optimally in many finite domains from the literature.
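The abstract's key computational step is replacing greedy maximization with mixed-integer linear programming when selecting decentralized decision rules. The sketch below is a loose, two-agent illustration of that idea under strong simplifying assumptions (a given occupancy distribution `occ[s, o1, o2]` and reward table `R[s, a1, a2]`, one-step lookahead, and a standard product linearization), written with the PuLP modeling library; it is not the authors' algorithm.

```python
# Hedged sketch: picking a decentralized one-step joint decision rule via a
# MILP rather than greedy maximisation. occ and R are assumed given as numpy
# arrays; two agents, discrete observations and actions.
import itertools
import numpy as np
import pulp

def milp_decision_rule(occ, R, n_obs, n_actions):
    prob = pulp.LpProblem("decentralized_rule", pulp.LpMaximize)
    # x[i][o][a] = 1 iff agent i takes action a after observing o
    x = [[[pulp.LpVariable(f"x_{i}_{o}_{a}", cat="Binary")
           for a in range(n_actions)] for o in range(n_obs)] for i in range(2)]
    # z linearizes the product x[0][o1][a1] * x[1][o2][a2]
    z = {}
    for o1, o2, a1, a2 in itertools.product(range(n_obs), range(n_obs),
                                            range(n_actions), range(n_actions)):
        z[o1, o2, a1, a2] = pulp.LpVariable(f"z_{o1}_{o2}_{a1}_{a2}",
                                            lowBound=0, upBound=1)
        prob += z[o1, o2, a1, a2] <= x[0][o1][a1]
        prob += z[o1, o2, a1, a2] <= x[1][o2][a2]
        prob += z[o1, o2, a1, a2] >= x[0][o1][a1] + x[1][o2][a2] - 1
    # Each agent picks exactly one action per local observation.
    for i, o in itertools.product(range(2), range(n_obs)):
        prob += pulp.lpSum(x[i][o]) == 1
    # Maximize expected immediate reward under the occupancy distribution.
    prob += pulp.lpSum(
        occ[s, o1, o2] * R[s, a1, a2] * z[o1, o2, a1, a2]
        for s in range(occ.shape[0])
        for o1, o2, a1, a2 in itertools.product(range(n_obs), range(n_obs),
                                                range(n_actions), range(n_actions)))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # rule[i][o] is agent i's chosen action for local observation o
    return [[int(np.argmax([pulp.value(x[i][o][a]) for a in range(n_actions)]))
             for o in range(n_obs)] for i in range(2)]
```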

Thu 12 July 2:40 - 2:50 PDT

Modeling Others using Oneself in Multi-Agent Reinforcement Learning

Roberta Raileanu · Emily Denton · Arthur Szlam · Rob Fergus

We consider the multi-agent reinforcement learning setting with imperfect information. The reward function depends on the hidden goals of both agents, so the agents must infer the other players' goals from their observed behavior in order to maximize their returns. We propose a new approach for learning in these domains: Self-Other-Modeling (SOM), in which an agent uses its own policy to predict the other agent's actions and update its belief of their hidden goal in an online manner. We evaluate this approach on three different tasks and show that the agents are able to learn better policies using their estimate of the other players' goals, in both cooperative and competitive settings.
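Since SOM's central mechanism is reusing one's own policy to infer another agent's hidden goal online, a minimal PyTorch sketch of that inner loop follows: the other agent's goal is represented by a soft belief whose logits are updated by gradient steps so that the agent's own policy, fed the other's observation and the goal estimate, better explains the other's observed action. The interface `policy_net(obs, goal)` and the hyperparameters are assumptions, not the paper's code.

```python
# Minimal sketch of online goal inference in the spirit of Self-Other-Modeling.
import torch
import torch.nn.functional as F

def update_goal_belief(policy_net, other_obs, other_action, goal_logits,
                       steps=3, lr=0.1):
    """policy_net(obs, goal) -> action logits; goal_logits parameterises a
    soft belief over the other agent's discrete hidden goal."""
    goal_logits = goal_logits.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([goal_logits], lr=lr)
    for _ in range(steps):
        goal = F.softmax(goal_logits, dim=-1)          # soft goal estimate
        action_logits = policy_net(other_obs, goal)    # own policy, other's obs
        loss = F.cross_entropy(action_logits.unsqueeze(0),
                               other_action.view(1))   # match observed action
        opt.zero_grad()
        loss.backward()
        opt.step()
    return goal_logits.detach()
```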

Thu 12 July 2:50 - 3:00 PDT

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Tabish Rashid · Mikayel Samvelyan · Christian Schroeder · Gregory Farquhar · Jakob Foerster · Shimon Whiteson

In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint-action value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
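QMIX's central architectural idea, a mixing network whose state-conditioned weights are constrained to be non-negative so that the joint action-value is monotonic in each per-agent value, can be sketched as follows in PyTorch. Layer sizes and the specific hypernetwork design here are illustrative rather than a faithful reproduction of the paper's implementation.

```python
# Hedged sketch of a QMIX-style monotonic mixing network: per-agent Q-values
# are combined by a mixer whose weights come from hypernetworks conditioned
# on the global state and are forced non-negative via abs().
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (B, n_agents) chosen per-agent Q-values; state: (B, state_dim)
        B = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(B, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(B, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(B, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(B, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(B, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2       # monotonic in each agent_q
        return q_tot.view(B, 1)
```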