Conditional Random Fields for Multi-agent Reinforcement Learning
Xinhua Zhang - CSL, RSISE, Australian National University, and SML NICTA, Australia
Douglas Aberdeen - NICTA, Australian National University, Australia
S.V.N. Vishwanathan - SML NICTA, and CSL, RSISE, Australian National University, Australia
Conditional random fields (CRFs) are graphical models for modeling the probability of labels given observations. They have traditionally been trained on a set of observation and label pairs. Underlying all CRFs is the assumption that, conditioned on the training data, the labels are independent and identically distributed (iid). In this paper we explore the use of CRFs in a class of temporal learning algorithms, namely policy-gradient reinforcement learning (RL). In this setting the labels are no longer iid: they are actions that update the environment and affect the next observation. From an RL point of view, CRFs provide a natural way to model joint actions in a decentralized Markov decision process. They define how agents can communicate with each other to choose the optimal joint action. Our experiments include a synthetic network alignment problem, a distributed sensor network, and road traffic control; in all of them our approach clearly outperforms RL methods that do not model the proper joint policy.
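To make the idea concrete, the sketch below shows (under assumptions of ours, not taken from the paper) how a chain-structured CRF can define a stochastic joint policy: each agent i has an observation o_i, node potentials score each agent's action against its observation, and edge potentials couple neighbouring agents' actions, so the joint action distribution factorizes over the chain. The weight names `w_node` and `w_edge` and the brute-force normalization are purely illustrative; a real implementation would use message passing instead of enumeration.

```python
import itertools
import numpy as np

def joint_action_probs(obs, w_node, w_edge):
    """Joint action distribution p(a | o) of a chain CRF over agents.

    p(a | o) is proportional to
        prod_i exp(w_node[a_i] . o_i) * prod_i exp(w_edge[a_i, a_{i+1}])
    Potentials are hypothetical; normalization is by brute-force
    enumeration, which is exact but only feasible for few agents.
    """
    n_agents = len(obs)
    n_actions = w_node.shape[0]
    joints = list(itertools.product(range(n_actions), repeat=n_agents))
    scores = []
    for a in joints:
        # Node potentials: each agent's action scored against its observation.
        s = sum(w_node[a[i]] @ obs[i] for i in range(n_agents))
        # Edge potentials: coupling between neighbouring agents' actions.
        s += sum(w_edge[a[i], a[i + 1]] for i in range(n_agents - 1))
        scores.append(s)
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())  # subtract max for stability
    probs /= probs.sum()
    return joints, probs

# Toy example: 3 agents, binary actions, 2-dimensional observations.
rng = np.random.default_rng(0)
obs = [rng.normal(size=2) for _ in range(3)]
w_node = rng.normal(size=(2, 2))   # (n_actions, obs_dim)
w_edge = rng.normal(size=(2, 2))   # (n_actions, n_actions)
joints, probs = joint_action_probs(obs, w_node, w_edge)
joint_action = joints[rng.choice(len(joints), p=probs)]  # sample a joint action
```

A policy-gradient method would then adjust `w_node` and `w_edge` in the direction that increases the log-probability of joint actions that led to high reward.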