Session
Bandits and Multiagent Learning
Decentralized Exploration in Multi-Armed Bandits
Raphaël Féraud · Reda Alami · Romain Laroche
We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to ensure privacy in the best arm identification problem for asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach strikes a good balance between conflicting interests: the providers optimize their services while protecting the privacy of users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting all the messages concerning a single user. We provide a generic algorithm, DECENTRALIZED ELIMINATION, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm ensures privacy with a low communication cost, and that, in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Then, thanks to the genericity of the approach, we extend the proposed algorithm to non-stationary bandits. Finally, experiments illustrate and complete the analysis.
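As a concrete illustration, here is a minimal Python sketch of the generic scheme, assuming an elimination-style best arm identification subroutine supplied by the caller; the unanimity voting rule and all names are our assumptions, not necessarily the paper's exact protocol:

```python
def decentralized_elimination(players, arms, eliminate_locally, max_rounds):
    """Sketch: each player runs its own best-arm-identification
    subroutine on the shared active set and broadcasts only the
    identities of arms it eliminates, never raw rewards, which is
    what limits the information an eavesdropper on one user's
    messages can infer."""
    active = set(arms)
    votes = {a: set() for a in arms}
    for _ in range(max_rounds):
        for p in players:
            for a in eliminate_locally(p, frozenset(active)):
                votes[a].add(p)  # the only message exchanged
                # drop an arm once every player has voted against it
                if votes[a] == set(players) and len(active) > 1:
                    active.discard(a)
        if len(active) == 1:
            break
    return active
```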
Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback
Chicheng Zhang · Alekh Agarwal · Hal Daumé III · John Langford · Sahand Negahban
We investigate the feasibility of learning from both fully labeled supervised data and contextual bandit data. We specifically consider settings in which the underlying learning signal may differ between these two data sources. Theoretically, we present algorithms and prove no-regret guarantees for learning that is robust to divergences between the two sources. Empirically, we evaluate some of these algorithms on a large selection of datasets, showing that our approaches are feasible and helpful in practice.
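The general idea of combining the two sources can be pictured with a simple weighted objective; the sketch below is our illustration (a supervised warm-start term plus an inverse-propensity-scored bandit term), not the paper's algorithm, and all names are hypothetical:

```python
import numpy as np

def combined_loss(theta, policy, sup_data, bandit_data, lam, loss_fn):
    """lam in [0, 1] controls trust in the supervised warm-start
    source; a robust method would tune lam down when the two
    feedback signals diverge."""
    # supervised term: ordinary average loss on fully labeled pairs
    sup = np.mean([loss_fn(theta, x, y) for x, y in sup_data])
    # bandit term: inverse-propensity scoring gives an unbiased
    # estimate of the policy's cost from logged partial feedback
    band = np.mean([(cost / p) if policy(theta, x) == a else 0.0
                    for x, a, cost, p in bandit_data])
    return lam * sup + (1.0 - lam) * band
```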
Exploiting structure of uncertainty for efficient matroid semi-bandits
Pierre Perrault · Vianney Perchet · Michal Valko
We improve the efficiency of algorithms for stochastic \emph{combinatorial semi-bandits}. In most interesting problems, state-of-the-art algorithms take advantage of structural properties of rewards, such as \emph{independence}. However, while being minimax optimal in terms of regret, these algorithms are intractable. In our paper, we first reduce their implementation to a specific \emph{submodular maximization} problem. Then, in the case of \emph{matroid} constraints, we design adapted approximation routines, thereby providing the first efficient algorithms that exploit the reward structure. In particular, we improve the state-of-the-art efficient gap-free regret bound by a factor of $\sqrt{k}$, where $k$ is the maximum action size. Finally, we show how our improvement translates to the more general \emph{budgeted combinatorial semi-bandits}.
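For intuition, the kind of combinatorial oracle involved can be sketched as the classic greedy routine for maximizing a set function under a matroid constraint, with the function and independence test given as callables; this is a generic sketch, not the paper's specialized (and faster) routines:

```python
def greedy_matroid_max(ground_set, f, independent):
    """Greedy maximization of a set function f subject to a matroid
    constraint: repeatedly add the feasible element with the largest
    marginal gain until no element improves the objective."""
    S = set()
    candidates = set(ground_set)
    while candidates:
        # marginal gain of each element whose addition keeps S independent
        gains = {e: f(S | {e}) - f(S)
                 for e in candidates if independent(S | {e})}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S.add(best)
        candidates.discard(best)
    return S
```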
PAC Identification of Many Good Arms in Stochastic Multi-Armed Bandits
Arghya Roy Chaudhuri · Shivaram Kalyanakrishnan
We consider the problem of identifying any k out of the best m arms in an n-armed stochastic multi-armed bandit. Framed in the PAC setting, this particular problem generalises both the problem of ``best subset selection'' (Kalyanakrishnan & Stone, 2010) and that of selecting ``one out of the best m'' arms (Roy Chaudhuri & Kalyanakrishnan, 2017). In applications such as crowd-sourcing and drug design, identifying a single good solution is often not sufficient. Moreover, finding the best subset might be hard due to the presence of many indistinguishably close solutions. Our generalisation of identifying exactly k arms out of the best m, where 1 ≤ k ≤ m, serves as a more effective alternative. We present a lower bound on the worst-case sample complexity for general k, and a fully sequential PAC algorithm, LUCB-k-m, which is more sample-efficient on easy instances. Also, extending our analysis to infinite-armed bandits, we present a PAC algorithm that is independent of n and identifies an arm from the best ρ fraction of arms using at most a poly-logarithmic additive number of samples over the lower bound, thereby improving over Roy Chaudhuri & Kalyanakrishnan (2017) and Aziz et al. (2018). The problem of identifying k > 1 distinct arms from the best ρ fraction is not always well-defined; for a special class of this problem, we present lower and upper bounds. Finally, through a reduction, we establish a relation between upper bounds for the ``one out of the best ρ'' problem for infinite instances and the ``one out of the best m'' problem for finite instances. We conjecture that it is more efficient to solve ``small'' finite instances using the latter formulation, rather than going through the former.
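For intuition, fully sequential LUCB-style methods maintain per-arm confidence intervals of the following shape and repeatedly sample the arms whose intervals are most ambiguous until candidate and challenger arms separate; the exploration rate below is a standard choice for this family, not necessarily the constant used by LUCB-k-m:

```python
import math

def confidence_bounds(means, counts, t, delta, n_arms):
    """Per-arm lower/upper confidence bounds after `counts` pulls,
    at global time t and failure probability delta (each arm is
    assumed to have been pulled at least once)."""
    beta = math.log(5.0 * n_arms * t**4 / (4.0 * delta))
    lcb = [m - math.sqrt(beta / (2.0 * c)) for m, c in zip(means, counts)]
    ucb = [m + math.sqrt(beta / (2.0 * c)) for m, c in zip(means, counts)]
    return lcb, ucb
```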
Contextual Multi-armed Bandit Algorithm for Semiparametric Reward Model
Gi-Soo Kim · Myunghee Cho Paik
Contextual multi-armed bandit (MAB) algorithms have shown promise for maximizing cumulative rewards in sequential decision tasks such as news article recommendation, web page ad placement, and mobile health. However, most proposed contextual MAB algorithms assume a linear relationship between the reward and the context of the action. This paper proposes a new contextual MAB algorithm for a relaxed, semiparametric reward model that supports nonstationarity. The proposed method is less restrictive, easier to implement, and faster than two alternative algorithms that consider the same model, while achieving a tight upper bound on regret. We prove that the high-probability upper bound on the regret incurred by the proposed algorithm has the same order as that of the Thompson sampling algorithm for linear reward models. The proposed and existing algorithms are evaluated via simulation and are also applied to Yahoo! news article recommendation log data.
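Concretely, the semiparametric relaxation can be written, in notation commonly used for this setting (our shorthand, not necessarily the paper's), as
\[
r_t \;=\; \nu(t) \;+\; x_{a_t}^{\top}\mu \;+\; \eta_t,
\]
where $\nu(t)$ is an arbitrary, possibly nonstationary baseline shared by all arms, $x_{a_t}$ is the context of the chosen arm, $\mu$ is the unknown parameter, and $\eta_t$ is noise; the linear model is recovered when $\nu \equiv 0$, and since $\nu(t)$ is common to all arms in a given round, it cancels when arms are compared, which is why regret can still be controlled.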
Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning
Jakob Foerster · Francis Song · Edward Hughes · Neil Burch · Iain Dunning · Shimon Whiteson · Matthew Botvinick · Michael Bowling
When observing the actions of others, humans make inferences about why the others acted as they did, and what this implies about their view of the world. Humans also use the fact that their actions will be interpreted in this manner when observed by others, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the \emph{Bayesian action decoder} (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. Together with the public belief, this Bayesian update effectively defines a new Markov decision process, the \emph{public belief MDP}, in which the action space consists of deterministic partial policies, parameterised by neural networks, that can be sampled for a given public state. BAD exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over partial policies mapping private information into environment actions. The Bayesian update is also closely related to the \emph{theory of mind} reasoning that humans carry out when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms policy gradient methods. We then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where in the two-player setting the method surpasses all previously published learning and hand-coded approaches.
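To make the mechanism concrete, the approximate Bayesian update behind the public belief can be written schematically (our shorthand, not the paper's exact notation) as
\[
\mathcal{B}_{t+1}(s) \;\propto\; \mathcal{B}_t(s)\,\Pr\!\left(a_t \mid s, \hat{\pi}_t\right),
\]
where $\mathcal{B}_t$ is the common belief over private state features $s$, $a_t$ is the publicly observed action, and $\hat{\pi}_t$ is the sampled deterministic partial policy; observing $a_t$ rules out every private state under which $\hat{\pi}_t$ would have acted differently, which is exactly how actions become informative.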
TarMAC: Targeted Multi-Agent Communication
Abhishek Das · Theophile Gervet · Joshua Romoff · Dhruv Batra · Devi Parikh · Michael Rabbat · Joelle Pineau
We propose a targeted communication architecture for multi-agent reinforcement learning, where agents learn both \emph{what} messages to send and \emph{whom} to address them to while performing cooperative tasks in partially-observable environments. This targeting behavior is learnt solely from downstream task-specific reward without any communication supervision. We additionally augment this with a multi-round communication approach where agents coordinate via multiple rounds of communication before taking actions in the environment.
We evaluate our approach on a diverse set of cooperative multi-agent tasks of varying difficulty, with varying numbers of agents, in environments ranging from 2D grid layouts of shapes and simulated traffic junctions to 3D indoor environments, and demonstrate the benefits of targeted and multi-round communication. Moreover, we show that the targeted communication strategies learned by agents are interpretable and intuitive.
Finally, we show that our architecture can be easily extended to mixed and competitive environments, leading to improved performance and sample complexity over recent state-of-the-art approaches.
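The targeting mechanism can be sketched as standard dot-product attention between sender signatures and a receiver query; shapes and names below are our assumptions, not the paper's code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def targeted_messages(signatures, values, query):
    """Each sender emits a key ('signature') and a message value; a
    receiver's learned query attends over the keys, so messages are
    softly addressed to relevant receivers rather than broadcast
    uniformly."""
    # attention weights: how relevant each sender is to this receiver
    scores = np.array([query @ k for k in signatures])
    weights = softmax(scores / np.sqrt(len(query)))
    # aggregated incoming message for this receiver
    return sum(w * v for w, v in zip(weights, values))
```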
QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning
Kyunghwan Son · Daewoo Kim · Wan Ju Kang · David Earl Hostallero · Yung Yi
We explore value-based solutions for multi-agent reinforcement learning (MARL) tasks in the centralized training with decentralized execution (CTDE) regime popularized recently. VDN and QMIX are representative examples that use the idea of factorizing the joint action-value function into individual ones for decentralized execution. However, VDN and QMIX address only a fraction of factorizable MARL tasks due to their structural constraints in factorization, such as additivity and monotonicity. In this paper, we propose a new factorization method for MARL, QTRAN, which is free from such structural constraints and takes a new approach: transforming the original joint action-value function into an easily factorizable one with the same optimal actions. QTRAN guarantees successful factorization of any factorizable task, thus covering a much wider class of MARL tasks than VDN or QMIX does. Our experiments on multi-domain Gaussian-squeeze and modified predator-prey tasks demonstrate QTRAN's superior performance, with especially large margins in games whose payoffs penalize non-cooperative behavior more aggressively.
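The key sufficient condition has roughly the following shape (our paraphrase; see the paper for the precise statement):
\[
\sum_{i=1}^{N} Q_i(\tau_i, u_i) \;-\; Q_{jt}(\boldsymbol{\tau}, \boldsymbol{u}) \;+\; V_{jt}(\boldsymbol{\tau}) \;\ge\; 0,
\]
with equality when each $u_i$ is the greedy action of the corresponding individual $Q_i$; this ensures that the individual argmax actions jointly recover an optimal joint action without imposing additivity or monotonicity on the factors themselves.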
Actor-Attention-Critic for Multi-Agent Reinforcement Learning
Shariq Iqbal · Fei Sha
Reinforcement learning in multi-agent scenarios is important for real-world applications but presents challenges beyond those seen in single-agent settings. We present an actor-critic algorithm that trains decentralized policies in multi-agent settings, using centrally computed critics that share an attention mechanism selecting relevant information for each agent at every timestep. This attention mechanism enables more effective and scalable learning in complex multi-agent environments than recent approaches. Our approach is applicable not only to cooperative settings with shared rewards, but also to settings with individualized rewards, including adversarial settings, as well as settings that do not provide global states, and it makes no assumptions about the action spaces of the agents. As such, it is flexible enough to be applied to most multi-agent learning problems.
Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation on Multi-Agent Reinforcement Learning
Thinh Doan · Siva Maguluri · Justin Romberg
We study the policy evaluation problem in multi-agent reinforcement learning. In this problem, a group of agents works cooperatively to evaluate the value function for the global discounted cumulative reward, which is composed of local rewards observed by the agents. Over a series of time steps, the agents act, get rewarded, update their local estimates of the value function, and then communicate with their neighbors. The local update at each agent can be interpreted as a distributed, consensus-based variant of the popular temporal difference learning algorithm TD(0).
While distributed reinforcement learning algorithms have been presented in the literature, almost nothing is known about their convergence rates. Our main contribution is a finite-time analysis of the convergence of the distributed TD(0) algorithm, under a generally time-varying communication network between the agents. We obtain an explicit upper bound on the rate of convergence of this algorithm as a function of the network topology and the discount factor. Our results mirror what we would expect from using distributed stochastic gradient descent for solving convex optimization problems.
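One consensus-based update of this kind can be sketched in a few lines; the mixing-then-TD structure below reflects the standard form of such algorithms, while variable names and the exact placement of the TD correction are our assumptions:

```python
import numpy as np

def consensus_td0_step(thetas, W, phi_s, phi_s_next, local_rewards,
                       gamma, alpha):
    """One distributed TD(0) step with linear function approximation:
    each agent first averages its neighbors' parameter vectors through
    a doubly stochastic weight matrix W (the consensus step), then
    takes a TD step using only its own local reward."""
    n = len(thetas)
    new_thetas = []
    for i in range(n):
        # consensus: weighted average of neighbors' current parameters
        mixed = sum(W[i][j] * thetas[j] for j in range(n))
        # local TD(0) error with linear value estimate phi(s)^T theta
        delta = (local_rewards[i]
                 + gamma * (phi_s_next @ thetas[i])
                 - (phi_s @ thetas[i]))
        new_thetas.append(mixed + alpha * delta * phi_s)
    return new_thetas
```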