Timezone: »
We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge in the form of reward machines is available to the learner. Specifically, we investigate the efficiency of RL under the average-reward criterion, in the regret minimization setting. We propose two model-based RL algorithms that each exploits the structure of the reward machines, and show that our algorithms achieve regret bounds that improve over those of baselines by a multiplicative factor proportional to the number of states in the underlying reward machine. To the best of our knowledge, the proposed algorithms and associated regret bounds are the first to tailor the analysis specifically to reward machines, either in the episodic or average-reward settings. We also present a regret lower bound for the studied setting, which indicates that the proposed algorithms achieve a near-optimal regret. Finally, we report numerical experiments that demonstrate the superiority of the proposed algorithms over existing baselines in practice.
Author Information
Hippolyte Bourel (University of Copenhagen)
Anders Jonsson (Universitat Pompeu Fabra)
Odalric-Ambrym Maillard (Inria Lille - Nord Europe)
Mohammad Sadegh Talebi (University of Copenhagen)
More from the Same Authors
-
2021 : Collision Resolution in Multi-player Bandits Without Observing Collision Information »
Eleni Nisioti · Nikolaos Thomos · Boris Bellalta · Anders Jonsson -
2021 Poster: Optimal Thompson Sampling strategies for support-aware CVaR bandits »
Dorian Baudry · Romain Gautron · Emilie Kaufmann · Odalric-Ambrym Maillard -
2021 Spotlight: Optimal Thompson Sampling strategies for support-aware CVaR bandits »
Dorian Baudry · Romain Gautron · Emilie Kaufmann · Odalric-Ambrym Maillard -
2021 Poster: Fast active learning for pure exploration in reinforcement learning »
Pierre Menard · Omar Darwiche Domingues · Anders Jonsson · Emilie Kaufmann · Edouard Leurent · Michal Valko -
2021 Spotlight: Fast active learning for pure exploration in reinforcement learning »
Pierre Menard · Omar Darwiche Domingues · Anders Jonsson · Emilie Kaufmann · Edouard Leurent · Michal Valko -
2020 Poster: Restarted Bayesian Online Change-point Detector achieves Optimal Detection Delay »
REDA ALAMI · Odalric-Ambrym Maillard · Raphaël Féraud -
2020 Poster: Tightening Exploration in Upper Confidence Reinforcement Learning »
Hippolyte Bourel · Odalric-Ambrym Maillard · Mohammad Sadegh Talebi