Keywords: [ Online Learning / Bandits ] [ Reinforcement Learning Theory ] [ Reinforcement Learning - Theory ]
We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves O(√L|X|AT ) regret with high probability, where L is the horizon, |X| the number of states, |A| the number of actions, and T the number of episodes. To our knowledge, our algorithm is the first to ensure O(√T) regret in this challenging setting; in fact, it achieves the same regret as (Rosenberg & Mansour, 2019a) who consider the easier setting with full-information. Our key contributions are two-fold: a tighter confidence set for the transition function; and an optimistic loss estimator that is inversely weighted by an "upper occupancy bound".