Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition
Chi Jin · Tiancheng Jin · Haipeng Luo · Suvrit Sra · Tiancheng Yu

Tue Jul 14 08:00 AM -- 08:45 AM & Tue Jul 14 07:00 PM -- 07:45 PM (PDT)

We consider the task of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves Õ(L|X|√(|A|T)) regret with high probability, where L is the horizon, |X| is the number of states, |A| is the number of actions, and T is the number of episodes. To our knowledge, our algorithm is the first to ensure Õ(√T) regret in this challenging setting; in fact, it achieves the same regret bound as (Rosenberg & Mansour, 2019a), who consider the easier setting with full-information feedback. Our key contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an "upper occupancy bound".
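The idea of a loss estimator that is inversely weighted by an "upper occupancy bound" can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the function names, the representation of the confidence set as a finite list of candidate occupancy values, and the optional `gamma` bias term. It should be read as a toy version of the idea, not as the authors' algorithm.

```python
def upper_occupancy_bound(candidate_occupancies):
    """Largest visitation probability of a state-action pair over a set of
    plausible transition functions.

    Assumption: the confidence set is given as a finite list of candidate
    occupancy values; this is a simplification for illustration only.
    """
    if not candidate_occupancies:
        raise ValueError("need at least one candidate occupancy value")
    return max(candidate_occupancies)


def optimistic_loss_estimate(observed_loss, visited, upper_occupancy, gamma=0.0):
    """Loss estimate for one state-action pair under bandit feedback.

    The observed loss is divided by an upper bound on the visitation
    probability rather than by the unknown true probability, so the
    estimate is never larger than the usual importance-weighted one.
    `gamma` is a hypothetical extra non-negative bias term.
    """
    if not visited:
        # Bandit feedback: nothing is observed for pairs that were not visited.
        return 0.0
    if upper_occupancy + gamma <= 0.0:
        raise ValueError("upper_occupancy + gamma must be positive")
    return observed_loss / (upper_occupancy + gamma)


# Hypothetical numbers: three plausible occupancy values for one pair.
u = upper_occupancy_bound([0.1, 0.25, 0.2])
estimate = optimistic_loss_estimate(observed_loss=0.5, visited=True, upper_occupancy=u)
```

If the true occupancy were 0.2, the standard importance-weighted estimate would be 0.5 / 0.2 = 2.5, while the sketch above gives 0.5 / 0.25 = 2.0; underestimating the loss in this way is the sense in which the estimator is "optimistic".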

Author Information

Chi Jin (Princeton University)
Tiancheng Jin (University of Southern California)
Haipeng Luo (University of Southern California)
Suvrit Sra (MIT)
Tiancheng Yu (MIT)

I am a quant researcher at Two Sigma. I completed my PhD at MIT EECS in 2023.

More from the Same Authors