Timezone: »

Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound
Lin Yang · Mengdi Wang

Thu Jul 16 07:00 AM -- 07:45 AM & Thu Jul 16 06:00 PM -- 06:45 PM (PDT) @
Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon $H$.In this paper, we propose an online RL algorithm, namely the MatrixRL, that leverages ideas from linear bandit to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound ${O}\big(H^2d\log T\sqrt{T}\big)$ where $d$ is the number of features, independent with the number of state-action pairs. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized MatrixRL satisfies a regret bound ${O}\big(H^2\wt{d}\log T\sqrt{T}\big)$, where $\wt{d}$ is the effective dimension of the kernel space.

Author Information

Lin Yang (UCLA)
Mengdi Wang (Princeton University)

More from the Same Authors