Topological Experience Replay for Fast Q-Learning
Zhang-Wei Hong · Tao Chen · Yen-Chen Lin · Joni Pajarinen · Pulkit Agrawal

State-of-the-art deep Q-learning methods update Q-values using state transition tuples sampled from the experience replay buffer. Often, data is sampled uniformly at random or prioritized according to measures such as the temporal-difference (TD) error. Such sampling strategies are agnostic to the structure of the Markov decision process (MDP) and can therefore be data-inefficient at propagating reward signals from goal states back to the initial state. To accelerate reward propagation, we make use of the MDP structure by organizing the agent's experience into a graph, where each edge represents a transition between two connected states. We perform value backups via a breadth-first search that expands vertices in the graph starting from the set of terminal states and successively moving backward. We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of sparse-reward tasks. Notably, the proposed method also outperforms baselines that have the advantage of a much larger computational budget.
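The backward value propagation described above can be illustrated with a minimal sketch: index the replay buffer's transitions by their successor state, then run a breadth-first search from the terminal states, backing up each incoming edge one step. All names, the tabular Q representation, and the single-visit BFS (which ignores cycles) are simplifying assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict, deque

def reverse_bfs_backup(transitions, terminal_states, gamma=0.99):
    """Illustrative reverse-BFS value backup over a transition graph.

    transitions: iterable of (s, a, r, s_next) tuples from a replay buffer.
    terminal_states: set of terminal states (value 0 by convention).
    Returns a dict mapping (s, a) -> backed-up Q-value.
    (Sketch only: tabular, single visit per state, cycles not revisited.)
    """
    # Index incoming edges: for each state, the transitions leading into it.
    incoming = defaultdict(list)
    for s, a, r, s_next in transitions:
        incoming[s_next].append((s, a, r))

    Q = defaultdict(float)   # Q-value per (state, action) edge
    V = defaultdict(float)   # best value found so far per state

    queue = deque(terminal_states)
    visited = set(terminal_states)
    while queue:
        s_next = queue.popleft()
        # Back up every edge (s, a) -> s_next by one Bellman step.
        for s, a, r in incoming[s_next]:
            Q[(s, a)] = max(Q[(s, a)], r + gamma * V[s_next])
            V[s] = max(V[s], Q[(s, a)])
            if s not in visited:
                visited.add(s)
                queue.append(s)
    return Q
```

On a two-step chain with a single reward of 1 at the terminal transition, the backup propagates the reward to the initial state in one sweep, discounted once by gamma.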

Author Information

Zhang-Wei Hong (MIT)
Tao Chen (Massachusetts Institute of Technology)
Yen-Chen Lin (MIT)
Joni Pajarinen (Aalto University)
Pulkit Agrawal (MIT)