

Poster in Workshop: Foundations of Reinforcement Learning and Control: Connections and Perspectives

Optimality of Stationary Policies in Risk-averse Total-reward MDPs with EVaR

Xihong Su · Marek Petrik · Julien Grand-ClĂ©ment


Abstract:

The risk-neutral discounted objective is popular in reinforcement learning, in part due to the existence of stationary optimal policies and convenient analysis based on contracting Bellman operators. Unfortunately, for some common risk-averse discounted objectives, such as Value at Risk (VaR) and Conditional Value at Risk (CVaR), optimal policies must be history-dependent and must be computed using complex state augmentation. In this paper, we show that the risk-averse total-reward objective, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR), can be optimized by a stationary policy, an important property for practical implementations. In addition, an optimal policy can be efficiently computed using value iteration, policy iteration, and even linear programming. Importantly, our results require only the relatively mild condition of transient MDPs and allow for both positive and negative rewards, unlike prior work that requires assumptions on the sign of the rewards. Overall, our results suggest that the total-reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning problems.
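
To make the total-reward ERM objective concrete, here is a minimal sketch of value iteration with the standard entropic-risk Bellman update on a small transient MDP. It is illustrative only and not the paper's exact algorithm: the transition kernel `P`, reward matrix `r`, risk level `beta`, and the toy chain are hypothetical, and the EVaR optimization over ERM levels is not reproduced here.

```python
import numpy as np

def erm_value_iteration(P, r, beta, n_iters=1000, tol=1e-10):
    """Illustrative sketch: value iteration for a total-reward ERM objective.

    Assumes the entropic-risk Bellman update
        v(s) = max_a [ r(s,a) - (1/beta) * log sum_{s'} P(s'|s,a) exp(-beta * v(s')) ]
    P: (S, A, S) transition kernel, r: (S, A) rewards, beta > 0 risk level.
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(n_iters):
        # Certainty equivalent (entropic risk) of the successor value for each (s, a).
        cert_equiv = -np.log(P @ np.exp(-beta * v)) / beta   # shape (S, A)
        q = r + cert_equiv
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)   # stationary deterministic policy, no state augmentation
    return v, policy

# Hypothetical transient MDP: states 0 and 1 eventually reach absorbing state 2 (reward 0).
P = np.zeros((3, 2, 3))
P[0, 0] = [0.0, 0.9, 0.1]
P[0, 1] = [0.0, 0.0, 1.0]
P[1, 0] = [0.0, 0.0, 1.0]
P[1, 1] = [0.0, 0.5, 0.5]
P[2, :, 2] = 1.0
r = np.array([[1.0, 0.5], [2.0, -1.0], [0.0, 0.0]])

v, pi = erm_value_iteration(P, r, beta=0.5)
print("ERM values:", v, "policy:", pi)
```

In this sketch the absorbing state carries zero reward, so the iteration settles on a fixed point and the greedy policy it returns is stationary, which is the property the abstract highlights for the ERM and EVaR total-reward objectives.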
