Noise as a Natural Regularizer in Markov Decision Processes: Connecting Environmental Stochasticity and Policy Simplicity
Abstract
The planning horizon in a Markov Decision Process (MDP) determines how far into the future an agent reasons. In practice, shorter horizons are commonly associated with policies that exhibit simpler or more interpretable decision-making behavior. In this paper, we establish a formal connection between environmental stochasticity and planning horizon in MDPs. We show that for broad classes of transition noise, solving a noisy MDP can be formally related to solving a noise-free MDP with a smaller effective discount factor, leading to identical optimal policies in some cases and near-optimal ones in others. We further characterize settings in which this correspondence breaks down, clarifying when horizon-based interpretations of noise are not valid. These results, supported by both theory and experiments, also shed light on the common practice of using smaller discount factors in reinforcement learning than can be justified by typical grounded interpretations of the discount factor, such as inflation or the probability of catastrophic failure.