Comparing Deterministic and Soft Policy Gradients for Optimizing Gaussian Mixture Actors
Sheelabhadra Dey ⋅ Guni Sharon
Abstract
Gaussian Mixture Models (GMMs) have recently been proposed for approximating actors in actor-critic reinforcement learning algorithms. Such GMM-based actors are commonly optimized using stochastic policy gradients together with an entropy maximization objective. In contrast to previous work, we define and study deterministic policy gradients for optimizing GMM-based actors. Like stochastic gradient approaches, our proposed method, denoted $\textit{Gaussian Mixture Deterministic Policy Gradient}$ (Gamid-PG), encourages policy entropy maximization. To this end, we define the GMM entropy gradient using a $\textit{Variational Approximation}$ of the $KL$-divergence between the GMM's constituent Gaussians. We compare Gamid-PG with common stochastic policy gradient methods on benchmark dense-reward MuJoCo tasks and sparse-reward Fetch tasks. We observe that Gamid-PG outperforms stochastic gradient-based methods in three of the six MuJoCo tasks while performing similarly on the remaining three. In the Fetch tasks, Gamid-PG outperforms single-actor deterministic gradient-based methods but performs worse than stochastic policy gradient methods. Consequently, we conclude that GMMs optimized using deterministic policy gradients (1) should be favorably considered over stochastic gradients in dense-reward continuous control tasks, and (2) improve upon single-actor deterministic gradients.
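The abstract does not state the entropy estimate itself, so the following is only an illustrative sketch of one well-known variational form, not necessarily the one used in Gamid-PG. It assumes diagonal-covariance components with weights $\pi_a$ and uses $H(f) \approx \sum_a \pi_a H(f_a) - \sum_a \pi_a \log \sum_b \pi_b \exp(-KL(f_a \| f_b))$, which depends only on closed-form per-component entropies and pairwise $KL$-divergences. The function names and the diagonal-covariance restriction are assumptions made for this example.

```python
import numpy as np


def gaussian_entropy(var):
    """Entropy of each diagonal Gaussian; var has shape (K, D)."""
    d = var.shape[1]
    return 0.5 * (d * (1.0 + np.log(2.0 * np.pi)) + np.sum(np.log(var), axis=1))


def pairwise_kl(mu, var):
    """KL(f_a || f_b) for all component pairs; returns a (K, K) matrix indexed [a, b]."""
    mu_a, mu_b = mu[:, None, :], mu[None, :, :]
    var_a, var_b = var[:, None, :], var[None, :, :]
    terms = np.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0
    return 0.5 * np.sum(terms, axis=2)


def gmm_entropy_variational(weights, mu, var):
    """Variational estimate of GMM entropy (hypothetical helper, diagonal covariances)."""
    kl = pairwise_kl(mu, var)
    # log sum_b pi_b * exp(-KL(f_a || f_b)), one value per component a
    inner = np.log(np.sum(weights[None, :] * np.exp(-kl), axis=1))
    return float(np.sum(weights * (gaussian_entropy(var) - inner)))
```

Two sanity checks follow from the formula: when all components coincide, the estimate reduces to the entropy of a single Gaussian, and when the components are far apart it approaches the weighted component entropy plus the entropy of the mixing weights. Because every term is differentiable in the means, variances, and weights, an estimate of this kind can be differentiated to obtain an entropy gradient for the actor.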