Poster in the High-dimensional Learning Dynamics Workshop: The Emergence of Structure and Reasoning
Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution
Naoki Yoshida · Shogo Nakakita · Masaaki Imaizumi
We consider the stochastic gradient descent (SGD) algorithm for non-convex optimization with a modified learning rate, and prove its global convergence. SGD uses mini-batch sampling to compute gradients and is one of the most widely used optimization algorithms in machine learning. Despite its usefulness, it has remained an open problem to characterize where the parameters optimized by SGD converge. To answer this question, previous studies have had to either add Gaussian noise to the gradient computed by SGD or assume that the random noise on the gradient induced by mini-batch sampling is isotropic. However, adding Gaussian noise to the gradient is rarely done in practice, and the isotropic-noise assumption contradicts many empirical findings. In this study, we modify the learning rate and momentum coefficient of SGD without additive Gaussian noise or an isotropic gradient-noise assumption. We call our method \textit{Poisson SGD} and show that the distribution of the parameters trained by Poisson SGD converges to a unique probability distribution regardless of this randomness. Consequently, we show that Poisson SGD finds global optima of non-convex target functions by adjusting its temperature, under certain idealized conditions. Furthermore, we derive an upper bound on the generalization error for training with Poisson SGD. As a proof technique, we relate Poisson SGD to the Bouncy Particle Sampler (BPS) and establish its convergence using recent theoretical advances on piecewise deterministic Markov processes.
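To make the random-learning-rate idea concrete, the sketch below runs a plain SGD loop in which the step size is redrawn at every iteration from an exponential distribution scaled by a temperature parameter. This is a minimal illustration under stated assumptions, not the authors' exact algorithm: the function name, the exponential step-size draw, and the temperature scaling are illustrative choices, and the paper's precise construction of the random learning rate and momentum coefficient may differ.

```python
import numpy as np

def poisson_sgd_sketch(grad_fn, theta0, base_lr=0.1, temperature=1.0,
                       n_steps=1000, batch_sampler=None, rng=None):
    """Illustrative SGD loop with a randomized learning rate.

    Assumption: the learning rate at each step is drawn from an
    exponential distribution with mean base_lr * temperature.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        # Optional mini-batch; grad_fn receives None if no sampler is given.
        batch = batch_sampler(rng) if batch_sampler is not None else None
        g = grad_fn(theta, batch)                              # mini-batch gradient
        eta = rng.exponential(scale=base_lr * temperature)     # random learning rate
        theta = theta - eta * g                                # SGD update with random step
    return theta

# Toy usage on a non-convex scalar objective f(x) = x^4 - 3x^2.
grad = lambda th, _batch: 4 * th**3 - 6 * th
x_final = poisson_sgd_sketch(grad, theta0=np.array([2.0]), base_lr=0.05)
```

In this reading, the temperature only rescales the typical step size; how the temperature should be tuned so that the stationary distribution concentrates on global optima is what the paper's analysis addresses.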