Poster
Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems
Yujun Kim · Jaeyoung Cha · Chulhee Yun
West Exhibition Hall B2-B3 #W-1020
How quickly do practical optimization algorithms approach the solution when training time is limited? Many machine learning models are trained with stochastic gradient descent (SGD), which improves performance by gradually adjusting model parameters. A practical variation called permutation-based SGD, which processes the training data in a shuffled order, is known to converge faster than uniform-sampling SGD, which selects a random sample at each step. However, these benefits typically appear only when training is sufficiently long.

We ask what happens when training is short, corresponding to the common case where the computational budget is limited. To explore this, we study a simple instance of permutation-based SGD called Incremental Gradient Descent (IGD), which repeatedly processes the data in the natural order given by the dataset. We find that in short training scenarios, IGD can be much slower than expected, even on simple problems, and that as the problem becomes more ill-conditioned, performance can degrade further.

These results show that training methods that work well over long training horizons may behave very differently, depending on problem difficulty, when time is limited. This has important implications for connecting theory and practice under constrained computational budgets.
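The three sampling schemes compared above differ only in the order in which training examples are visited. The sketch below (not taken from the paper; all function names and the toy least-squares objective are illustrative assumptions) shows that difference concretely: incremental order, reshuffled order, and i.i.d. uniform sampling.

```python
# Minimal sketch, assuming a toy least-squares problem
# f(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2.
# Only the index order changes between the three schemes.
import numpy as np

def component_grad(w, X, y, i):
    """Gradient of the i-th component f_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    x = X[i]
    return (x @ w - y[i]) * x

def run_epochs(X, y, scheme, epochs=3, lr=0.01, seed=0):
    """One pass over the data per epoch; `scheme` sets the visiting order."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        if scheme == "incremental":   # IGD: fixed natural order 0, 1, ..., n-1
            order = np.arange(n)
        elif scheme == "shuffled":    # permutation-based SGD: reshuffle each epoch
            order = rng.permutation(n)
        elif scheme == "uniform":     # uniform-sampling SGD: i.i.d. indices
            order = rng.integers(0, n, size=n)
        else:
            raise ValueError(scheme)
        for i in order:
            w -= lr * component_grad(w, X, y, i)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 200, 10
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star
    for scheme in ("incremental", "shuffled", "uniform"):
        w = run_epochs(X, y, scheme, epochs=3)
        print(f"{scheme:>12}: squared error {np.sum((w - w_star) ** 2):.4f}")
```

The paper's analysis concerns the regime where `epochs` is small and the problem is ill-conditioned; this toy script only illustrates how the sampling orders differ, not the paper's theoretical results.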