### Oral

## Oral 5B Optimization 2

##### Hall A1

**On the Last-Iterate Convergence of Shuffling Gradient Methods**

Zijian Liu · Zhengyuan Zhou

Shuffling gradient methods are widely used in modern machine learning tasks and include three popular implementations: Random Reshuffle (RR), Shuffle Once (SO), and Incremental Gradient (IG). Compared to the empirical success, the theoretical guarantee of shuffling gradient methods was not well-understood for a long time. Until recently, the convergence rates had just been established for the average iterate for convex functions and the last iterate for strongly convex problems (using squared distance as the metric). However, when using the function value gap as the convergence criterion, existing theories cannot interpret the good performance of the last iterate in different settings (e.g., constrained optimization). To bridge this gap between practice and theory, we prove the first last-iterate convergence rates for shuffling gradient methods with respect to the objective value even without strong convexity. Our new results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.

**Multiplicative Weights Update, Area Convexity and Random Coordinate Descent for Densest Subgraph Problems**

Ta Duy Nguyen · Alina Ene

We study the densest subgraph problem and give algorithms via multiplicative weights update and area convexity that converge in $O\left(\frac{\log m}{\epsilon^{2}}\right)$ and $O\left(\frac{\log m}{\epsilon}\right)$ iterations, respectively, both with nearly-linear time per iteration. Compared with the work by Bahmani et al. (2014), our MWU algorithm uses a very different and much simpler procedure for recovering the dense subgraph from the fractional solution and does not employ a binary search. Compared with the work by Boob et al. (2019), our algorithm via area convexity improves the iteration complexity by a factor $\Delta$---the maximum degree in the graph, and matches the fastest theoretical runtime currently known via flows (Chekuri et al., 2022) in total time. Next, we study the dense subgraph decomposition problem and give the first practical iterative algorithm with linear convergence rate $O\left(mn\log\frac{1}{\epsilon}\right)$ via accelerated random coordinate descent. This significantly improves over $O\left(\frac{m\sqrt{mn\Delta}}{\epsilon}\right)$ time of the FISTA-based algorithm by Harb et al. (2022). In the high precision regime $\epsilon\ll\frac{1}{n}$ where we can even recover the exact solution, our algorithm has a total runtime of $O\left(mn\log n\right)$, matching the state of the art exact algorithm via parametric flows (Gallo et al., 1989). Empirically, we show that this algorithm is very practical and scales to very large graphs, and its performance is competitive with widely used methods that have significantly weaker theoretical guarantees.

**High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise**

Eduard Gorbunov · Abdurakhmon Sadiev · Marina Danilova · Samuel Horváth · Gauthier Gidel · Pavel Dvurechenskii · Alexander Gasnikov · Peter Richtarik

High-probability analysis of stochastic first-order optimization methods under mild assumptions on the noise has been gaining a lot of attention in recent years. Typically, gradient clipping is one of the key algorithmic ingredients to derive good high-probability guarantees when the noise is heavy-tailed. However, if implemented naively, clipping can spoil the convergence of the popular methods for composite and distributed optimization (Prox-SGD/Parallel SGD) even in the absence of any noise. Due to this reason, many works on high-probability analysis consider only unconstrained non-distributed problems, and the existing results for composite/distributed problems do not include some important special cases (like strongly convex problems) and are not optimal. To address this issue, we propose new stochastic methods for composite and distributed optimization based on the clipping of stochastic gradient differences and prove tight high-probability convergence results (including nearly optimal ones) for the new methods. In addition, we also develop new methods for composite and distributed variational inequalities and analyze the high-probability convergence of these methods.

**Information Complexity of Stochastic Convex Optimization: Applications to Generalization, Memorization, and Tracing**

Idan Attias · Gintare Karolina Dziugaite · Mahdi Haghifam · Roi Livni · Daniel Roy

In this work, we investigate the interplay between memorization and learning in the context of *stochastic convex optimization* (SCO). We define memorization via the information a learning algorithm reveals about its training data points. We then quantify this information using the framework of conditional mutual information (CMI) proposed by Steinke and Zakynthinou (2020). Our main result is a precise characterization of the tradeoff between the accuracy of a learning algorithm and its CMI, answering an open question posed by Livni (2023). We show that, in the $L^2$ Lipschitz--bounded setting and under strong convexity, every learner with an excess error $\epsilon$ has CMI bounded below by $\Omega(1/\epsilon^2)$ and $\Omega(1/\epsilon)$, respectively. We further demonstrate the essential role of memorization in learning problems in SCO by designing an adversary capable of accurately identifying a significant fraction of the training samples in specific SCO problems. Finally, we enumerate several implications of our results, such as a limitation of generalization bounds based on CMI and the incompressibility of samples in SCO problems.