Workshop
HiLD: High-dimensional Learning Dynamics Workshop
Courtney Paquette · Zhenyu Liao · Mihai Nica · Elliot Paquette · Andrew Saxe · Rene Vidal
Meeting Room 315
Modern applications of machine learning seek to extract insights from high-dimensional datasets. The goal of the High-dimensional Learning Dynamics (HiLD) Workshop is to predict and analyze the dynamics of learning algorithms when the numbers of samples and parameters are large. This workshop seeks to spur research and collaboration around:
1. Developing analyzable models and dynamics to explain observed deep neural network phenomena;
2. Creating mathematical frameworks for scaling limits of neural network dynamics as width and depth grow, which often defy low-dimensional geometric intuitions;
3. The role of overparameterization, how it leads to conserved quantities in the dynamics and to the emergence of geometric invariants, with links to Noether's theorem;
4. Provable impacts of the choice of optimization algorithm, hyperparameters, and neural network architectures on training/test dynamics.
The HiLD Workshop aims to bring together experts from classical random matrix theory, optimization, high-dimensional statistics/probability, and statistical physics with experts in machine learning, creating synergies between groups that often do not interact. Through a series of talks, poster sessions, and panel discussions, the workshop will tackle questions on the dynamics of learning algorithms at the interface of random matrix theory, high-dimensional statistics, SDEs, and ML.
Schedule
Fri 12:00 p.m. – 12:01 p.m.

Opening Remarks (Remarks)
Fri 12:00 p.m. – 12:45 p.m.

Feature Learning in Two-layer Neural Networks under Structured Data, Murat A. Erdogdu (Plenary Speaker)

Abstract: We study the effect of gradient-based optimization on feature learning in two-layer neural networks. We consider a setting where the number of samples is of the same order as the input dimension and show that, when the input data is isotropic, gradient descent always improves upon the initial random features model in terms of prediction risk, for a certain class of targets. Further leveraging the practical observation that data often contains additional structure, i.e., the input covariance has non-trivial alignment with the target, we prove that the class of learnable targets can be significantly extended, demonstrating a clear separation between kernel methods and two-layer neural networks in this regime.

Murat Erdogdu
Fri 12:45 p.m. – 1:15 p.m.

Contributed Talks 1 (Contributed talks)

Two 15-minute contributed talks in this session.

Mengqi Lou · Zhichao Wang
Fri 1:15 p.m. – 2:15 p.m.

Poster Session / Coffee Break (Break)
Fri 2:15 p.m. – 3:00 p.m.

High-dimensional Optimization in the Age of ChatGPT, Sanjeev Arora (Plenary Speaker)

Abstract: The first half of the talk surveys recent results in analyzing optimization methods in deep learning, specifically how different update methods affect the quality of the solution produced. This study of the "implicit bias of training" has led to some quantification of the effect of normalization, weight decay, and learning rates in diverse architectures, as well as of the efficacy of local updates in a distributed environment ("Local SGD"). The second half of the talk surveys the nascent field of applying optimization ideas to large AI models. The nature and size of these models lead to new phenomena which motivate new research directions, especially related to fine-tuning and in-context learning. Recent results highlight the power of a forward pass of today's large models. The first result shows that, in the context of fine-tuning, zeroth-order optimization (doable with a forward pass only) can be competitive with the usual first-order optimization. The second result shows that in-context learning (i.e., learning during a single forward pass) can be more powerful than hitherto believed, since it allows the LLM's forward pass to even fine-tune a fairly large "baby transformer" hidden inside.

Sanjeev Arora
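The zeroth-order result above turns on the fact that a gradient can be estimated from forward passes alone. A minimal sketch of that idea, assuming an SPSA-style two-point estimator on a toy quadratic loss (the loss, learning rate, and function names are illustrative, not details from the talk):

```python
import random

def loss(w):
    # toy quadratic "training loss" standing in for a model's loss
    return sum((wi - 1.0) ** 2 for wi in w)

def spsa_grad(loss, w, eps=1e-3, rng=random):
    # two forward passes give a stochastic gradient estimate:
    # perturb all coordinates with random +/-1 signs z, then
    # g ~ (L(w + eps*z) - L(w - eps*z)) / (2*eps) * z
    z = [rng.choice((-1.0, 1.0)) for _ in w]
    wp = [wi + eps * zi for wi, zi in zip(w, z)]
    wm = [wi - eps * zi for wi, zi in zip(w, z)]
    scale = (loss(wp) - loss(wm)) / (2 * eps)
    return [scale * zi for zi in z]

random.seed(0)
w = [0.0, 2.0, -1.0]
for _ in range(200):
    g = spsa_grad(loss, w)
    w = [wi - 0.05 * gi for wi, gi in zip(w, g)]
print(loss(w))
```

Because the symmetric difference is exact for quadratics, the estimator is unbiased here; in general it trades gradient accuracy for needing only two forward passes per step.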
Fri 3:00 p.m. – 4:30 p.m.

Lunch (Break)
Fri 4:30 p.m. – 5:15 p.m.

Multi-level theory of neural representations: Capacity of neural manifolds in biological and artificial neural networks, SueYeon Chung (Plenary Speaker)

Abstract: A central goal in neuroscience is to understand how orchestrated computations in the brain arise from the properties of single neurons and networks of such neurons. Answering this question requires theoretical advances that shine a light on the ‘black box’ of representations in neural circuits. In this talk, we will demonstrate theoretical approaches that help describe how cognitive task implementations emerge from the structure in neural populations and from biologically plausible neural networks. We will introduce a new statistical mechanical theory that connects geometric structures that arise from neural responses (i.e., neural manifolds) to the neural representation’s efficiency in implementing a task. In particular, this theory describes how many neural manifolds can be represented (or ‘packed’) in the neural activity space while remaining linearly decodable by a downstream readout neuron. The intuition from this theory is remarkably simple: like a sphere packing problem in physical space, we can encode many neural manifolds into the neural activity space if these manifolds are small and low-dimensional, and vice versa. Next, we will describe how such an approach can, in fact, open the ‘black box’ of distributed neuronal circuits in a range of settings, such as experimental neural datasets and artificial neural networks. In particular, our method overcomes the limitations of traditional dimensionality reduction techniques, as it operates directly on the high-dimensional representations rather than relying on low-dimensionality assumptions for visualization. Furthermore, this method allows for simultaneous multi-level analysis, by measuring geometric properties in neural population data and estimating the amount of task information embedded in the same population. This geometric approach is general and can be used across different brain areas and task modalities, as demonstrated in our work and that of others.
Finally, we will discuss our recent efforts to fully extend this multi-level description of neural populations by (1) investigating how single-neuron properties shape the representation geometry in early sensory areas and (2) understanding how task-implementing neural manifolds emerge in downstream areas in biological and artificial networks. By extending our mathematical toolkit for analyzing representations underlying complex neuronal networks, we hope to contribute to the long-term challenge of understanding the neuronal basis of tasks and behaviors.

SueYeon Chung
Fri 5:15 p.m. – 6:00 p.m.

Contributed Talks 2 (Contributed talks)

Three 15-minute contributed talks in this session.

Simon Du · Wei Huang · Yuandong Tian
Fri 6:00 p.m. – 6:30 p.m.

Coffee Break (Break)
Fri 6:30 p.m. – 7:15 p.m.

A strong implicit bias in SGD dynamics towards much simpler subnetworks through stochastic collapse to invariant sets, Surya Ganguli (Plenary Speaker)

Abstract: We reveal a strong implicit bias of stochastic gradient descent (SGD) that drives initially expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that trap SGD dynamics once entered. We further establish sufficient conditions for stochastic attractivity to these simpler invariant sets based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of SGD noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle points or local maxima of the train loss. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Our analysis also mechanistically explains why early training with large learning rates for extended periods benefits subsequent generalization by promoting stochastic collapse. Finally, we empirically demonstrate the strong effect of stochastic collapse in benchmark architectures and datasets, revealing surprisingly large groups of redundant neurons with identical incoming and outgoing weights after training, due to attractive invariant sets associated with permutation symmetry. Joint work with Feng Chen, Daniel Kunin, and Atushi Yamamura* (* equal authors)

Surya Ganguli
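The permutation-symmetry invariant sets mentioned at the end of the abstract can be seen in a few lines: if two hidden units start with identical incoming and outgoing weights, full-batch gradient descent keeps them identical forever. A toy sketch (the tiny network, data, and step size are made up, not the paper's setup):

```python
import math

def train_step(params, data, lr=0.1):
    # params: list of (w_k, a_k) for hidden units of f(x) = sum_k a_k*tanh(w_k*x),
    # trained with full-batch gradient descent on the mean squared error
    grads = [[0.0, 0.0] for _ in params]
    for x, y in data:
        pred = sum(a * math.tanh(w * x) for w, a in params)
        err = 2.0 * (pred - y) / len(data)
        for k, (w, a) in enumerate(params):
            t = math.tanh(w * x)
            grads[k][0] += err * a * x * (1.0 - t * t)  # dL/dw_k
            grads[k][1] += err * t                      # dL/da_k
    return [(w - lr * gw, a - lr * ga) for (w, a), (gw, ga) in zip(params, grads)]

data = [(-1.0, -0.5), (0.5, 0.3), (2.0, 1.0)]
params = [(0.7, 0.2), (0.7, 0.2)]  # two neurons start identical
for _ in range(100):
    params = train_step(params, data)
print(params[0] == params[1])  # identical neurons stay identical
```

In the paper it is SGD noise that makes such sets attractive; this sketch only verifies that the set of parameter vectors with duplicated neurons is invariant once entered.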
Fri 7:15 p.m. – 8:00 p.m.

Solving overparametrized systems of random equations, Andrea Montanari (Plenary Speaker)

Abstract: Modern machine learning models are often overparametrized: they are rich enough to perfectly interpolate the training data, even if these are pure noise. The optimization landscape of such problems is still poorly understood: while it is expected that sufficiently overparametrized systems are easy to optimize, it has proven challenging to formalize this intuition. I will introduce a simple toy model for these optimization landscapes, and present characterizations of several algorithms for this model. I will then discuss differences and analogies between this model and more realistic neural networks.

Andrea Montanari
Fri 7:59 p.m. – 8:00 p.m.

Closing remarks (Remarks)


Learning to Plan in Multidimensional Stochastic Differential Equations (Poster)

While planning in fully certain dynamic environments modeled by differential equations is well-studied, little is known about learning to plan under uncertainty. This work focuses on fast and provably efficient learning algorithms for unknown multidimensional stochastic differential equations and comprehensively studies the interplay between learning and planning. We present computationally fast algorithms that learn unknown dynamics by planning according to suitably randomized parameter estimates. We prove efficiency by showing that the errors of both learning and planning decay as the square root of time and grow linearly with the number of unknown parameters. To obtain the results, novel theoretical analyses are developed for eigenvalues of perturbed matrices, for bounding ratios of stochastic integrals, and for lower-bounding singular values of (partially) random matrices. The authors expect the framework to be applicable to other relevant problems.

Mohamad Sadegh Shirani Faradonbeh · Mohamad Kazem Shirani Faradonbeh


Elephant Neural Networks: Born to Be a Continual Learner (Poster)

Catastrophic forgetting has remained a great challenge for continual learning for decades. While recent works have proposed effective methods to mitigate this problem, they mainly focus on the algorithmic side. Meanwhile, we do not fully understand what architectural properties of neural networks lead to catastrophic forgetting. This study aims to fill this gap by studying the role of activation functions in the training dynamics of neural networks and their impact on catastrophic forgetting. Our study reveals that, besides sparse representations, the gradient sparsity of activation functions also plays an important role in reducing forgetting. Based on this insight, we propose a new class of activation functions, elephant activation functions, that can generate both sparse representations and sparse gradients. We show that the resilience of neural networks to forgetting can be significantly improved by simply replacing classical activation functions with elephant activation functions.

Qingfeng Lan · Rupam Mahmood
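The two properties the abstract highlights, sparse outputs and sparse gradients, can be illustrated with a generic bump-shaped activation. The exact elephant activation is defined in the paper; the function below is only an assumed stand-in with the same qualitative behavior:

```python
import math

def bump(x, width=1.0, power=4):
    # bump-shaped activation: near 1 inside |x| < width, near 0 outside,
    # so both the activation and its gradient vanish for most inputs
    return 1.0 / (1.0 + abs(x / width) ** power)

def bump_grad(x, width=1.0, power=4, h=1e-5):
    # symmetric finite difference of the activation
    return (bump(x + h, width, power) - bump(x - h, width, power)) / (2 * h)

xs = [i / 10 for i in range(-50, 51)]  # inputs in [-5, 5]
act_sparse = sum(bump(x) < 0.01 for x in xs) / len(xs)
grad_sparse = sum(abs(bump_grad(x)) < 0.01 for x in xs) / len(xs)
print(act_sparse, grad_sparse)
```

By contrast, ReLU has a nonzero gradient on the entire positive half-line, so updates for one task disturb many units used by another.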


Investigating the Edge of Stability Phenomenon in Reinforcement Learning (Poster)

Recent progress has been made in understanding optimisation dynamics in neural networks trained with full-batch gradient descent with momentum, with the uncovering of the edge of stability phenomenon in supervised learning. The edge of stability phenomenon occurs as the leading eigenvalue of the Hessian reaches the divergence threshold of the underlying optimisation algorithm for a quadratic loss, after which it starts oscillating around the threshold, and the loss starts to exhibit local instability but decreases over long time frames. In this work, we explore the edge of stability phenomenon in reinforcement learning (RL), specifically off-policy Q-learning algorithms across a variety of data regimes, from offline to online RL. Our experiments reveal that, despite significant differences to supervised learning, such as non-stationarity of the data distribution and the use of bootstrapping, the edge of stability phenomenon can be present in off-policy deep RL. Unlike supervised learning, however, we observe strong differences depending on the underlying loss, with DQN (using a Huber loss) showing a strong edge of stability effect that we do not observe with C51 (using a cross-entropy loss). Our results suggest that, while neural network structure can lead to optimisation dynamics that transfer between problem domains, certain aspects of deep RL optimisation can differentiate it from domains such as supervised learning.

Rares Iordan · Mihaela Rosca · Marc Deisenroth
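The divergence threshold referred to above is easy to see on a quadratic: gradient descent with learning rate lr is stable exactly when the curvature (leading Hessian eigenvalue) stays below 2/lr. A self-contained illustration, with numbers chosen purely for the example:

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    # gradient descent on f(x) = 0.5 * sharpness * x**2 iterates
    # x <- (1 - lr*sharpness) * x, which is stable iff sharpness <= 2/lr
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x
    return abs(x)

lr = 0.1                            # divergence threshold is 2/lr = 20
print(gd_on_quadratic(19.0, lr))    # |1 - 1.9| = 0.9 < 1: converges
print(gd_on_quadratic(21.0, lr))    # |1 - 2.1| = 1.1 > 1: diverges
```

In deep networks the Hessian is not constant, which is what allows the eigenvalue to hover and oscillate around this threshold rather than simply diverging.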


Deep Neural Networks Extrapolate Cautiously in High Dimensions (Poster)

Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs. Our work aims to reassess this assumption, particularly with regard to neural networks with high-dimensional inputs. We find that as input data becomes increasingly OOD, neural network predictions actually tend to converge towards a constant value, rather than extrapolating in arbitrary ways. Furthermore, this value often closely approximates the optimal input-independent solution that minimizes the training loss, which corresponds to a more cautious prediction for many common machine learning losses. Our empirical investigation suggests that this phenomenon exists across a broad array of datasets, distributional shifts, and loss functions. Furthermore, we study the mechanism responsible for this observed behavior, providing both an empirical and a theoretical analysis.

Katie Kang · Amrith Setlur · Claire Tomlin · Sergey Levine
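For squared error, the "optimal input-independent solution" mentioned above is simply the mean of the training targets, as a small worked example shows (the data is made up for illustration):

```python
def best_constant_mse(ys):
    # for squared error, the constant c minimizing sum_i (c - y_i)^2
    # is the mean of the training targets
    return sum(ys) / len(ys)

ys = [0.0, 1.0, 1.0, 2.0]
c = best_constant_mse(ys)

def mse(pred):
    return sum((pred - y) ** 2 for y in ys) / len(ys)

print(c, mse(c))
```

For cross-entropy the analogous cautious prediction is the empirical class-frequency vector, so in both cases the constant encodes "predict the base rate" rather than an arbitrary extrapolation.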


Implicit regularisation in stochastic gradient descent: from single-objective to two-player games (Poster)

Recent years have seen many insights on deep learning optimisation being brought forward by finding implicit regularisation effects of commonly used gradient-based optimisers. Understanding implicit regularisation can not only shed light on optimisation dynamics, but can also be used to improve performance and stability across problem domains, from supervised learning to two-player games such as Generative Adversarial Networks. An avenue for finding such implicit regularisation effects has been quantifying the discretisation errors of discrete optimisers via continuous-time flows constructed by backward error analysis (BEA). The current usage of BEA is not without limitations, since not all vector fields of continuous-time flows obtained using BEA can be written as a gradient, hindering the construction of modified losses that reveal implicit regularisers. In this work, we provide a novel approach to using BEA, and show how our approach can be used to construct continuous-time flows with vector fields that can be written as gradients. We then use this to find previously unknown implicit regularisation effects, such as those induced by multiple stochastic gradient descent steps while accounting for the exact data batches used in the updates, and in generally differentiable two-player games.

Mihaela Rosca · Marc Deisenroth


How to escape sharp minima (Poster)

Modern machine learning applications have seen remarkable success from optimization algorithms that are designed to find flat minima. Motivated by this paradigm, this work formulates and studies the algorithmic question of how to find flat minima. As an initial effort, this work adopts the trace of the Hessian of the cost function as the measure of flatness, and formally defines the notion of approximate flat minima. Under this notion, we then design algorithms that find approximate flat minima efficiently. For general cost functions, we present a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.

Kwangjun Ahn · Ali Jadbabaie · Suvrit Sra
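The principle that gradients at randomly perturbed iterates reveal flatness can be sketched in a few lines: a symmetric-difference, Hutchinson-style estimator recovers the Hessian trace of a toy quadratic. This only illustrates the underlying idea, not the paper's actual algorithm:

```python
import random

def grad(w):
    # gradient of the toy cost f(w) = 2*w0**2 + 0.5*w1**2,
    # whose Hessian trace (the flatness measure above) is 4 + 1 = 5
    return [4.0 * w[0], 1.0 * w[1]]

def trace_estimate(grad, w, sigma=1e-3, n=4000, rng=random):
    # for z ~ N(0, I): E[(grad(w + s*z) - grad(w - s*z)) . z] / (2*s) = tr(H)
    total = 0.0
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in w]
        gp = grad([wi + sigma * zi for wi, zi in zip(w, z)])
        gm = grad([wi - sigma * zi for wi, zi in zip(w, z)])
        total += sum((p - m) * zi for p, m, zi in zip(gp, gm, z))
    return total / (2 * sigma * n)

random.seed(1)
est = trace_estimate(grad, [0.3, -0.7])
print(est)  # close to 5
```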


Adapting to Gradual Distribution Shifts with Continual Weight Averaging (Poster)

Machine learning models are frequently deployed in settings where the data distribution shifts gradually over time. However, existing methods for adapting to distribution shifts fail to properly leverage this implicit structure. We propose a simple method to continually adapt a model to distribution shift by using the exponential moving average of model weights over discrete timesteps. We refer to our method as Continually and Stochastically Averaging Weights (CSAW). We show that CSAW achieves state-of-the-art performance on the WildTime benchmark of in-the-wild gradual temporal distribution shifts on a variety of datasets across the vision, language, and medical domains, with improvements in both average and worst-case OOD performance: +2.23% accuracy on Yearbook, +2.96% on FMoW, +0.87% on HuffPost, +1.43% on ArXiv, and +0.75% ROC-AUC on MIMIC-Mortality. We analyze the loss landscapes of sequentially finetuned models and show that they exhibit favorable mode connectivity properties which allow for weight averaging.

Jared Fernandez · Saujas Vaduguru · Sanket Vaibhav Mehta · Yonatan Bisk · Emma Strubell
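A minimal sketch of the weight-averaging step, assuming a plain exponential moving average over flattened checkpoints (the decay value and the toy weights are illustrative, not CSAW's actual configuration):

```python
def ema_update(avg_weights, new_weights, decay=0.9):
    # exponential moving average over checkpoints: avg <- decay*avg + (1-decay)*new
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg_weights, new_weights)]

# pretend these are flattened model weights after each timestep's finetuning
checkpoints = [[0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
avg = checkpoints[0]
for w in checkpoints[1:]:
    avg = ema_update(avg, w)
print(avg)
```

Averaging only makes sense when the sequential checkpoints are linearly mode-connected, which is exactly the loss-landscape property the abstract verifies.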


On the Problem of Transferring Learning Trajectories Between Neural Networks (Poster)

Training deep neural networks (DNNs) is computationally expensive, which is especially problematic when performing duplicated training runs, such as for model ensembles or knowledge distillation. Once we have trained one DNN on some dataset, we have its learning trajectory (i.e., a sequence of intermediate parameters during training), which may potentially contain useful information for learning the dataset. However, there has been no attempt to utilize such information from a given learning trajectory for another training run. In this paper, we formulate the problem of transferring a given learning trajectory from one initial parameter to another, called the learning transfer problem, and derive the first algorithm to approximately solve it by matching gradients successively along the trajectory via permutation symmetry. We empirically show that the transferred parameters achieve non-trivial accuracy before any direct training.

Daiki Chijiwa


Neural Collapse in the Intermediate Hidden Layers of Classification Neural Networks (Poster)

Neural Collapse (NC) gives a precise description of the representations of classes in the final hidden layer of classification neural networks. This description provides insights into how these networks learn features and generalize well when trained past zero training error. However, to date, NC has only been studied in the final layer of these networks. In the present paper, we provide the first comprehensive empirical analysis of the emergence of NC in the intermediate hidden layers of these classifiers. We examine a variety of network architectures, activations, and datasets, and demonstrate that some degree of NC typically exists in most of the intermediate hidden layers of the network, where the degree of collapse in any given layer is typically positively correlated with the depth of that layer in the neural network. Moreover, we remark that: (1) almost all of the reduction in intra-class variance in the samples occurs in the shallower layers of the networks, (2) the angular separation between class means increases consistently with hidden layer depth, and (3) simple datasets require only the shallower layers of the networks to fully learn them, while more difficult ones require the entire network. Ultimately, these results provide granular insights into the structural propagation of features through classification neural networks.

Liam Parker
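The degree of collapse in a layer can be quantified as the ratio of within-class variance to between-class variance of that layer's features. A toy computation with made-up features for a "shallow" and a "deep" layer (the metric is a common NC proxy, not necessarily the paper's exact one):

```python
def within_between_ratio(groups):
    # groups: one list of feature vectors per class. Neural collapse drives
    # within-class variance to zero relative to the spread of the class
    # means, so this ratio should shrink with layer depth.
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    mus = [mean(g) for g in groups]
    grand = mean(mus)
    within = sum(sqdist(v, mu) for g, mu in zip(groups, mus) for v in g)
    between = sum(sqdist(mu, grand) for mu in mus)
    return within / between

shallow = [[[0.0, 0.9], [1.0, 0.1]], [[2.0, 2.1], [3.0, 3.2]]]  # noisy features
deep = [[[0.0, 1.0], [0.01, 1.0]], [[2.0, 3.0], [2.0, 3.01]]]   # near-collapsed
print(within_between_ratio(shallow), within_between_ratio(deep))
```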


Flatter, Faster: Scaling Momentum for Optimal Speedup of SGD (Poster)

Optimization algorithms typically show a tradeoff between good generalization and faster training times. While stochastic gradient descent (SGD) offers good generalization, adaptive gradient methods are quicker. Momentum can speed up SGD, but choosing the right hyperparameters can be challenging. We study training dynamics in overparametrized neural networks with SGD, label noise, and momentum. Our findings indicate that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ enhances training speed without compromising generalization. Our architecture-independent framework is built on the assumption of the existence of a manifold of global minimizers, typically present in overparametrized models. We highlight the emergence of two distinct timescales in training dynamics, with maximum acceleration achieved when these timescales coincide, which leads to our proposed scaling limit. Matrix sensing, MLP on Fashion-MNIST, and ResNet-18 on CIFAR-10 experiments confirm the validity of our proposed scaling.

Aditya Cowsik · Tankut Can · Paolo Glorioso
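The proposed scaling can be applied directly when choosing hyperparameters: set the momentum via $1-\beta = c\,\eta^{2/3}$ (the constant c below is an arbitrary illustration, not a value from the paper):

```python
def momentum_from_lr(lr, c=1.0):
    # proposed scaling: 1 - beta = c * lr**(2/3), so beta -> 1 as lr -> 0
    return 1.0 - c * lr ** (2.0 / 3.0)

for lr in (0.1, 0.01, 0.001):
    print(lr, momentum_from_lr(lr))
```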


Which Features are Learned by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression (Poster)

Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of class collapse or feature suppression at test time. We provide the first unified, theoretically rigorous framework to determine which features are learnt by CL. Our analysis indicates that, perhaps surprisingly, the bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly used stochastic gradient methods.

Yihao Xue · Siddharth Joshi · Eric Gan · Pin-Yu Chen · Baharan Mirzasoleiman


An improved residual-based random forest for robust prediction (Poster)

The random forest (RF) model, introduced by Breiman (2001), is robust on data with a low signal-to-noise ratio and is very unlikely to overfit, although RF can deteriorate if mean-shift contamination is present in the training set. This paper introduces a residual-based solution, the penalized weighted random forest (PWRF) method, which modifies the RF model to improve robustness against systematic or trend contamination. The method weights the impact of contamination in the training set based on the squared residual of each training observation, which provides the flexibility to deal with different types of data. The experiments suggest that PWRF provides competitive, if not better, results than two robust RF methods and the original RF method.

Mingyan Li
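A generic residual-based downweighting scheme conveys the idea, though the paper's exact PWRF penalty may differ; observations with large squared residuals get small weights:

```python
def residual_weights(residuals, c=1.0):
    # downweight observations with large squared residuals (a generic robust
    # weighting for illustration; the paper's exact PWRF penalty may differ)
    return [1.0 / (1.0 + (r / c) ** 2) for r in residuals]

res = [0.1, 0.2, 5.0]   # the third point looks like mean-shift contamination
w = residual_weights(res)
print([round(x, 3) for x in w])
```

In a residual-based forest, such weights would then modulate each observation's influence when the trees are refit, limiting the pull of contaminated points.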


How Does Adaptive Optimization Impact Local Neural Network Geometry? (Poster)

Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint explains this improved performance by arguing that adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments on language models, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster optimization. By contrast, SGD (with momentum) biases the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network.

Kaiqi Jiang · Dhruv Malik · Yuanzhi Li


Effects of Overparameterization on Sharpness-Aware Minimization: A Preliminary Investigation (Poster)

Sharpness-aware minimization (SAM) has exhibited its effectiveness in increasing generalization performance. Despite the concurrent empirical and theoretical evidence of its validity, however, there is little work that studies the effect of overparameterization on SAM. In this work, we run experiments to observe how SAM changes its behavior with respect to different settings of overparameterization, including sparsity. We find that SAM provides a larger benefit over stochastic gradient descent (SGD) in overparameterized regimes and low-sparsity settings. Furthermore, we discover that SAM shows different behaviors on large sparse models compared to their small dense counterparts, despite having a similar number of parameters.

Sungbin Shin · Dongyeop Lee · Namhoon Lee


High-dimensional Learning Dynamics of Deep Neural Nets in the Neural Tangent Regime (Poster)

In this paper, built upon recent advances in the high-dimensional characterization of the neural tangent kernel (NTK) with random matrix techniques in \cite{gu2022Lossless}, we derive precise high-dimensional training dynamics for deep and wide neural networks (DNNs). By assuming a Gaussian mixture model for the input features, an exact expression for the high-dimensional training mean squared error (MSE) is derived, as a function of the dimension $p$, the sample size $n$, the statistics of the input features, the number of layers $L$, and the nonlinear activation function in each layer of the network. The theoretical results provide novel insight into the inner mechanism of DNNs, and in particular, into the interplay between activation function, network depth, and feature statistics.

Yongqi Du · Zenan Ling · Robert Qiu · Zhenyu Liao


On the Equivalence Between Implicit and Explicit Neural Networks: A High-dimensional Viewpoint (Poster)

Implicit neural networks have demonstrated remarkable success in various tasks. However, there is a lack of theoretical analysis of the connections and differences between implicit and explicit networks. In this paper, we study high-dimensional implicit neural networks and provide the high-dimensional equivalents of the corresponding conjugate kernels and neural tangent kernels. Building on this, we establish the equivalence between implicit and explicit networks in high dimensions.

Zenan Ling · Zhenyu Liao · Robert Qiu


Hyperparameter Tuning using Loss Landscape (Poster)

Hyperparameter tuning is crucial in training deep neural networks. In this paper, we propose a new method for hyperparameter tuning by (1) measuring multiple metrics related to the structure of the loss landscape, (2) empirically determining the "phase" of the loss landscape, and (3) providing an efficient way of tuning hyperparameters using the phase information. We demonstrate the effectiveness of our method through extensive experiments on network pruning tasks. We show that our method, named TempTuner, achieves significantly lower search time than both conventional grid search and the more advanced sequential model-based Bayesian optimization (SMBO). To the best of our knowledge, this is the first work to apply loss landscape analysis to the novel application of hyperparameter tuning in neural networks.

Jianlong Chen · Qinxue Cao · Yefan Zhou · Konstantin Schürholt · Yaoqing Yang


Sharpness-Aware Minimization Leads to Low-Rank Features (Poster)

Sharpness-aware minimization (SAM) is a recently proposed method that minimizes the sharpness of the training loss of a neural network. While its generalization improvement is well-known and is the primary motivation, we uncover an additional intriguing effect of SAM: a reduction of the feature rank which happens at different layers of a neural network. We show that this low-rank effect occurs very broadly: for different architectures such as fully-connected networks, convolutional networks, and vision transformers, and for different objectives such as regression, classification, and language-image contrastive training. To better understand this phenomenon, we provide a mechanistic understanding of how low-rank features arise in a simple two-layer network. We observe that a significant number of activations gets entirely pruned by SAM, which directly contributes to the rank reduction. We confirm this effect theoretically and check that it can also occur in deep networks, although the overall rank reduction mechanism can be more complex, especially for deep networks with pre-activation skip connections and self-attention layers.

Maksym Andriushchenko · Dara Bahri · Hossein Mobahi · Nicolas Flammarion
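For reference, SAM's two-step update (ascend to the worst-case nearby point, then apply that point's gradient) can be sketched on a toy quadratic; the learning rate, rho, and loss below are illustrative assumptions, not the paper's experimental setup:

```python
import math

def sam_step(w, grad, lr=0.1, rho=0.05):
    # SAM: evaluate the gradient at the adversarially perturbed point
    # w + rho * g / ||g||, then apply that gradient at w
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def grad(w):
    # gradient of the toy loss f(w) = 0.5 * (3*w0**2 + 0.5*w1**2)
    return [3.0 * w[0], 0.5 * w[1]]

w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w, grad)
print(w)
```

The inner ascent step is what penalizes sharp directions; the feature-rank reduction above is a downstream effect of that penalty in real networks.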


Layerwise Linear Mode Connectivity (Poster)

In a federated setup, one performs an aggregation of separate local models multiple times during training in order to obtain a stronger model; most often it is a simple averaging of the parameters. Understanding why averaging works in a non-convex setup, such as federated deep learning, is an open challenge that hinders obtaining highly performant global models. On i.i.d. datasets, federated deep learning with frequent averaging is successful. The common understanding, however, is that during independent training the models drift away from each other, and thus averaging may not work anymore. This can be seen from the perspective of the loss surface of the problem: on a non-convex surface the average can become arbitrarily bad. Assuming local convexity contradicts the empirical evidence showing that high loss barriers exist between models from the very beginning of learning, even when training on the same data. Based on the observation that the learning process evolves differently in different layers, we investigate the barrier between models in a layerwise fashion. Our conjecture is that the barriers preventing successful federated training are caused by a particular layer or group of layers.

Linara Adilova · Asja Fischer · Martin Jaggi
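The loss barrier between two models is typically measured along their linear interpolation. A minimal sketch on a contrived one-parameter loss with two separated minima, where averaging is exactly the worst point:

```python
def barrier(loss, w_a, w_b, points=21):
    # maximum increase of the loss along the straight line between two
    # models, relative to the endpoints' average: the linear mode
    # connectivity barrier
    base = 0.5 * (loss(w_a) + loss(w_b))
    worst = base
    for i in range(points):
        t = i / (points - 1)
        w = [(1 - t) * a + t * b for a, b in zip(w_a, w_b)]
        worst = max(worst, loss(w))
    return worst - base

def loss(w):
    # toy non-convex loss with minima at w = -1 and w = +1
    return min((w[0] - 1.0) ** 2, (w[0] + 1.0) ** 2)

print(barrier(loss, [-1.0], [1.0]))  # the midpoint w = 0 has loss 1
```

A layerwise version would interpolate only one layer's parameters while keeping the others fixed, which is how a barrier can be attributed to a particular layer.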


Does Double Descent Occur in Self-Supervised Learning? (Poster)

Most investigations into double descent have focused on supervised models, while the few works studying self-supervised settings find a surprising lack of the phenomenon. These results imply that double descent may not exist in self-supervised models. We show this empirically using a standard and a linear autoencoder, two previously unstudied settings. The test loss is found to have either a classical U-shape or to monotonically decrease, instead of exhibiting a double-descent curve. We hope that further work on this will help elucidate the theoretical underpinnings of this phenomenon.

Alisia Lupidi · Yonatan Gideoni · Dulhan Jayalath


On the Universality of Linear Recurrences Followed by Nonlinear Projections
(
Poster
)
>
We show that a family of sequence models based on recurrent linear layers (including S4, S5, and the LRU) interleaved with position-wise multi-layer perceptrons (MLPs) can approximate arbitrarily well any sufficiently regular nonlinear sequence-to-sequence map. The main idea behind our result is to see recurrent layers as compression algorithms that can faithfully store information about the input sequence in an inner state before it is processed by the highly expressive MLP. 
Antonio Orvieto · Soham De · Razvan Pascanu · Caglar Gulcehre · Samuel Smith 🔗 


Network Degeneracy as an Indicator of Training Performance: Comparing Finite and Infinite Width Angle Predictions
(
Poster
)
>
Neural networks are powerful functions with widespread use, but the theoretical behaviour of these functions is not fully understood. Creating deep neural networks by stacking many layers has achieved exceptional performance in many applications and contributed to the recent explosion of these methods. Previous works have shown that depth can exponentially increase the expressibility of the network. However, as networks get deeper and deeper, they are more susceptible to becoming degenerate. We observe this degeneracy in the sense that on initialization, inputs tend to become more and more correlated as they travel through the layers of the network. If a network has too many layers, it tends to approximate a (random) constant function, making it effectively incapable of distinguishing between inputs. This seems to affect the training of the network and cause it to perform poorly, as we empirically investigate in this paper. We use a simple algorithm that can accurately predict the level of degeneracy for any given fully connected ReLU network architecture, and demonstrate how the predicted degeneracy relates to training dynamics of the network. We also compare this prediction to predictions derived using infinite width networks. 
Cameron Jakub · Mihai Nica 🔗 


Implicitly Learned Invariance and Equivariance in Linear Regression
(
Poster
)
>
Can deep learning models generalize if their problem's underlying structure is unknown a priori? We analyze this theoretically and empirically in an idealized setting: linear regression with invariant/equivariant data. We prove that linear regression models learn to become invariant/equivariant, with their weights decomposing into a component that respects the symmetry and one that does not. These two components evolve independently over time, with the asymmetric component decaying exponentially given sufficient data. Extending these results to complex systems will be pursued in future work. 
Yonatan Gideoni 🔗 


Latent State Transitions in Training Dynamics
(
Poster
)
>
Neural networks have been shown to go through a number of distinct phases during training. Noted examples include grokking, where the model learns to generalize only after it memorizes the training data, and the induction head phase, where a language model's ability to copy from context dramatically improves. We seek an automated and principled framework for relating phase transitions, or sharp changes in a performance metric, to changes in the neural network itself. To this end, we propose to model changes in the neural network as latent state transitions using the hidden Markov model (HMM). Under this framework, we compute a variety of summary statistics for each training checkpoint and allow the HMM to infer state transitions in an unsupervised manner. We then inspect these inferred state transitions to better understand the phase transitions that neural networks go through during training. By applying the HMM to two grokking tasks and language modeling, we show that the HMM recovers the $L_2$ norm as an important metric for generalization in both grokking tasks, and we relate variations in language modeling and grokking loss across training runs to specific changes in metrics over the neural network's weights.

Michael Hu · Angelica Chen · Naomi Saphra · Kyunghyun Cho 🔗 


Hessian Inertia in Neural Networks
(
Poster
)
>
The Hessian matrix of a neural network provides important insight into the training dynamics. While most works have focused on the eigenvalues of the Hessian matrix, we keep track of the top eigenvectors throughout training. We uncover a surprising phenomenon, which we term ``Hessian inertia'', where the eigenvectors of the Hessian tend not to move much during training. We hypothesize that Hessian inertia is related to feature learning, and show insights through a 2D example. 
Xuchan Bao · Alberto Bietti · Aaron Defazio · Vivien Cabannnes 🔗 


Generalization and Stability of Interpolating Neural Networks with Minimal Width
(
Poster
)
>
We investigate the generalization and optimization properties of shallow neural-network classifiers trained by gradient descent in the interpolating regime. Specifically, in a realizable scenario where model weights can achieve arbitrarily small training error $\epsilon$ and their distance from initialization is $g(\epsilon)$, we demonstrate that gradient descent with $n$ training data achieves training error $O(g(1/T)^2\big/T)$ and generalization error $O(g(1/T)^2\big/n)$ at iteration $T$, provided there are at least $m=\Omega(g(1/T)^4)$ hidden neurons. We then show that our realizable setting encompasses a special case where the data are separable by the model's neural tangent kernel. For this setting and logistic-loss minimization, we prove the training loss decays at a rate of $\tilde O(1/T)$ given a polylogarithmic number of neurons $m=\Omega(\log^4 (T))$. Moreover, with $m=\Omega(\log^{4} (n))$ neurons and $T\approx n$ iterations, we bound the test loss by $\tilde{O}(1/n)$. Our results differ from existing generalization results based on the algorithmic-stability framework, which necessitate polynomial width and yield suboptimal generalization rates. Central to our analysis is the use of a new self-bounded weak-convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Eventually, despite the objective's non-convexity, this leads to convergence and generalization-gap bounds that resemble those found in the convex setting of linear logistic regression. Although our primary focus is on two-layer neural networks, the analysis we present has broad applicability and can be extended to address other non-convex problems, including deep networks.

Hossein Taheri · Christos Thrampoulidis 🔗 


The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold
(
Poster
)
>
We develop information-geometric techniques to analyze the trajectories of the predictions of deep networks during training. By examining the underlying high-dimensional probabilistic models, we reveal that the training process explores an effectively low-dimensional manifold. Networks with a wide range of architectures and sizes, trained using different optimization methods, regularization techniques, data-augmentation techniques, and weight initializations, lie on the same manifold in the prediction space. We study the details of this manifold to find that networks with different architectures follow distinguishable trajectories, but other factors have a minimal influence; larger networks train along a similar manifold as smaller networks, just faster; and networks initialized at very different parts of the prediction space converge to the solution along a similar manifold. 
Jialin Mao · Han Kheng Teoh · Itay Griniasty · Rahul Ramesh · Rubing Yang · Mark Transtrum · James Sethna · Pratik Chaudhari 🔗 


An Adaptive Method for Minimizing Nonnegative Losses
(
Poster
)
>
This paper introduces Nonnegative Gauss-Newton (NGN), an adaptive optimization method that exploits nonnegativity, a common feature of loss functions in machine learning. Utilizing a Gauss-Newton-inspired approximation for nonnegative losses, NGN offers an adaptive stepsize that can automatically warm up and decay while tracking complex loss landscapes. We provide both convergence rates and empirical evaluations, and the results are very promising compared to the classical (stochastic) gradient method, both in theory and in practice. 
Antonio Orvieto · Lin Xiao 🔗 
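One way to see how a Gauss-Newton-style approximation of a nonnegative loss yields an adaptive stepsize is to model $\sqrt{f}$ as locally linear; a regularized step then takes the form $\eta = \gamma / (1 + \gamma\,\|\nabla f\|^2 / (2 f))$, which shrinks automatically when the gradient is large relative to the loss. The sketch below uses that form as an assumption for illustration, not as the paper's exact update rule.

```python
import numpy as np

def ngn_like_step(x, f, grad, gamma=0.5):
    """One Gauss-Newton-style step for a nonnegative loss.

    Assumed (illustrative) stepsize:
        eta = gamma / (1 + gamma * ||grad f||^2 / (2 f(x)))
    """
    fx, g = f(x), grad(x)
    eta = gamma / (1.0 + gamma * np.dot(g, g) / (2.0 * fx + 1e-12))
    return x - eta * g

# Toy nonnegative loss: f(x) = ||x||^2 / 2, grad f(x) = x.
f = lambda x: 0.5 * np.dot(x, x)
grad = lambda x: x

x = np.array([3.0, -2.0])
for _ in range(50):
    x = ngn_like_step(x, f, grad)
print(f(x))  # loss driven toward 0
```

On this quadratic the resulting stepsize is constant, but for generic nonnegative losses it warms up near a minimum (small gradient relative to loss) and decays in steep regions.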


The Marginal Value of Momentum for Small Learning Rate SGD
(
Poster
)
>
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly over both short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks. 
Runzhe Wang · Sadhika Malladi · Tianhao Wang · Kaifeng Lyu · Zhiyuan Li 🔗 


Spectral Evolution and Invariance in Linear-width Neural Networks
(
Poster
)
>
We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to the network width. Empirically, we show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent with small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both the conjugate kernel and the neural tangent kernel. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy-tailed behavior. Simple examples are provided to explain when heavy tails can lead to better generalization. Different spectral properties, such as an invariant bulk, a spike, and a heavy-tailed distribution, correlate with how far the kernels deviate from initialization. To understand this phenomenon better, we show via a toy model that different training strategies can produce different spectral properties. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning. 
Zhichao Wang · Andrew Engel · Anand Sarwate · Ioana Dumitriu · Tony Chiang 🔗 


On the Joint Interaction of Models, Data, and Features
(
Poster
)
>
Learning features from data is one of the defining characteristics of deep learning, but our theoretical understanding of the role features play in deep learning is still rudimentary. To address this, we introduce a new tool, the \emph{interaction tensor}, for empirically analyzing the interaction between data and model through features. With the interaction tensor, we make several key observations about how features are distributed in data and how models with different random seeds learn different features. Based on these observations, we propose a conceptual framework for feature learning. Under this framework, the expected accuracy for a single hypothesis and the agreement for a pair of hypotheses can both be derived in closed form. We demonstrate that the proposed framework can explain empirically observed phenomena, including the recently discovered Generalization Disagreement Equality (GDE), which allows for estimating the generalization error with only unlabeled data. Further, our theory provides an explicit construction of natural data distributions that break the GDE. Thus, we believe this work provides valuable new insight into our understanding of feature learning. 
YiDing Jiang · Christina Baek · Zico Kolter 🔗 


Predictive Sparse Manifold Transform
(
Poster
)
>
We present the Predictive Sparse Manifold Transform (PSMT), a minimalistic, interpretable and biologically plausible framework for learning and predicting natural dynamics. PSMT incorporates two layers: the first, a sparse coding layer, represents the input sequence as sparse coefficients over an overcomplete dictionary, and the second, a manifold learning layer, learns a geometric embedding space that captures topological similarity and dynamic temporal linearity in the sparse coefficients. We apply PSMT to a natural video dataset and evaluate the reconstruction performance with respect to contextual variability, the number of sparse coding basis functions, and the number of training samples. We then interpret the dynamic topological organization in the embedding space. Next, we use PSMT to predict future frames, comparing it with two baseline methods that use a static embedding space. We demonstrate that PSMT with a dynamic embedding space achieves better prediction performance than the static baselines. Our work establishes that PSMT is an efficient unsupervised generative framework for the prediction of future visual stimuli. 
Yujia Xie · Xinhui Li · Vince Calhoun 🔗 


Margin Maximization in Attention Mechanism
(
Poster
)
>
The attention mechanism is a central component of the transformer architecture, which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its non-convex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(X)=v^\top X^\top \mathrm{softmax}(XW^\top p)$, where $X$ is the tokenized input, $v$ is the value weights, $W$ is the key-query weights, and $p$ is a tunable token/prompt. We first prove that running gradient descent on $p$, or equivalently on $W$, converges to a max-margin solution that separates locally-optimal tokens from non-optimal ones. Next, considering the optimization of $v$ and $p$ simultaneously with the logistic loss, we identify conditions under which the regularization paths converge to their respective max-margin solutions, where $v$ separates the input features based on their labels. Finally, we verify our results through numerical experiments. 
Davoud Ataee Tarzanagh · Yingcong Li · Xuechen Zhang · Samet Oymak 🔗 
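The softmax-attention model stated in the abstract above is simple enough to write out directly. Below is a minimal numpy sketch of $f(X)=v^\top X^\top \mathrm{softmax}(XW^\top p)$; all shapes and random values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_score(X, v, W, p):
    """f(X) = v^T X^T softmax(X W^T p).

    X: (T, d) tokenized input, v: (d,) value weights,
    W: (d, d) key-query weights, p: (d,) tunable token/prompt.
    """
    a = softmax(X @ W.T @ p)  # attention weights over the T tokens
    return v @ (X.T @ a)      # attention-weighted token average, projected on v

rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
v, p = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d))
print(attention_score(X, v, W, p))
```

Gradient descent on $p$ (or $W$) in this model is what the abstract shows converges to a token-separating max-margin solution.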


Supervised-Contrastive Loss Learns Orthogonal Frames and Batching Matters
(
Poster
)
>
We are motivated by the question of what differences arise in the learning process when optimizing the supervised contrastive loss (SCL) versus the cross-entropy (CE) loss. Our main finding is that the geometry of feature embeddings learned by SCL forms an orthogonal frame (OF) regardless of the number of training examples per class. This is in contrast to the CE loss, for which previous work has shown that it learns embedding geometries that are highly dependent on the class sizes. We arrive at our finding theoretically, by proving that the global minimizers of an unconstrained features model with the SCL loss and entry-wise non-negativity constraints form an OF. We then validate the model's prediction by conducting experiments with standard deep-learning models on benchmark vision datasets. Finally, our analysis and experiments reveal that the batching scheme chosen during SCL training can play a critical role in speeding up convergence to the OF geometry. 
Ganesh Ramachandra Kini · Vala Vakilian · Tina Behnia · Jaidev Gill · Christos Thrampoulidis 🔗 


Characterizing and Improving Transformer Solutions for Dyck Grammars
(
Poster
)
>
Transformer-based models are capable of solving many complex tasks. Prior works formally justified such capabilities by providing a small set of constructions for each task, claiming that Transformers can implement certain classic algorithms. However, do Transformers typically converge to those proposed solutions through common optimization approaches? We tackle this question through the lens of a formal language called Dyck. We prove that even in this simple setup, the set of Transformer solutions is qualitatively rich, and most of them do not match the intuitive constructions provided in prior works. Moreover, our analysis inspires a modified pretraining objective that guides Transformers toward solutions that generalize better to longer Dyck sequences unseen during training. Extensive controlled experiments corroborate our findings. 
Kaiyue Wen · Yuchen Li · Bingbin Liu · Andrej Risteski 🔗 


Benign Overfitting of Two-Layer Neural Networks under Inputs with Intrinsic Dimension
(
Poster
)
>
Contrary to classical statistical theory, machine learning models of enormous size have shown high performance even when they interpolate the data. This phenomenon is called ``benign overfitting'' and has attracted much attention in the theoretical literature. Recent studies have clarified its theoretical basis, mostly in linear models, and there are as yet only a few results for neural networks with feature learning. To address this issue, we theoretically investigate the statistical properties of two-layer neural networks trained by noisy gradient descent in a setting where the inputs have an intrinsic structure of lower dimensionality. We show that when the true model is given by another neural network, the trained network can acquire the intrinsic features of the true model through gradient-based training and eventually achieve benign overfitting. 
Shunta Akiyama · Kazusato Oko · Taiji Suzuki 🔗 


Implicit regularization of multitask learning and finetuning in overparameterized neural networks
(
Poster
)
>
It is common in deep learning to train networks on auxiliary tasks with the expectation that the learning will transfer, at least partially, to another task of interest. In this work, we investigate the inductive biases that result from learning auxiliary tasks, either simultaneously (multitask learning, MTL) or sequentially (pretraining and subsequent finetuning, PT+FT). In the simplified setting of two-layer diagonal linear networks trained with gradient descent, we show that multitask learning is biased to minimize an $\ell_{1,2}$ mixed norm, which incentivizes sharing features across tasks, and that PT+FT minimizes a similar penalty. In simulations of diagonal linear networks, we show that MTL and PT+FT improve sample efficiency to comparable extents compared to single-task learning. In ReLU networks, we find that while MTL still minimizes an analogue of the $\ell_{1,2}$ penalty, PT+FT exhibits a qualitatively different bias that permits learning of features correlated with (but not identical to) those used for the auxiliary task. As a result, we find that in realistic settings, MTL generalizes better when comparatively little data is available for the task of interest, while PT+FT outperforms it when more data is available. Strikingly, the same effect holds for a deep convolutional architecture trained on an image classification task. Our results shed light on the structure leveraged by transfer learning strategies and suggest pathways to improving their performance.

Samuel Lippl · Jack Lindsey 🔗 


Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization
(
Poster
)
>
The reason why overparameterized neural networks generalize remains mysterious. Existing proofs show that common stochastic optimizers tend to converge to flat minimizers of the training loss, and thus a natural and popular explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models, and sharpness minimization algorithms fail to generalize; and (3) perhaps most strikingly, there exist non-generalizing flattest models, but sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization subtly depends on the data distributions and the model architectures, and that sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. This calls for the search for other explanations for the generalization of overparameterized neural networks. 
Kaiyue Wen · Tengyu Ma · Zhiyuan Li 🔗 


The phases of large learning rate gradient descent through effective parameters
(
Poster
)
>
Modern neural networks are undeniably successful. Numerous works study how the curvature of loss landscapes can affect the quality of solutions. In this work, we consider the Hessian matrix during network training with large learning rates, an attractive learning regime that is (in)famously unstable. Through the connection between “well-determined” or “effective” parameters and the performance of neural nets, we study the instabilities of gradient descent and characterise the phases of gradient descent instabilities in these regimes. With a connection to the loss basin, we observe different regimes of Hessian rotation during these instabilities and a general tendency toward solution flattening. 
Lawrence Wang · Stephen Roberts 🔗 


On Privileged and Convergent Bases in Neural Network Representations
(
Poster
)
>
In this study, we investigate whether the representations learned by neural networks possess a privileged and convergent basis. Specifically, we examine the significance of feature directions represented by individual neurons. First, we establish that arbitrary rotations of neural representations cannot be inverted (unlike in linear networks), indicating that they do not exhibit complete rotational invariance. Subsequently, we explore the possibility of multiple bases achieving identical performance. To do this, we compare the bases of networks trained with the same hyperparameters but with varying random initializations. Our study reveals two findings: (1) even in wide networks such as WideResNets, neural networks do not converge to a unique basis; (2) basis correlation increases significantly when a few early layers of the network are frozen identically. Furthermore, we analyze Linear Mode Connectivity, which has been studied as a measure of basis correlation. Our findings give evidence that while Linear Mode Connectivity improves with increased network width, this improvement is not due to an increase in basis correlation. 
Davis Brown · Nikhil Vyas · Yamini Bansal 🔗 


On the Effectiveness of SharpnessAware Minimization with Large Minibatches
(
Poster
)
>
Training with large minibatches can increase hardware utilization and reduce training time. However, recent studies suggest that using large minibatches often yields convergence to sharp minima, leading to poor generalization. In this work, we investigate the effectiveness of sharpness minimization for large-batch training. Specifically, we evaluate the sharpness-aware minimization (SAM) algorithm and compare it to standard stochastic gradient descent (SGD) under fixed step-size settings. We perform an exhaustive grid search to set optimal hyperparameters in this process. As a result, we find that SAM consistently outperforms SGD, but undergoes critical performance degradation in the large-batch training regime. 
Jinseok Chung · Seonghwan Park · Jaeho Lee · Namhoon Lee 🔗 


Fast Test Error Rates for Gradient-based Algorithms on Separable Data
(
Poster
)
>
Recent works have shown that when training a linear predictor over separable data using a gradient method and an exponentially-tailed loss function, the predictor asymptotically converges in direction to the max-margin predictor. Consequently, the predictors asymptotically do not overfit. A recent study revealed that for certain gradient methods, overfitting does not occur even non-asymptotically, by obtaining finite-time generalization bounds for gradient flow, gradient descent (GD), and stochastic GD. In this work, we continue this line of research and obtain finite-time generalization bounds for other first-order methods, namely normalized GD and Nesterov's accelerated GD. Our results show that for these methods, faster training-loss convergence aligns with improved generalization in terms of the test error. 
Puneesh Deora · Bhavya Vasudeva · Vatsal Sharan · Christos Thrampoulidis 🔗 
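Normalized GD on separable data, one of the methods analyzed above, is easy to simulate: each step moves along the negative gradient of an exponentially-tailed loss, rescaled to unit norm. The toy data, margin threshold, and step count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linearly separable data: labels given by a ground-truth direction.
w_star = np.array([1.0, 1.0]) / np.sqrt(2)
X = rng.normal(size=(100, 2))
keep = np.abs(X @ w_star) > 0.3          # enforce a positive margin
X, y = X[keep], np.sign(X @ w_star)[keep]

def exp_loss_grad(w):
    m = y * (X @ w)                      # per-example margins
    return -(np.exp(-m)[:, None] * (y[:, None] * X)).mean(axis=0)

w = np.zeros(2)
for _ in range(500):
    g = exp_loss_grad(w)
    w -= 0.1 * g / (np.linalg.norm(g) + 1e-12)   # normalized GD step

direction = w / np.linalg.norm(w)
print(np.min(y * (X @ direction)))       # minimum normalized margin
```

The minimum margin of the normalized iterate grows toward the max margin of the data, consistent with the convergence-in-direction results the abstract builds on.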


On the Advantage of Lion Compared to signSGD with Momentum
(
Poster
)
>
This paper explores the relationship between the recently introduced Lion optimizer and the similar signSGD with momentum (Signum), focusing on their different gradient averaging mechanisms prior to the sign operator. A precise formula for the averaging mechanisms of Lion and Signum is derived, and a comparison is made in the deterministic convex setting. Empirical investigations highlight the advantage of Lion, a finding further supported by a convergence guarantee in the stochastic setting. Our work suggests that Lion has the ability to effectively balance fast and slow averaging, leading to stable and rapid convergence. 
Alessandro Noiato · Luca Biggio · Antonio Orvieto 🔗 
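The "different gradient averaging mechanisms prior to the sign operator" can be made concrete by writing the two updates side by side, as typically stated in the literature (hyperparameter values below are illustrative):

```python
import numpy as np

def signum_step(w, g, m, lr=0.01, beta=0.9):
    """signSGD with momentum (Signum): sign of a single EMA of gradients."""
    m = beta * m + (1 - beta) * g
    return w - lr * np.sign(m), m

def lion_step(w, g, m, lr=0.01, beta1=0.9, beta2=0.99):
    """Lion: the sign is taken of an interpolation between the momentum
    buffer and the fresh gradient, while the buffer itself is updated
    with a *different* (typically slower) decay rate beta2."""
    update = np.sign(beta1 * m + (1 - beta1) * g)
    m = beta2 * m + (1 - beta2) * g
    return w - lr * update, m
```

With `beta1 == beta2`, a Lion step reduces to a Signum step, so the two optimizers differ only in how they average gradients before the sign, which is exactly the mechanism the abstract analyzes.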


On the Training and Generalization Dynamics of Multi-head Attention
(
Poster
)
>
The training and generalization dynamics of the Transformer's core mechanism, attention, remain under-explored. Moreover, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of learning with overparameterization when training multi-layer perceptrons, we explore the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition. Furthermore, we establish primitive conditions on the initialization that ensure the realizability condition holds. As a specific case study, we demonstrate that these conditions are satisfied for a simple yet representative tokenized mixture model. Overall, the analysis presented is quite general and has the potential to be extended to various data-model and architecture variations. 
Puneesh Deora · Rouzbeh Ghaderi · Hossein Taheri · Christos Thrampoulidis 🔗 


Learning Stochastic Dynamical Systems as an Implicit Regularization with Graph Neural Network
(
Poster
)
>
We provide a stochastic Gumbel Graph Network (SGGN) for the long-term prediction of high-dimensional time series data with complex noise and spatial correlations. Specifically, by training the drift and diffusion terms of a stochastic differential equation (SDE) with a Gumbel Matrix embedding, we can capture both the randomness and the underlying spatial dynamics of the input data. This framework allows us to investigate the implicit regularization effect of noise in the SGGN. First, we derive the difference between the two corresponding loss functions in a small neighborhood of the weights. Then, we employ Kuramoto's model to generate data for comparing the spectral densities of the Hessian matrices of the two loss functions. Moreover, we apply our model to real-world scenarios such as wireless communication data. Experimental results demonstrate that SGGN exhibits better convergence, robustness, and generalization. 
Jin Guo · Ting Gao · Yufu Lan · Peng Zhang · Sikun Yang · Jinqiao Duan 🔗 


Graph Neural Networks Provably Benefit from Structural Information: A Feature Learning Perspective
(
Oral
)
>
Graph neural networks (GNNs) have pioneered advancements in graph representation learning, exhibiting superior feature learning and performance over multi-layer perceptrons (MLPs) when handling graph inputs. However, understanding the feature learning aspect of GNNs is still in its initial stage. This study aims to bridge this gap by investigating the role of graph convolution within the context of feature learning theory in neural networks using gradient descent training. We provide a distinct characterization of signal learning and noise memorization in two-layer graph convolutional networks (GCNs), contrasting them with two-layer convolutional neural networks (CNNs). Our findings reveal that graph convolution significantly augments the benign overfitting regime over the counterpart CNNs, where signal learning surpasses noise memorization, by approximately a factor of $\sqrt{D}^{q-2}$, with $D$ denoting a node's expected degree and $q$ being the power of the ReLU activation function, where $q > 2$. These findings highlight a substantial discrepancy between GNNs and MLPs in terms of feature learning and generalization capacity after gradient descent training, a conclusion further substantiated by our empirical simulations.

Wei Huang · Yuan Cao · Haonan Wang · Xin Cao · Taiji Suzuki 🔗 


Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
(
Oral
)
>
We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with a $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down the convergence rate. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD's dynamics. Along the way, we prove that gradient descent automatically balances student neurons, and use this property to deal with the non-smoothness of the objective function. To prove the convergence rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show that this potential function converges slowly, which implies the slow convergence rate of the loss function.

Weihang Xu · Simon Du 🔗 
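The student-teacher setup above is small enough to simulate directly: an over-parameterized student $\sum_{i=1}^{n}\mathrm{ReLU}(\langle w_i, x\rangle)$ fitting a single-neuron teacher under Gaussian input with square loss. The dimensions, initialization scale, and step count below are illustrative assumptions (and this uses empirical-risk GD rather than the population dynamics of the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_student, N = 5, 4, 2000

w_star = np.zeros(d)
w_star[0] = 1.0                                # teacher neuron
X = rng.normal(size=(N, d))                    # Gaussian inputs
y = np.maximum(X @ w_star, 0.0)                # single-ReLU-neuron target

W = 0.1 * rng.normal(size=(n_student, d))      # over-parameterized student, n >= 2

def loss(W):
    pred = np.maximum(X @ W.T, 0.0).sum(axis=1)
    return 0.5 * np.mean((pred - y) ** 2)

initial_loss = loss(W)
lr = 0.05
for _ in range(300):
    pre = X @ W.T                              # (N, n) pre-activations
    err = np.maximum(pre, 0.0).sum(axis=1) - y # (N,) residuals
    grad = ((err[:, None] * (pre > 0)).T @ X) / N
    W -= lr * grad
print(initial_loss, loss(W))
```

Tracking the loss curve over many more steps (and comparing against an $n=1$ student) is one way to eyeball the polynomial-vs-exponential rate gap the abstract proves.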


Sharp predictions for mini-batched prox-linear iterations in rank-one matrix sensing
(
Oral
)
>
We consider the problem of estimating the factors of a rank-1 matrix with i.i.d. Gaussian, rank-1 measurements that are corrupted by noise. We derive a deterministic, low-dimensional recursion which predicts the error of a mini-batched prox-linear iteration for the non-convex least squares loss associated with this statistical model. Our guarantees are fully non-asymptotic and are accurate for any batch size $m \ge 1$ and a large range of step-sizes. Moreover, and in contrast with previous work, we show that the magnitude of the fluctuations of the empirical iterates around our deterministic predictions adapts to the problem error. Finally, we demonstrate how to use our deterministic predictions to perform hyperparameter tuning (e.g., step-size and mini-batch size selection) without ever running the method. 
Mengqi Lou · Kabir Chandrasekher · Ashwin Pananjady 🔗 


Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
(
Oral
)
>
Transformer architectures have shown impressive performance in multiple research domains and have become the backbone of many neural network models. However, there is limited understanding of how Transformers work. In particular, with a simple predictive loss, how the representation emerges from the gradient training dynamics remains a mystery. In this paper, we analyze the SGD training dynamics of a 1-layer transformer with one self-attention layer plus one decoder layer, for the task of next-token prediction, in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of the underlying inductive bias. More specifically, under the assumptions that (a) there is no positional encoding, (b) the input sequence is long, and (c) the decoder layer learns faster than the self-attention layer, we prove that self-attention acts as a discriminative scanning algorithm: starting from uniform attention, it gradually attends more to key tokens that are distinct for a specific next token to be predicted, and pays less attention to common key tokens that occur across different next tokens. Among distinct tokens, it progressively drops attention weights, following the order of low to high co-occurrence between the key and the query token in the training set. Interestingly, this procedure does not lead to winner-takes-all, but decelerates due to a phase transition that is controllable by the learning rates of the two layers, leaving an (almost) fixed token combination. We verify this scan-and-snap dynamics on synthetic and real-world data (WikiText). 
Yuandong Tian · Yiping Wang · Beidi Chen · Simon Du 🔗 


Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective
(
Oral
)
>
We consider the learning of a single-index target function $f_*: \mathbb{R}^d\to\mathbb{R}$ under spiked covariance data: $f_*(\boldsymbol{x}) = \textstyle\sigma_*(\frac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle)$, $\boldsymbol{x}\overset{\small\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I_d} + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top),$ where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and it depends on the projection of the input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge: $n,d\to\infty, d/n\to\gamma\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be in order for $(i)$ kernel methods, or $(ii)$ a neural network trained with gradient descent, to learn $f_*$? We show that for kernel ridge regression, $\theta = \Omega\big(d^{1-\frac{1}{p}}\big)$ is both sufficient and necessary, whereas for a GD-trained two-layer neural network, $\theta=\Omega\big(d^{1-\frac{1}{k}}\big)$ suffices. Our result demonstrates that both kernel methods and neural networks benefit from low-dimensional structures in the data; moreover, since $k\le p$ by definition, neural networks can adapt to such structure more effectively.

Jimmy Ba · Murat Erdogdu · Taiji Suzuki · Zhichao Wang · Denny Wu 🔗 
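The spiked-covariance data model in the abstract above is straightforward to sample: draw an isotropic Gaussian and add a Gaussian component along the spike direction, so that $\mathrm{Cov}(x)=I_d+\theta\mu\mu^\top$ for unit-norm $\mu$. The dimensions, spike strength, and link function below are illustrative assumptions (the link here is the Hermite polynomial $\mathrm{He}_2(z)=z^2-1$, which has information exponent $k=2$).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, theta = 50, 200, 5.0

mu = np.zeros(d)
mu[0] = 1.0                                  # unit-norm spike (signal) direction
sigma_star = lambda z: z ** 2 - 1            # example link: He2, information exponent k = 2

# x = g + sqrt(theta) * xi * mu with g ~ N(0, I), xi ~ N(0, 1)
# gives Cov(x) = I + theta * mu mu^T for unit-norm mu.
G = rng.normal(size=(n, d))
xi = rng.normal(size=n)
X = G + np.sqrt(theta) * xi[:, None] * mu[None, :]
y = sigma_star((X @ mu) / np.sqrt(1 + theta))

# Empirical variance along the spike should be about 1 + theta.
print(X[:, 0].var())
```

The normalization by $\sqrt{1+\theta}$ makes the argument of $\sigma_*$ standard Gaussian, matching the model's definition of $f_*$.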