

Poster Session in the High-dimensional Learning Dynamics Workshop: The Emergence of Structure and Reasoning

Best Paper Awards

Fri 26 Jul 6:30 a.m. PDT — 6:45 a.m. PDT

Awarded papers:
  1. Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, Patrick Jaillet, A Universal Class of Sharpness-Aware Minimization Algorithms

  2. Derek Lim, Theo Putterman, Robin Walters, Haggai Maron, Stefanie Jegelka, The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof

  3. George Wang, Matthew Farrugia-Roberts, Jesse Hoogland, Liam Carroll, Susan Wei, Daniel Murfet, Loss landscape geometry reveals stagewise development of transformers

Abstract (Paper 1): Recently, there has been a surge of interest in developing optimization algorithms for overparameterized models, as generalization is believed to require algorithms with suitable biases. This interest centers on minimizing the sharpness of the original loss function; the Sharpness-Aware Minimization (SAM) algorithm has proven effective. However, the existing literature focuses on only a few sharpness measures, such as the maximum eigenvalue or trace of the training loss Hessian, which may not yield meaningful insights for non-convex optimization scenarios (e.g., neural networks). Moreover, many sharpness measures are sensitive to parameter invariances in neural networks: for example, they can change significantly under parameter rescaling. Hence, we introduce a new class of sharpness measures, leading to new sharpness-aware objective functions. We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by choosing appropriate hyperparameters. Furthermore, we show that the proposed objective functions explicitly bias optimization towards minimizing their corresponding sharpness measures. Finally, as instances of our general framework, we present Frob-SAM and Det-SAM, designed to minimize the Frobenius norm and the determinant of the Hessian of the training loss, respectively. We also demonstrate the advantages of our general framework through an extensive series of experiments.
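The framework generalizes the SAM update, which perturbs the weights towards a nearby worst-case point before taking a gradient step. Below is a minimal PyTorch sketch of one standard SAM step (Foret et al., 2021), together with a Hutchinson-style estimate of the squared Hessian Frobenius norm, the sharpness measure Frob-SAM targets. This is an illustrative reconstruction, not the authors' code; `model`, `loss_fn`, `rho`, and the probe count are assumed placeholders.

    import torch

    def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
        # 1) Gradient at the current weights w.
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
        # 2) Perturb to the approximate worst-case point w + rho * g / ||g||.
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p.add_(rho * g / norm)
        # 3) The gradient at the perturbed point drives the actual update.
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        with torch.no_grad():  # undo the perturbation before stepping
            for p, g in zip(model.parameters(), grads):
                p.sub_(rho * g / norm)
        optimizer.step()

    def hessian_frob_sq(loss, params, n_samples=10):
        # Hutchinson-style estimate of ||H||_F^2 = tr(H^2) via E_v[||H v||^2]
        # with Rademacher probes v, using Hessian-vector products.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        est = 0.0
        for _ in range(n_samples):
            vs = [torch.randint_like(p, 2).mul_(2).sub_(1) for p in params]
            hvs = torch.autograd.grad(grads, params, grad_outputs=vs,
                                      retain_graph=True)
            est = est + sum(hv.pow(2).sum() for hv in hvs)
        return est / n_samples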

Abstract (Paper 2): Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries: transformations of neural network parameters that do not change the underlying network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization and loss landscapes. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures with reduced parameter-space symmetries. We develop two methods, with some provable guarantees, for modifying standard neural networks to reduce parameter-space symmetries. With these methods, we conduct a comprehensive experimental study across multiple tasks to assess the effect of removing parameter symmetries. Our experiments reveal several interesting observations about the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity and monotonic linear interpolation in our networks without any alignment of weight spaces.
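As a concrete illustration of the kind of symmetry in question (a sketch under assumed layer sizes, not the paper's code): in a ReLU network, scaling the incoming weights and bias of a hidden unit by alpha > 0 and its outgoing weights by 1/alpha leaves the network function unchanged, since relu(alpha*z) = alpha*relu(z) for alpha > 0.

    import torch

    torch.manual_seed(0)
    W1, b1 = torch.randn(16, 8), torch.randn(16)  # hidden layer (illustrative sizes)
    W2, b2 = torch.randn(3, 16), torch.randn(3)   # output layer

    def net(x, W1, b1, W2, b2):
        return torch.relu(x @ W1.T + b1) @ W2.T + b2

    x, alpha, j = torch.randn(5, 8), 2.7, 4  # any input, scale, hidden unit
    W1s, b1s, W2s = W1.clone(), b1.clone(), W2.clone()
    W1s[j] *= alpha      # scale incoming weights of unit j ...
    b1s[j] *= alpha      # ... and its bias by alpha,
    W2s[:, j] /= alpha   # and its outgoing weights by 1/alpha.

    # The two parameter settings implement the same function:
    print(torch.allclose(net(x, W1, b1, W2, b2),
                         net(x, W1s, b1s, W2s, b2), atol=1e-5))  # True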

Abstract (Paper 3): The development of the internal structure of neural networks throughout training occurs in tandem with changes in the local geometry of the population loss. By quantifying the degeneracy of this geometry using the recently proposed Local Learning Coefficient, we show that the training process of a transformer language model can be decomposed into discrete developmental stages. We connect these stages to interpretable shifts in input–output behavior and to developments in internal structure. These findings offer new insights into transformer development and underscore the crucial role of loss landscape geometry in understanding the dynamics of deep learning.
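The Local Learning Coefficient is typically estimated by sampling from a localized, tempered posterior with SGLD and comparing the average sampled loss to the loss at the trained weights: lambda_hat = n*beta*(E[L] - L(w*)) with beta = 1/log n, following Lau et al. (2023). The sketch below is an illustrative simplification of that estimator, not the authors' implementation; `model`, `loss_fn`, `loader`, and all hyperparameters are assumptions, and L_n(w*) is approximated on a single batch for brevity.

    import copy, math, torch

    def estimate_llc(model, loss_fn, loader, n, steps=500, eps=1e-4, gamma=100.0):
        beta = 1.0 / math.log(n)  # inverse temperature beta* = 1/log n
        w_star = [p.detach().clone() for p in model.parameters()]
        sampler = copy.deepcopy(model)
        losses, data = [], iter(loader)
        for _ in range(steps):
            try:
                x, y = next(data)
            except StopIteration:
                data = iter(loader)
                x, y = next(data)
            loss = loss_fn(sampler(x), y)
            grads = torch.autograd.grad(loss, list(sampler.parameters()))
            with torch.no_grad():
                for p, g, w0 in zip(sampler.parameters(), grads, w_star):
                    # SGLD drift: pull towards low loss (n*beta*g) and
                    # back towards the trained weights w* (gamma term).
                    drift = n * beta * g + gamma * (p - w0)
                    p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
            losses.append(loss.item())
        with torch.no_grad():  # one-batch approximation of L_n(w*)
            x, y = next(iter(loader))
            loss_star = loss_fn(model(x), y).item()
        # lambda_hat = n * beta * (E[L] - L(w*))
        return n * beta * (sum(losses) / len(losses) - loss_star)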
