Tutorials
Probabilistic Numerics — Computation is Machine Learning
Philipp Hennig ⋅ Marvin Pförtner ⋅ Tim Weiland
Unifying Attention and Diffusion with Kan Extension Transformers: Structured Deep Learning with Diagrammatic Backpropagation
Sridhar Mahadevan
Unlearning Data at Scale
Vinith Suriyakumar ⋅ Gautam Kamath ⋅ Ashia Wilson
Diffusion and Flow-Matching: From Memorization to Generalization & Beyond
Mathurin Massias ⋅ Quentin Bertrand
Is numerical optimization theory irrelevant to machine learning practice in 2026?
Mark Schmidt
We are seeing more numerical optimization theory papers published than ever before. These papers often make unrealistic assumptions or propose algorithms that never get adopted. So is all this optimization theory largely useless?
In this tutorial I show how some surprisingly simple optimization ideas can explain a wide variety of the implementation choices we make when training modern deep learning models. Some of these ideas might have let us skip some generations of grad-student descent, or have led to state-of-the-art tricks in modern architectures. On the other hand, I will highlight how some important practical ideas are not explained by optimization theory and where we can go from here.
Here is a list of keywords to get you (and your LLM sidekick) interested in attending: Adam and [*]A[*]d[*]a[*]m[*], Muon and its friends/enemies, critical-ish batch size, the RMSnorm and skip connection love affair, dead ReLUs and living SwiGLU, Schedule-Free and WSD and muP and max_grad_norm = 1.0, variance reduction and shuffle=True, and maybe edge-of-stability/catapults/feature-learning. I may also tell you why your second-order stochastic optimization method did not work.
Adaptive Reasoning in LLMs: From Post-Training to Test-Time Learning
Akhil Arora ⋅ Nouha Dziri
Evaluating and Training LLMs for Math Copilots and Theorem Proving
Simon Frieder ⋅ Philip Vonderlind
Calibration: From Predictions to Decisions, Collaboration, and Alignment
Aaron Roth ⋅ Collina ⋅ Ira Globus-Harris
New Techniques for Sequence Prediction: Spectral Filtering and Preconditioning
Elad Hazan ⋅ Annie Marsden