
Epoch-Wise Double Descent: A Theory of Multi-scale Feature Learning Dynamics
Mohammad Pezeshki · Amartya Mitra · Yoshua Bengio · Guillaume Lajoie

A key challenge in building theoretical foundations for deep learning is the complexity of its optimization dynamics. Such dynamics result from high-dimensional interactions between parameters, leading to non-trivial behaviors. A particularly puzzling phenomenon in this regard is the "double descent" of the generalization error with increasing model complexity (model-wise) or training time (epoch-wise). While model-wise double descent has been a subject of extensive study of late, the origins of the epoch-wise variant are much less clear. To bridge this gap, in this work, we leverage tools from statistical physics to study a simple teacher-student setup that exhibits epoch-wise double descent similar to deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of the generalization error as a function of training time. Crucially, this provides a new mechanistic explanation of epoch-wise double descent, suggesting that it can be attributed to features being learned at different time scales: while a fast-learning feature is being over-fitted, a slower-learning feature starts to fit, resulting in a non-monotonic generalization curve.
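The mechanism can be sketched numerically with a toy model (an illustrative assumption, not the paper's actual teacher-student setup or its closed-form expressions): suppose two feature modes are learned by gradient flow at very different rates, where the fast mode converges to an over-fitted coefficient while the slow mode converges cleanly to the true one. The resulting test-error curve descends, rises as the fast feature over-fits, and descends again as the slow feature is learned.

```python
import numpy as np

# Hypothetical two-mode toy (rates and coefficients chosen for illustration):
# - fast mode: learning rate 1.0, converges to 1.5 (over-fitted; true value 1.0)
# - slow mode: learning rate 0.01, converges to the true value 1.0
t = np.linspace(0, 600, 60001)
w_fast = 1.5 * (1 - np.exp(-1.0 * t))   # overshoots the true coefficient
w_slow = 1.0 * (1 - np.exp(-0.01 * t))  # learns slowly but correctly

# Test error: squared distance of each mode from its true coefficient.
test_error = (w_fast - 1.0) ** 2 + (w_slow - 1.0) ** 2
```

Plotting `test_error` against `t` (log scale) shows the non-monotonic shape: an early dip when the fast mode crosses its true value, a bump as it settles at the over-fitted value, and a final descent on the slow mode's time scale.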

#### Author Information

##### Amartya Mitra (University of California, Riverside / Mila)

I lead the Methods & Engineering group at Capgemini America, where my work focuses on incorporating state-of-the-art ML methodologies into financial services. Prior to this, I received my doctorate in theoretical physics from UC Riverside, where I worked on game optimization and generalization dynamics.