

Poster in Workshop: Over-parameterization: Pitfalls and Opportunities

Epoch-Wise Double Descent: A Theory of Multi-scale Feature Learning Dynamics

Mohammad Pezeshki · Amartya Mitra · Yoshua Bengio · Guillaume Lajoie


Abstract:

A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks. These dynamics result from high-dimensional interactions between parameters, leading to non-trivial behaviors. In this regard, a particularly puzzling phenomenon is the "double descent" of the generalization error with increasing model complexity (model-wise) or training time (epoch-wise). While model-wise double descent has been the subject of extensive recent study, the origins of the epoch-wise variant are much less clear. To bridge this gap, in this work we leverage tools from statistical physics to study a simple teacher-student setup that exhibits epoch-wise double descent similar to deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of the generalization error as a function of training time. Crucially, this provides a new mechanistic explanation of epoch-wise double descent, suggesting that it can be attributed to features being learned at different time scales. In summary, while a fast-learning feature overfits, a slower-learning feature only starts to fit later, resulting in a non-monotonic generalization curve.
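To make the proposed mechanism concrete, the following is a minimal sketch, not the paper's analytical setup: a linear teacher-student model with two feature groups of different scales, trained by full-batch gradient descent. The dimensions, feature scales, noise level, and learning rate are illustrative assumptions; for suitable choices, the fast (large-scale) features fit the noisy training targets early while the slow (small-scale) features are learned later, and the tracked test error can be non-monotonic in training time.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# linear teacher-student with two feature groups of different scales,
# trained by full-batch gradient descent on the mean-squared error.
rng = np.random.default_rng(0)

n_train, n_test = 50, 2000
d_fast, d_slow = 30, 30
scale_fast, scale_slow = 5.0, 0.5   # larger-scale features are learned faster
noise_std = 0.5

def sample_inputs(n):
    # Inputs consist of a fast (large-scale) and a slow (small-scale) feature group.
    x_fast = scale_fast * rng.standard_normal((n, d_fast))
    x_slow = scale_slow * rng.standard_normal((n, d_slow))
    return np.hstack([x_fast, x_slow])

# Teacher weights: the informative signal lies mostly in the slow features.
w_teacher = np.concatenate([0.1 * np.ones(d_fast), 1.0 * np.ones(d_slow)])

X_train, X_test = sample_inputs(n_train), sample_inputs(n_test)
y_train = X_train @ w_teacher + noise_std * rng.standard_normal(n_train)
y_test = X_test @ w_teacher                # noiseless targets for generalization error

w = np.zeros(d_fast + d_slow)              # student initialized at zero
lr, epochs = 1e-4, 20000
test_errors = []
for _ in range(epochs):
    grad = X_train.T @ (X_train @ w - y_train) / n_train
    w -= lr * grad
    test_errors.append(np.mean((X_test @ w - y_test) ** 2))

# With a fast, noise-prone feature group and a slower informative one, the
# curve in `test_errors` can rise (the fast features overfit the noise) and
# then fall again as the slow features are learned, i.e. epoch-wise double
# descent, for suitable scales, noise levels, and sample sizes.
```

The two time scales arise purely from the difference in feature scales: gradient descent shrinks the error along large-eigenvalue directions much faster than along small-eigenvalue ones, which is the separation of learning speeds the abstract refers to.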