Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models
Hong Liu · Sang Michael Xie · Zhiyuan Li · Tengyu Ma

Thu Jul 27 07:04 PM -- 07:12 PM (PDT) @ Ballroom A

Language modeling on large-scale datasets improves performance on various downstream tasks. The validation pre-training loss is often used as the evaluation metric for language models, since it tends to be well-correlated with downstream performance (which is itself hard to evaluate comprehensively). Contrary to conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not. We identify three ways to produce models with the same pre-training loss but different downstream performance: continuing pre-training after convergence, increasing the model size, and changing the pre-training algorithm. These experiments demonstrate the existence of an implicit bias of pre-training algorithms: among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima of the pre-training loss in language models, and we empirically observe a strong correlation between flatness (measured by the trace of the Hessian) and downstream performance among models with the same pre-training loss. We also prove in a synthetic language setting that, among models with the minimal pre-training loss, the flattest model transfers to downstream tasks.
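
To make the flatness measure concrete, the sketch below estimates the trace of the Hessian at two minimizers that reach the same loss value but differ in curvature. It is an illustration only, not the authors' procedure: the abstract does not say how the trace is computed, so the Hutchinson estimator with finite-difference Hessian-vector products is an assumption, and the toy quadratic losses and the names `hvp`, `hutchinson_trace`, `grad_flat`, and `grad_sharp` are hypothetical.

```python
import random


def hvp(grad, x, v, eps=1e-4):
    """Approximate the Hessian-vector product H(x) v by central
    differences of the gradient: (g(x + eps*v) - g(x - eps*v)) / (2*eps)."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus, g_minus = grad(x_plus), grad(x_minus)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]


def hutchinson_trace(grad, x, num_samples=100, seed=0):
    """Hutchinson estimator: tr(H) = E[v^T H v] for random sign vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        Hv = hvp(grad, x, v)
        total += sum(vi * hi for vi, hi in zip(v, Hv))
    return total / num_samples


# Two hypothetical toy losses, both minimized at the origin with loss 0,
# but with very different curvature around the minimum.
def grad_flat(x):
    # Gradient of 0.5 * (0.1 * x0^2 + 0.2 * x1^2); Hessian trace is 0.3.
    return [0.1 * x[0], 0.2 * x[1]]


def grad_sharp(x):
    # Gradient of 0.5 * (10 * x0^2 + 20 * x1^2); Hessian trace is 30.
    return [10.0 * x[0], 20.0 * x[1]]


x_star = [0.0, 0.0]
trace_flat = hutchinson_trace(grad_flat, x_star)
trace_sharp = hutchinson_trace(grad_sharp, x_star)
print(trace_flat, trace_sharp)
```

Both toy minimizers attain the same loss, yet the estimated trace separates the flat one from the sharp one, which is the kind of distinction the abstract argues pre-training loss alone cannot capture. For a real language model, the finite-difference product would typically be replaced by an automatic-differentiation Hessian-vector product.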

Author Information

Hong Liu (Stanford University)
Sang Michael Xie (Stanford University)
Zhiyuan Li (Computer Science Department, Stanford University)
Tengyu Ma (Stanford University)
