Tutorial

Challenges in Language Model Evaluations

Lintang Sutawika · Hailey Schoelkopf

Lehar 1-4
Mon 22 Jul 6:30 a.m. PDT — 8:30 a.m. PDT

Abstract:

The field of machine learning relies on benchmarking and evaluation datasets to track progress accurately and to assess the efficacy of new models and methodologies. For this reason, good evaluation practices and accurate reporting are crucial. However, language models (LMs) not only inherit the challenges previously faced in benchmarking but also introduce a slew of novel considerations that can make proper comparison across models difficult, misleading, or near-impossible. In this tutorial, we aim to bring attendees up to speed on the state of LM evaluation and to highlight current challenges in evaluating language model performance by discussing the fundamental methods commonly used to measure progress in language model research. We will then discuss how these common pitfalls can be addressed and what considerations should be taken into account to strengthen future work, especially as we seek to evaluate ever more complex properties of LMs.
