Tutorial

Challenges in Language Model Evaluations

Lintang Sutawika · Hailey Schoelkopf

Lehar 1-4
Mon 22 Jul 6:30 a.m. PDT — 8:30 a.m. PDT

Abstract:

The field of machine learning relies on benchmarking and evaluation datasets to track progress accurately and to assess the efficacy of new models and methodologies. For this reason, good evaluation practices and accurate reporting are crucial. However, language models (LMs) not only inherit the challenges previously faced in benchmarking but also introduce a slew of novel considerations that can make proper comparison across models difficult, misleading, or near-impossible. In this tutorial, we aim to bring attendees up to speed on the state of LM evaluation and to highlight current challenges in evaluating language model performance by discussing the fundamental methods commonly used to measure progress in language model research. We will then discuss how these common pitfalls can be addressed and what considerations should be taken into account to strengthen future work, especially as we seek to evaluate ever more complex properties of LMs.
