Challenges and Considerations in Language Model Evaluation
NLP and machine learning rely on benchmarks and evaluation to track progress in the field and to assess the efficacy of new models and methodologies. For this reason, good evaluation practices and accurate reporting are crucial. However, language models (LMs) not only inherit the challenges previously faced in benchmarking but also introduce a slew of novel considerations that can make proper comparison across models difficult, misleading, or near-impossible. In this talk, we will survey the state of language model evaluation and highlight current challenges in measuring LM performance, covering the evaluation methods, tasks, and benchmarks commonly used to track progress in language model research. We will then discuss how these common pitfalls can be addressed and what considerations should guide future work.
Lintang Sutawika
Researcher at EleutherAI
Lintang Sutawika (he/him) is a Researcher at EleutherAI and an incoming PhD student at Carnegie Mellon University. His research interests center on making language technologies more capable, interpretable, and ultimately safe and useful. His work spans understanding how language models work and developing novel methods to expand their capabilities, including the Pythia suite of open language models, inducing zero-shot capabilities through multitask finetuning, studying model training dynamics, and extending models to other languages. He is also a core maintainer of EleutherAI’s LM Evaluation Harness, a framework for language model evaluation.
Outline:
- A Key Challenge in LM Evaluation
- What do we want to evaluate?
- LM-Specific Complications
- Evaluating Models vs. Systems
- Life of a Benchmark
- Overfitting
- Addressing Evaluation Pitfalls