There is a curious trend in machine learning (ML): researchers developing the most capable large language models (LLMs) increasingly evaluate them using manual methods such as red teaming. In red teaming, researchers hire workers to interact with the LLM and manually try to break it in some way. Similarly, some users (including myself) pick their preferred LLM assistant by manually trying out various models – checking each LLM’s “vibe”. Given that LLM researchers and users both actively seek to automate all sorts of other tasks, red teaming and vibe checks are surprisingly manual evaluation processes. This trend towards manual evaluation hints at fundamental problems that prevent more automatic evaluation methods, such as benchmarks, from being used effectively for LLMs (Ganguli et al., 2023; La Malfa et al., 2023). In this blog post, I aim to give an illustrated overview of the problems preventing LLM benchmarks from being a fully satisfactory alternative to more manual approaches.
...