Disclaimer: This post collects some notes from exploring language model evaluation approaches in early 2023. It will (likely) be outdated by the time you read this. Improvement suggestions are welcome and can be sent to contact (at) arduin.io.
1. Introduction
Language models (LMs) are notoriously difficult to evaluate. Modern LMs are used for a wide variety of complex downstream tasks, such as text translation or conversation. This diversity of tasks means that no single metric can capture overall language model performance. Even for individual tasks, it can be difficult to design well-constructed benchmarks to measure performance. Some benchmarks have been shown to allow models to achieve top performance without truly mastering the underlying task (Niven & Kao, 2019; Zellers et al., 2019): unintended statistical biases in the training data can enable superficial learning based on (basically) just word counts, without solving the intended, more complex linguistic task. Yet, as LMs become ever more capable and see real-world deployment, reliable and interpretable evaluation methods only become more important.
In this post I first go through a few examples of current standard benchmarks in natural language processing (NLP) to illustrate the status quo of LM evaluation. In the second part, I consider some novel approaches that have recently been proposed to better capture modern models’ capabilities.
2. Standard benchmarks
There is a vast collection of NLP benchmarks. To narrow down the scope of this post, I focus on the benchmarks used to evaluate InstructGPT (Ouyang et al., 2022) (closely related to OpenAI’s ChatGPT models) as illustrative examples. The aim behind InstructGPT (and related large language models) is to provide a multi-purpose model that can be used for a diverse set of downstream tasks. This diversity is also reflected in the kind of benchmarks these models are tested on. In this post, I only consider examples of scalable evaluation methods that do not require direct human feedback.
Standard benchmarks can be split into two categories: task-specific and alignment benchmarks. Task-specific benchmarks attempt to evaluate the model’s ability to perform specific tasks, such as translation or discrete reasoning. Each benchmark typically covers one such task. Alignment benchmarks cover additional characteristics we would like our LM output to have, such as not being racist or biased. These additional characteristics can be more difficult to rigorously define and test for. Nevertheless, they are very important for deploying LMs safely. The figure below provides an overview of standard NLP benchmarks.
Task-specific benchmarks
Below is a list of some of the task-specific benchmarks used for InstructGPT.
SQuAD (Rajpurkar et al., 2018)
The Stanford Question Answering Dataset (SQuAD) v2.0 consists of a large corpus of short text excerpts from Wikipedia articles with more than 100,000 corresponding questions and answers created by crowdworkers. The original SQuAD v1.0 (Rajpurkar et al., 2016) was extended in v2.0 with unanswerable questions. See example questions from both papers below.
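SQuAD answers are conventionally scored with exact match and token-level F1 against the reference answers. The snippet below is a minimal sketch of this scoring logic, not the official evaluation script; the normalisation steps (lowercasing, dropping punctuation and articles) follow common practice.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # For unanswerable questions, both sides should be empty.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))          # True after normalisation
print(round(token_f1("Eiffel Tower in Paris", "the Eiffel Tower"), 2))  # partial credit (~0.67)
```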
DROP (Dua et al., 2019)
The DROP dataset aims to assess a model’s ability to perform Discrete Reasoning Over Paragraphs (DROP). The dataset consists of 96,000 questions. Given a passage of text, the model is asked a question that requires reasoning with discrete operations such as addition, counting, or sorting. See example questions below.
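To make this concrete, here is a hypothetical DROP-style item (made up for illustration, not taken from the dataset): answering it requires extracting two numbers from the passage and subtracting them rather than copying a text span.

```python
# A made-up DROP-style item: the answer is not a span of the passage
# but the result of a discrete operation (here, subtraction).
item = {
    "passage": "The home team scored 24 points in the first half "
               "and finished the game with 51 points.",
    "question": "How many points did the home team score in the second half?",
    "answer": 27,
}

# A model has to extract 51 and 24 and compute 51 - 24 to be correct.
predicted = 51 - 24
assert predicted == item["answer"]
```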
HellaSwag (Zellers et al., 2019)
The HellaSwag dataset consists of 70,000 questions that attempt to evaluate a model’s commonsense natural language inference ability. Given a text fragment referred to as the context, the model is asked to select the most plausible of multiple possible endings. HellaSwag is an extension of the SWAG dataset by Zellers et al. (2018). In addition to the video-captioning text data used in SWAG, HellaSwag adds text data from WikiHow. Both datasets use adversarial filtering to address the issue of annotation artifacts, where human labellers leave unintended cues in the wrong answer options. During data generation, a generator and a discriminator model are used jointly to filter out endings that are too easy to distinguish from the real one, as sketched below. Through this iterative process, the dataset provides a template for creating more challenging datasets for ever more capable language models.
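The adversarial filtering loop can be sketched roughly as follows. This is my simplified reading of the procedure, with hypothetical generate_endings and train_discriminator helpers standing in for the actual models; the pipeline in the SWAG/HellaSwag papers additionally uses ensembles of discriminators and human validation.

```python
import random

def adversarial_filtering(contexts, generate_endings, train_discriminator,
                          num_rounds=5, endings_per_context=8):
    """Simplified sketch of adversarial filtering (AF).

    For each context we keep a pool of machine-written wrong endings and
    repeatedly replace those that a freshly trained discriminator finds
    easy to spot, so that only hard-to-detect endings survive.
    """
    pools = {c: generate_endings(c, endings_per_context) for c in contexts}

    for _ in range(num_rounds):
        # Train a discriminator on a random half of the current data.
        train_contexts = random.sample(contexts, k=len(contexts) // 2)
        discriminator = train_discriminator({c: pools[c] for c in train_contexts})

        # On held-out contexts, replace endings the discriminator
        # confidently flags as machine-written.
        for c in contexts:
            if c in train_contexts:
                continue
            kept = [e for e in pools[c] if discriminator(c, e) < 0.5]
            missing = endings_per_context - len(kept)
            pools[c] = kept + generate_endings(c, missing)

    return pools
```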
WMT 2015 French to English translation (Bojar et al., 2015)
This dataset challenges a model to translate French text into English. In the original workshop, models were evaluated by human annotators making pairwise comparisons between translations of the same text by different systems. For the evaluation of InstructGPT, the automatic translation quality metric BLEU (Papineni et al., 2002) was used instead of human evaluation.
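BLEU measures n-gram overlap between a candidate translation and one or more reference translations. Below is a minimal example using NLTK’s implementation (assuming the nltk package is installed); InstructGPT’s exact setup likely differs, for example in tokenisation and in using corpus-level rather than sentence-level scoring.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is sitting on the mat".split()

# sentence_bleu expects a list of references (each a token list) and one candidate.
score = sentence_bleu(
    [reference],
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoids zero scores for short texts
)
print(f"BLEU: {score:.3f}")
```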
Alignment benchmarks
In addition to solving a primary task like creating good autocomplete suggestions, we likely also want to make sure that the generated text is not racist or insulting. A number of dedicated benchmarks were created to address these secondary human alignment considerations.
RealToxicityPrompts (Gehman et al., 2020)
RealToxicityPrompts consists of 100,000 partial sentences that, when completed by an LM, may result in racist, sexist or otherwise toxic language. The prompts are given to the LM in a conditional language generation (or autocomplete) task. The Perspective API is used to evaluate the toxicity of the completions. The use of this external API as a black-box evaluation method means that the results may not be reproducible, as the API’s output may change or the API may be discontinued. More generally, as the authors point out, current methods are unable to adapt an LM so that it never says toxic things. This limitation presents serious challenges for the safe deployment of LM-based applications.
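The toxicity scoring itself is a single API call per completion. The sketch below shows roughly what such a call looks like, based on my understanding of the Perspective API’s REST interface at the time of writing; the endpoint, payload shape and response fields are assumptions that may have changed, and an API key is required.

```python
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return the Perspective API's TOXICITY probability for `text`.

    Endpoint and payload shape are assumptions based on the public docs
    and may be out of date.
    """
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    data = response.json()
    return data["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Score an LM completion for a RealToxicityPrompts-style prompt.
print(toxicity_score("You are such a wonderful person!"))
```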
Winogender (Rudinger et al., 2018)
The Winogender dataset aims to detect gender bias in language models. The dataset consists of a number of similar sentences that reference occupations and gendered pronouns. For the evaluation of InstructGPT (as far as I understand it), pairs of sentences that only vary in gender were considered. The relative probability of producing each sentence was then computed, to see, for example, whether the model considers paramedics more likely to be male.
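A simple way to implement this kind of comparison is to score each sentence variant with the model’s log-likelihood and compare the two. Below is a minimal sketch using GPT-2 via Hugging Face transformers as a stand-in (InstructGPT’s actual evaluation setup is not public in this form, and the example sentences are made up in the Winogender style rather than taken from the dataset).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of a sentence under the model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # The loss is the mean negative log-likelihood over the predicted
    # tokens (sequence length - 1), so multiply back to get the total.
    num_predicted = inputs["input_ids"].shape[1] - 1
    return -outputs.loss.item() * num_predicted

male = "The paramedic helped the patient because he was trained for it."
female = "The paramedic helped the patient because she was trained for it."

print(sentence_logprob(male) - sentence_logprob(female))
# A large positive difference suggests the model prefers the male variant.
```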
CrowS-Pairs (Nangia et al., 2020)
The CrowS-Pairs dataset consists of 1508 pairs of sentences, where one sentence in each pair is more stereotyping than the other. The pairs cover various bias categories, including race, religion and age. In the context of InstructGPT, the dataset is used similarly to Winogender: the probability of producing the more stereotyping sentence is computed.
TruthfulQA (Lin et al., 2022)
The TruthfulQA benchmark consists of 817 questions, each crafted such that some humans might answer them incorrectly due to common misconceptions. The authors used both human and automated evaluation to assess the truthfulness of model answers. They demonstrated that a fine-tuned GPT-3 model was 90-96% accurate in detecting the truthfulness of answers given by other models.
In addition to the benchmarks described above, InstructGPT was also evaluated on the following benchmarks: SST (Socher et al., 2013), the RTE and WSC parts of SuperGLUE (Wang et al., 2019), CNN/Daily Mail Summarization (Nallapati et al., 2016), and Reddit TLDR Summarization (Völske et al., 2017). Further, there are many additional benchmarks that were not considered for InstructGPT; for examples, see the list of 200+ benchmarks available in the lm-evaluation-harness by EleutherAI.
3. Novel Benchmarks
LM-generated benchmarks
Perez et al. (2022) introduce evaluation datasets generated with the help of LMs. They create a variety of datasets, from simple yes/no and multiple-choice questions to bias-detecting schemas similar to Winogender (Rudinger et al., 2018). Compared to the conventional approach, they demonstrate a highly efficient dataset generation process. Their approach may influence the generation of many future datasets.
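At a high level, the pipeline prompts one LM to propose evaluation examples and uses another model (or the same one) to filter out low-quality or off-topic examples. Here is a rough sketch with hypothetical generate and quality_score helpers standing in for the LM calls; the actual paper uses more elaborate prompts, few-shot examples, and human validation of a sample of the generated data.

```python
def generate_yes_no_questions(generate, quality_score, behaviour: str,
                              n_target: int = 100, threshold: float = 0.8,
                              max_attempts: int = 10_000):
    """Sketch of LM-written evaluation generation.

    `generate(prompt)` and `quality_score(question)` are hypothetical
    wrappers around LM calls; they are not part of any specific library.
    """
    prompt = (
        f"Write a yes/no question that tests whether an AI assistant "
        f"exhibits the following behaviour: {behaviour}.\nQuestion:"
    )
    dataset, attempts = [], 0
    while len(dataset) < n_target and attempts < max_attempts:
        attempts += 1
        question = generate(prompt).strip()
        # Keep only questions a second model judges as relevant and well-formed.
        if quality_score(question) >= threshold:
            dataset.append({"question": question, "behaviour": behaviour})
    return dataset
```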
Red teaming
Ganguli et al. (2022) share their experience manually red-teaming LMs. They define LM red teaming as “using manual or automated methods to adversarially probe a language model for harmful outputs, and then updating the model to avoid such outputs”. Perez et al. (2022) introduce an automated version of red teaming, where the red team itself is a language model.
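Automated red teaming can be pictured as a simple loop: a red-team LM proposes adversarial prompts, the target LM answers, and a harm classifier flags failures. The sketch below uses hypothetical red_team_generate, target_respond, and harm_score functions as stand-ins for the models; Perez et al. additionally explore zero-/few-shot generation, supervised fine-tuning, and reinforcement learning for the red-team model.

```python
def automated_red_team(red_team_generate, target_respond, harm_score,
                       n_attempts: int = 1000, threshold: float = 0.5):
    """Collect prompts that elicit harmful responses from the target LM.

    All three callables are hypothetical stand-ins for model calls.
    """
    failures = []
    for _ in range(n_attempts):
        attack_prompt = red_team_generate()
        response = target_respond(attack_prompt)
        score = harm_score(attack_prompt, response)
        if score >= threshold:
            failures.append(
                {"prompt": attack_prompt, "response": response, "harm": score}
            )
    # The collected failure cases can then be used to patch or further train the target.
    return failures
```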
4. Conclusion
In this post, I gave an overview of some of the NLP benchmarks used for evaluating modern language models (LMs), such as InstructGPT. Despite best efforts and in many ways impressive performance, the output of current state-of-the-art LMs often fails to satisfy many desirable characteristics, such as truthfulness. Reliably quantifying these issues presents an important step towards building better LMs and developing downstream applications. Adaptive benchmarks, such as HellaSwag with its adversarial filtering (Zellers et al., 2019), are one promising direction for creating benchmarks that co-evolve as models grow more and more powerful.
Acknowledgements
I would like to thank Simon Mathis for pointing me to the excellent paper by Niven & Kao (2019).
References
- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L. & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical Machine Translation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 1–46. Retrieved from https://www.research.ed.ac.uk/en/publications/findings-of-the-2015-workshop-on-statistical-machine-translation
- Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S. & Gardner, M. (2019). DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. https://doi.org/10.48550/arXiv.1903.00161
- Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Hatfield-Dodds, Z., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-Johnson, E., Amodei, D., Brown, T., Joseph, N., McCandlish, S., Olah, C., Kaplan, J. & Clark, J. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. https://doi.org/10.48550/arXiv.2209.07858
- Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://doi.org/10.48550/arXiv.2009.11462
- Lin, S., Hilton, J. & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://doi.org/10.48550/arXiv.2109.07958
- Nallapati, R., Zhou, B., dos Santos, C., Gulcehre, C. & Xiang, B. (2016). Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. https://doi.org/10.48550/arXiv.1602.06023
- Nangia, N., Vania, C., Bhalerao, R. & Bowman, S. (2020). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. https://doi.org/10.48550/arXiv.2010.00133
- Niven, T. & Kao, H. (2019). Probing Neural Network Comprehension of Natural Language Arguments. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1459
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J. & Lowe, R. (2022). Training language models to follow instructions with human feedback. Retrieved from http://arxiv.org/abs/2203.02155
- Papineni, K., Roukos, S., Ward, T. & Zhu, W. (2002). BLEU: a method for automatic evaluation of machine translation.
- Perez, E., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Johnson, E., Khundadze, G., Kernion, J., Landis, J., Kerr, J., Mueller, J., Hyun, J., Landau, J., Ndousse, K., Goldberg, L., Lovitt, L., Lucas, M., Sellitto, M., Zhang, M., Kingsland, N., Elhage, N., Joseph, N., Mercado, N., DasSarma, N., Rausch, O., Larson, R., McCandlish, S., Johnston, S., Kravec, S., Showk, S., Lanham, T., Telleen-Lawton, T., Brown, T., Henighan, T., Hume, T., Bai, Y., Hatfield-Dodds, Z., Clark, J., Bowman, S., Askell, A., Grosse, R., Hernandez, D., Ganguli, D., Hubinger, E., Schiefer, N. & Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. https://doi.org/10.48550/arXiv.2212.09251
- Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N. & Irving, G. (2022). Red Teaming Language Models with Language Models. https://doi.org/10.48550/arXiv.2202.03286
- Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. https://doi.org/10.48550/arXiv.1606.05250
- Rajpurkar, P., Jia, R. & Liang, P. (2018). Know What You Don’t Know: Unanswerable Questions for SQuAD. https://doi.org/10.48550/arXiv.1806.03822
- Rudinger, R., Naradowsky, J., Leonard, B. & Van Durme, B. (2018). Gender Bias in Coreference Resolution. https://doi.org/10.48550/arXiv.1804.09301
- Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A. & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank.
- Völske, M., Potthast, M., Syed, S. & Stein, B. (2017). TL;DR: Mining Reddit to learn automatic summarization.
- Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O. & Bowman, S. (2019). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html
- Zellers, R., Bisk, Y., Schwartz, R. & Choi, Y. (2018). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. https://doi.org/10.48550/arXiv.1808.05326
- Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. & Choi, Y. (2019). HellaSwag: Can a Machine Really Finish Your Sentence?. https://doi.org/10.48550/arXiv.1905.07830
Appendix
A. InstructGPT automatic evaluation results
Whilst ChatGPT’s results on these benchmarks are unknown, we can look at the results of InstructGPT to get a glimpse of the likely capabilities of ChatGPT. The figure below is taken from the InstructGPT paper (Ouyang et al., 2022).
Version history
- v0.3 (2023-11-23): minor corrections and tweaks.
- v0.2 (2023-03-14): major rewrite, including adding section on novel evaluation methods.
- v0.1 (2023-03-06): first public draft.
Citation
If you found this post useful for your work, please consider citing it as:
Findeis, Arduin. (Mar 2023). A short exploration of language model evaluation approaches. Retrieved from https://arduin.io/blog/eval-exploration/.
or
@article{Findeis2023Ashortexplorationoflanguagemodelevaluationapproaches,
title = "A short exploration of language model evaluation approaches",
author = "Findeis, Arduin",
journal = "arduin.io",
year = "2023",
month = "March",
url = "https://arduin.io/blog/eval-exploration/"
}