Disclaimer: This post collects some notes from exploring language model evaluation approaches in early 2023. It will (likely) be outdated by the time you read this. Improvement suggestions are welcome and can be sent to contact (at) arduin.io.

1. Introduction

Language models (LMs) are notoriosly difficult to evaluate. Modern LMs are used for a wide variety of complex downstream tasks, such as text translation or conversation. This diversity of tasks means that no single metric can capture overall language model performance. Even for individual tasks, it can be difficult to come up with well-setup benchmarks to measure performance. Some benchmarks have been shown to allow models to achieve top performance without truly mastering the underlying tasks ( & , & (). Probing Neural Network Comprehension of Natural Language Arguments. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1459 ; et al., , , , & (). HellaSwag: Can a Machine Really Finish Your Sentence?. https://doi.org/10.48550/arXiv.1905.07830 ). Unintended statistical biases in the training data can allow superficial learning using (basically) just word counts – without solving the intended (more complex) linguistic task. Yet, as LMs become ever more capable and see real-world deployment, reliable and interpretable evaluation methods only become more important.

In this post I first go through a few examples of current standard benchmarks in natural language processing (NLP) to illustrate the status quo of LM evaluation. In the second part, I consider some novel approaches that have recently been proposed to better capture modern models’ capabilities.

2. Standard benchmarks

There is a vast collection of NLP benchmarks. To narrow down the scope of this post, I focus on benchmarks used to evaluate InstructGPT ( et al., , , , , , , , , , , , , , , , , , , & (). Training language models to follow instructions with human feedback. Retrieved from http://arxiv.org/abs/2203.02155 ) (closely related to OpenAI’s ChatGPT models) as illustrative examples. The aim behind InstructGPT (and related large language models) is to provide a multi-purpose model that can be used for a diverse set of downstream tasks. This diversity is also reflected in the kind of benchmarks these models are tested on. In this post, I only consider examples of scalable evaluation methods that do not require direct human feedback.

Standard benchmarks can be split into two categories: task-specific and alignment benchmarks. Task-specific benchmarks attempt to evaluate the models ability to perform specific tasks, such as translation or discrete reasoning. Each benchmark typically covers one such task. Alignment benchmarks cover additional text charateristics we would like our LM output to have, such as not being racist or being unbiased. These additional characteristics can be more difficult to rigorously define and test for. Nevertheless, these characteristics are very important for deploying LMs safely. The figure below provides an overview of standard NLP benchmarks.

Overview of NLP benchmarks.

Task-specific benchmarks

Below is a list of some of the task-specific benchmarks used for InstructGPT.

SQuAD
( et al., , & (). Know What You Don’t Know: Unanswerable Questions for SQuAD. https://doi.org/10.48550/arXiv.1806.03822 )
The Stanford Question Answering Dataset (SQuAD) v2.0 consists of a large corpus of small text excerpts from Wikipedia articles (100,000+). Each text has corresponding questions and anwers created by crowdworkers. The original SQuAD v1.0 ( et al., , , & (). SQuAD: 100,000+ Questions for Machine Comprehension of Text. https://doi.org/10.48550/arXiv.1606.05250 ) was improved in v2.0 to include unanswerable questions. See example questions from both papers below.

Answerable questions (v1.0).
Figure from et al. ( , , & (). SQuAD: 100,000+ Questions for Machine Comprehension of Text. https://doi.org/10.48550/arXiv.1606.05250 ).

  

Unanswerable questions (v2.0).
Figure from et al. ( , & (). Know What You Don’t Know: Unanswerable Questions for SQuAD. https://doi.org/10.48550/arXiv.1806.03822 ).

DROP
( et al., , , , , & (). DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. https://doi.org/10.48550/arXiv.1903.00161 )
The DROP dataset aims to assess models’ ability in discrete reasoning over text in a paragraph (DROP). The dataset consists of 96,000 questions. Given a passage of text, the model is given a question that requires reasoning using discrete operations such as addition, counting, or sorting. See example questions below.

Top four reasoning operation example. Figure from et al. ( , , , , & (). DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. https://doi.org/10.48550/arXiv.1903.00161 ).

HellaSwag
( et al., , , , & (). HellaSwag: Can a Machine Really Finish Your Sentence?. https://doi.org/10.48550/arXiv.1905.07830 )
The HellaSwag dataset consists of 70,000 questions attempting to evaluate the commonsense natural language inference ability of a model. Given a text fragment referred to as context, the model is asked to select the most plausible of multiple possible endings. HellaSwag is an extension of the SWAG data set by et al. ( , , & (). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. https://doi.org/10.48550/arXiv.1808.05326 ). In addition to the text data from video captioning used in SWAG, HellaSwag adds text data from WikiHow. Both datasets use adverserial filtering to address the issue of annotation artifacts, where human labellers leave unintended biases in wrong labels. During the data generation a generator and a discriminator model are jointly used to filter out endings that are too easy to distinguish. With this iterative process, this data set provides a template how to come up with more challenging data sets for ever more capable language models.

Example question answered by a BERT model, once correct (blue) and once incorrect (red). The bold text indicates the correct answer. Figure from et al. ( , , , & (). HellaSwag: Can a Machine Really Finish Your Sentence?. https://doi.org/10.48550/arXiv.1905.07830 ).

WMT 2015 French to English translation
( et al., , , , , , , , , , , , , & (). Findings of the 2015 Workshop on Statistical Machine Translation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 1–46. Retrieved from https://www.research.ed.ac.uk/en/publications/findings-of-the-2015-workshop-on-statistical-machine-translation )
A dataset challenges a model to translate from french text to english text. In the original workshop, models were evaluated using human annotators that make a pairwise comparison of the translation of the same text by different systems. For purpose of evaluating InstructGPT instead of the human evaluation the automatic translation quality score BLEU ( et al., , , & (). BLEU: a method for automatic evaluation of machine translation. ) was used.

Screenshot of human evaluation interface with example translation task from Russian to English. Figure from et al. ( , , , , , , , , , , , , & (). Findings of the 2015 Workshop on Statistical Machine Translation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 1–46. Retrieved from https://www.research.ed.ac.uk/en/publications/findings-of-the-2015-workshop-on-statistical-machine-translation ).



Alignment benchmarks

In addition to solving a primary task like creating good autocomplete suggestions, we likely also want to make sure that the generated text is not racist or insulting. A number of dedicated benchmarks were created to address these secondary human alignment considerations.

RealToxicityPrompts
( et al., , , , & (). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://doi.org/10.48550/arXiv.2009.11462 )
RealToxicityPrompts consists of 100,000 partial sentences that, when comleted by an LM, may result in racist, sexist or otherwise toxic language. The prompts are given to the LM in a conditional language generation (or auto complete) task. The Perspective API is used to evaluate the toxicity. The use of this external API as a black-box evaluation method means that the results may not be reproducible - as the API’s output may change or be discontinued. More generally, as the authors point out, current methods are unable to adapt an LM to never say toxic things. This limitations presents serious challenges for the safe deployment of LM-based applications.

Prompts from the RealToxicityPrompts data set. Figure from et al. ( , , , & (). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://doi.org/10.48550/arXiv.2009.11462 ).

Winogender
( et al., , , & (). Gender Bias in Coreference Resolution. https://doi.org/10.48550/arXiv.1804.09301 )
The Winogender data set aims to detect gender bias in language models. The data set consists of a number of similar sentences that reference occupations and their gender. For the evaluation of InstructGPT (as far as I understand it), pairs of the sentences that only vary in gender were considered. Then the relative probability of produce one of the sentences was computed, to see for example if the model thought paramedics are more likely to be male.

Winogender example. Figure from et al. ( , , & (). Gender Bias in Coreference Resolution. https://doi.org/10.48550/arXiv.1804.09301 ).

CrowS-Pairs
( et al., , , & (). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. https://doi.org/10.48550/arXiv.2010.00133 )
The CrowS-Pairs dataset consists of 1508 pairs of sentences, each with one more stereotyping than the other. The pairs cover various bias including race, religion and age. In the context of InstructGPT, the dataset is used similarly to Winogender: the probability of computing the more stereotyping sentence is computed.

CrowS-Pairs. Figure from et al. ( , , & (). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. https://doi.org/10.48550/arXiv.2010.00133 ).

TruthfulQA
( et al., , & (). TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://doi.org/10.48550/arXiv.2109.07958 )
The TruthfulQA benchmark consists of 817 questions, each crafted such that some humans might answer them incorrectly due to misconceptions. The authors used both human and automated evaluation to assess the truthfullness of model answers. They demonstrated that a finetuned GPT-3 model was 90-96% accurate in detecting truthfulness of answers by other models.

TruthfulQA examples. Figure from et al. ( , & (). TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://doi.org/10.48550/arXiv.2109.07958 ).

In addition to the benchmarks described above, InstructGPT was also evaluated on the following benchmarks: SST ( et al., , , , , , & (). Recursive deep models for semantic compositionality over a sentiment treebank. ), RTE and WSC parts of SuperGLUE ( et al., , , , , , , & (). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Curran Associates, Inc.. Retrieved from https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html ), CNN/Daily Mail Summarization ( et al., , , , & (). Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. https://doi.org/10.48550/arXiv.1602.06023 ), and Reddit TLDR Summarization ( et al., , , & (). Tl; dr: Mining reddit to learn automatic summarization. ). Further, there is a large number of additional benchmarks that were not considered for InstructGPT. For examples see the list of 200+ benchmarks available in the lm-evaluation-harness by EleutherAI.

3. Novel Benchmarks

LM-generated benchmarks

et al. ( , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , & (). Discovering Language Model Behaviors with Model-Written Evaluations. https://doi.org/10.48550/arXiv.2212.09251 ) introduce evaluation data sets generated with the help of LMs. They create a variety of data sets, from simple “yes/no” and multiple choice questions to bias detecting schema similar to Winogender ( et al., , , & (). Gender Bias in Coreference Resolution. https://doi.org/10.48550/arXiv.1804.09301 ). Compared to the conventional approach, they demonstrate a highly efficient data set generation process. Their approach may influence the generation of many future data sets.

Red teaming

et al. ( , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , & (). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. https://doi.org/10.48550/arXiv.2209.07858 ) share their experience manually red-teaming LMs. They define LM read teaming as “using manual or automated methods to adversarially probe a language model for harmful outputs, and then updating the model to avoid such outputs”. et al. ( , , , , , , , & (). Red Teaming Language Models with Language Models. https://doi.org/10.48550/arXiv.2202.03286 ) introduce an automated version of red teaming, where the red team itself is a language model.

4. Conclusion

In this post, I gave an overview of some of the NLP benchmarks used for evaluating modern language models (LMs), such as InstructGPT. Despite best efforts and in many ways impressive performance, the output of current state-of-the-art LMs often fails to satisfy many desirable characterics, such as truthfulness. Reliably quantifying these issues presents on important step towards building better LMs and developing downstream applications. Adaptive benchmarks, such as HellaSwag with adverserial filtering ( et al., , , , & (). HellaSwag: Can a Machine Really Finish Your Sentence?. https://doi.org/10.48550/arXiv.1905.07830 ), present one promising direction to creating benchmarks that co-evolve as models grow more and more powerful.

Acknowledgements

I would like to thank Simon Mathis for pointing me to the excellent paper by & ( & (). Probing Neural Network Comprehension of Natural Language Arguments. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1459 ).

References

  1. , , , , , , , , , , , , & (). Findings of the 2015 Workshop on Statistical Machine Translation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 1–46. Retrieved from https://www.research.ed.ac.uk/en/publications/findings-of-the-2015-workshop-on-statistical-machine-translation
  2. , , , , & (). DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. https://doi.org/10.48550/arXiv.1903.00161
  3. , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , & (). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. https://doi.org/10.48550/arXiv.2209.07858
  4. , , , & (). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://doi.org/10.48550/arXiv.2009.11462
  5. , & (). TruthfulQA: Measuring How Models Mimic Human Falsehoods. https://doi.org/10.48550/arXiv.2109.07958
  6. , , , & (). Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. https://doi.org/10.48550/arXiv.1602.06023
  7. , , & (). CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. https://doi.org/10.48550/arXiv.2010.00133
  8. & (). Probing Neural Network Comprehension of Natural Language Arguments. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1459
  9. , , , , , , , , , , , , , , , , , , & (). Training language models to follow instructions with human feedback. Retrieved from http://arxiv.org/abs/2203.02155
  10. , , & (). BLEU: a method for automatic evaluation of machine translation.
  11. , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , & (). Discovering Language Model Behaviors with Model-Written Evaluations. https://doi.org/10.48550/arXiv.2212.09251
  12. , , , , , , , & (). Red Teaming Language Models with Language Models. https://doi.org/10.48550/arXiv.2202.03286
  13. , , & (). SQuAD: 100,000+ Questions for Machine Comprehension of Text. https://doi.org/10.48550/arXiv.1606.05250
  14. , & (). Know What You Don’t Know: Unanswerable Questions for SQuAD. https://doi.org/10.48550/arXiv.1806.03822
  15. , , & (). Gender Bias in Coreference Resolution. https://doi.org/10.48550/arXiv.1804.09301
  16. , , , , , & (). Recursive deep models for semantic compositionality over a sentiment treebank.
  17. , , & (). Tl; dr: Mining reddit to learn automatic summarization.
  18. , , , , , , & (). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Curran Associates, Inc.. Retrieved from https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html
  19. , , & (). SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. https://doi.org/10.48550/arXiv.1808.05326
  20. , , , & (). HellaSwag: Can a Machine Really Finish Your Sentence?. https://doi.org/10.48550/arXiv.1905.07830

Appendix

A. InstructGPT automatic evaluation results

Whilst the ChatGPT results on these benchmarks are unkown, we can look at the results of InstructGPT to get glimpse of the likely capabilities of ChatGPT. The figure below is taken from the InstructGPT paper ( et al., , , , , , , , , , , , , , , , , , , & (). Training language models to follow instructions with human feedback. Retrieved from http://arxiv.org/abs/2203.02155 ).

Results of InstructGPT on public NLP benchmarks. Figure from et al. ( , , , , , , , , , , , , , , , , , , & (). Training language models to follow instructions with human feedback. Retrieved from http://arxiv.org/abs/2203.02155 ).

Version history

  • v0.3 (2023-11-23): minor corrections and tweaks.
  • v0.2 (2023-03-14): major rewrite, including adding section on novel evaluation methods.
  • v0.1 (2023-03-06): first public draft.

Citation

If you found this post useful for your work, please consider citing it as:

Findeis, Arduin. (Mar 2023). A short exploration of language model evaluation approaches. Retrieved from https://arduin.io/blog/eval-exploration/.

or
 @article{Findeis2023Ashortexplorationoflanguagemodelevaluationapproaches,
        title = "A short exploration of language model evaluation approaches",
        author = "Findeis, Arduin",
        journal = "arduin.io",
        year = "2023",
        month = "March",
        url = "https://arduin.io/blog/eval-exploration/"
 }