The benchmark problems reviving manual evaluation

There is a curious trend in machine learning (ML): researchers developing the most capable large language models (LLMs) increasingly evaluate them using manual methods such as red teaming. In red teaming, researchers hire workers to manually try to break the LLM in some form by interacting with it. Similarly, some users (including myself) pick their preferred LLM assistant by manually trying out various models – checking each LLM’s “vibe”. Given that LLM researchers and users both actively seek to automate all sorts of other tasks, red teaming and vibe checks are surprisingly manual evaluation processes....

January 15, 2024 · 18 min · Arduin Findeis

Interesting takes on the future of AI in 2023

In late 2022 and early 2023, we saw major leaps in the field of AI with generative models like those underpinning the ChatGPT interface. Below are selected commentary pieces by AI researchers, leaders and commentators that make predictions about the future of AI following these leaps. I highly recommend looking at these links – they have been very influential on my own thinking and work.
Geoffrey Hinton: The Godfather of AI on quitting Google to warn of AI risks (podcast)
Bill Gates: The age of AI has begun (blog post)
Ferenc Huszár: We May be Surprised Again: Why I take LLMs seriously (blog post)
Ezra Klein: My view on AI (podcast)
Sam Altman: OpenAI CEO on GPT-4, ChatGPT and the Future of AI (podcast)
Andrej Karpathy: Emergence of a whole new computing paradigm (Twitter post)
Hope you find these helpful!...

May 17, 2023 · 1 min · Arduin Findeis

A short exploration of language model evaluation approaches

Disclaimer: This post collects some notes from exploring language model evaluation approaches in early 2023. It will (likely) be outdated by the time you read this. Improvement suggestions are welcome and can be sent to contact (at) arduin.io.
1. Introduction
Language models (LMs) are notoriously difficult to evaluate. Modern LMs are used for a wide variety of complex downstream tasks, such as text translation or conversation. This diversity of tasks means that no single metric can capture overall language model performance....

March 6, 2023 · 16 min · Arduin Findeis