There is a curious trend in machine learning (ML): researchers developing the most capable large language models (LLMs) increasingly evaluate them using manual methods such as red teaming. In red teaming, researchers hire workers to interact with the LLM and manually try to break it in some way. Similarly, some users (including myself) pick their preferred LLM assistant by manually trying out various models – checking each LLM’s “vibe”. Given that LLM researchers and users both actively seek to automate all sorts of other tasks, red teaming and vibe checks are surprisingly manual evaluation processes. This trend towards manual evaluation hints at fundamental problems that prevent more automatic evaluation methods, such as benchmarks, from being used effectively for LLMs (Ganguli et al., 2023; La Malfa et al., 2023). In this blog post, I aim to give an illustrated overview of the problems preventing LLM benchmarks from being a fully satisfactory alternative to more manual approaches.
...