🚧 This post is currently a public draft with preliminary results. I appreciate any feedback – reach out via contact [at] arduin.io! 🚧
1. Introduction
The open-weights model Llama 4 Maverick was released on 5 April 2025. Around the same time, a related but non-identical experimental model version was evaluated on Chatbot Arena (Llama-4-Maverick-03-26-Experimental). Some users reported notable differences between these two models. In this post, I use our Feedback Forensics app to quantitatively dissect how exactly the chat behaviour of the public and arena versions of Llama 4 Maverick differs.
2. Setup
✍︎ Note on naming: for brevity, I will refer to the two versions of Llama 4 Maverick as the public model (used for the open-weights release) and the arena model (used on Chatbot Arena around 5 April 2025, full name: Llama-4-Maverick-03-26-Experimental) respectively.
I don’t have direct access to the arena model, but the Chatbot Arena team (very helpfully) released a dataset of responses generated by it. With Feedback Forensics we can use this data to directly compare the arena model’s behaviour to the public model’s — without requiring new responses from the arena model itself, which is no longer accessible (as conventional benchmarks would).
To compare the models, I run through the following steps:
- Generation. Using the public model, I first generate a set of responses to the same prompts for which we have responses from the arena model. The public model’s responses are generated via Lambda and OpenRouter (a minimal sketch of this step is shown below, after this list). Thus, we now have a dataset with three model responses* per prompt: (1) by the arena model, (2) by the public model, and (3) by an opposing (non-Llama-4) model from Chatbot Arena (e.g. GPT-4o).
- Behaviour annotation. Next, for any pair of responses to a prompt, I annotate (with the help of AI annotators) how the observed model behaviours differ across these responses, e.g. which of the responses is more polite. In particular, I test for the 40 different model behaviours included in the ICAI standard set of principles (v3). This ICAI standard set is a collection of instructions (referred to as “principles”) for AI annotators to select for model behaviour differences. The collection includes behaviour differences that were previously observed in online discussions or related literature, or detected via Inverse Constitutional AI (ICAI). I run principle-following AI annotators on all possible response pairs for a given prompt (e.g. model X’s vs model Y’s response) to collect information about how the models differ relative to each other. In the end, we have an extensive collection of annotated response pairs.
  Illustrative example of the resulting dataset: a datapoint in this collection would be (1) a prompt ("How far away is the moon?") and (2) a pair of model responses (arena model: "Great question, very far away: about 384k km"; public model: "About 384k km") plus (3) 40 different behaviour annotations labelling which of the responses is, for example, more enthusiastic and polite (arena) or more concise (public).
- Visualisation. Finally, I locally run the Feedback Forensics app to look at the aggregated annotation results, allowing me to investigate how the behaviour of different models differs across all available prompts and responses. This analysis provides us with a quantitative evaluation of the entire observed model behaviour rather than relying on the small number of individual response pairs I would be able to manually inspect.
*Some conversations (13.1%) from the Chatbot Arena dataset are multi-turn. For simplicity, I only generated responses for single-turn prompts (86.9% of prompts) with the public model. Thus, for some prompts we only have two responses: by the arena model and the opposing (non-Llama-4) model.
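For illustration, here is a minimal sketch of what the generation step can look like via OpenRouter’s OpenAI-compatible API. The model identifier and script structure are assumptions for illustration, not necessarily the exact setup I used; temperature 0 matches the sampling choice described in the caveats section.

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; the model identifier below is
# an assumption for illustration (check the provider catalogue for the exact name).
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def generate_public_response(prompt: str) -> str:
    """Generate one public-model response for a single-turn arena prompt."""
    completion = client.chat.completions.create(
        model="meta-llama/llama-4-maverick",  # assumed OpenRouter model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # matches the sampling choice mentioned in the caveats
    )
    return completion.choices[0].message.content

print(generate_public_response("How far away is the moon?"))
```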
Metrics
Throughout this analysis, we primarily rely on a measure of annotator performance we refer to as annotator strength or simply strength. In our scenario, annotator strength quantifies how much more a particular selected model exhibits a certain behaviour compared to the other models in our dataset (e.g. model X is more polite than the other models). This relative behaviour is measured by testing how consistently a principle-following annotator (e.g. an annotator selecting more polite responses) picks out the selected model’s responses across the whole set of response pairs from the two (or more) models being compared. If such an annotator always picks the selected model, this indicates that the corresponding behaviour is stronger in the selected model compared to the other models. In other words: strength measures how much selecting for a particular behaviour differentiates the selected model from the others.
Numerically, strength is defined as the Cohen’s kappa agreement between two annotators, a principle-following annotator (e.g. selecting the more polite response) and a model-selecting annotator (e.g. always picking the arena model), weighted by the relevance of the principle-following annotator (i.e. the proportion of comparisons where the principle-following annotator decided that the principle applies).
In the context of this analysis, the strength measure can be interpreted as follows:
| Strength metric | Interpretation |
|---|---|
| close to +1 | The selected model’s responses exhibit a certain behaviour more than other models’. Explanation: the behaviour associated with the principle separates the selected model from the others, since an annotator following only this principle is able to pick out the selected model’s responses. |
| close to 0 | The selected model’s responses are similar to other models’ with respect to the annotated behaviour. Explanation: all compared models exhibit the behaviour with similar frequency, so the principle-following annotator cannot distinguish between models. |
| close to -1 | The selected model’s responses exhibit a certain behaviour less than other models’. Explanation: other models exhibit the principle’s behaviour more than the selected model, so the principle-following annotator picks those models’ responses more often than the selected model’s. |
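As a rough illustration of this definition (a minimal sketch with toy labels, not the actual Feedback Forensics implementation), strength for one principle and one selected model can be computed as the relevance-weighted Cohen’s kappa between the principle-following annotator and an annotator that always picks the selected model’s response:

```python
from sklearn.metrics import cohen_kappa_score

# Toy data: 6 response pairs ("a" vs "b", with the response order randomised).
# Side of each pair that the selected (arena) model's response is on:
arena_side       = ["a", "b", "a", "b", "a", "b"]
# Side picked by a principle-following annotator (e.g. "select the more polite
# response"); None means the annotator judged the principle not relevant here.
principle_choice = ["a", "b", "a", "a", None, "b"]

# Relevance: share of comparisons where the principle applied at all.
relevant  = [c is not None for c in principle_choice]
relevance = sum(relevant) / len(relevant)

# Agreement (Cohen's kappa) between the two annotators on relevant comparisons.
kappa = cohen_kappa_score(
    [c for c in principle_choice if c is not None],
    [s for s, r in zip(arena_side, relevant) if r],
)

strength = relevance * kappa  # close to +1: behaviour is stronger in the arena model
print(f"relevance={relevance:.2f}, kappa={kappa:.2f}, strength={strength:.2f}")
```

In the real analysis this is computed over all annotated response pairs and all 40 principles; the exact aggregation details may differ from this toy version.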
3. Results: What’s different about the arena model?
Using this setup, we are now able to get a good overview of how the arena version of Llama 4 Maverick differs from the public version. In this section, I go through the main observations.

Figure 1: Overview of the Feedback Forensics results. This is a screenshot of the online app.
First and most obvious: Responses are more verbose
The arena model’s responses appear more verbose than the public model’s. The results in Figure 2 indicate that the arena model generated responses that were “too long”, “more verbose” and “provide more detailed explanations”, relative to the public model’s responses.

Figure 2: Results showing that the arena model's responses are more verbose than the public model's. The numbers shown are annotator strength values, see metrics section for a detailed discussion.
(online version of results)
This difference in length is also visible in Feedback Forensics’ Overall Metrics table (Figure 3), which shows more conventional annotation metrics and statistics. Measured in characters, the arena model’s responses are longer than the public version’s in 98.6% of comparisons, and about 6978.4 characters long on average vs 2981.9 for the public model. Note that differences in sampling procedures may play some role in this length difference, but similar differences can also be seen when looking at why human annotators preferred the arena version in the original dataset (see Figure 7). Whether due to sampling procedures or not, the arena model is rather verbose – also relative to other models on Chatbot Arena.

Figure 3: Screenshot of Feedback Forensics' Overall Metrics table with more conventional metrics showing how much longer the arena model's responses were relative to the public model's. (online version).
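These length statistics are straightforward to recompute yourself. Below is a minimal sketch with placeholder data; the column names are illustrative, not the actual dataset schema.

```python
import pandas as pd

# Placeholder table of paired responses; in practice there is one row per prompt
# with the arena model's and the public model's response to that prompt.
pairs = pd.DataFrame({
    "arena_response":  ["Great question, very far away: about 384k km!", "..."],
    "public_response": ["About 384k km.", "..."],
})

arena_len  = pairs["arena_response"].str.len()
public_len = pairs["public_response"].str.len()

print("mean characters (arena): ", arena_len.mean())
print("mean characters (public):", public_len.mean())
print("share of pairs where arena is longer:", (arena_len > public_len).mean())
```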
Second: Tone - friendlier, more enthusiastic, more humorous
The results show that the arena model’s tone appears quite different from the public version’s: its responses are much friendlier, more confident, more humorous, more emotional, more enthusiastic and more casual. Figure 4 shows the results for a subset of these behaviours; the full set can be found in the online version.

Figure 4: The arena model appears friendlier, more emotionally expressive, enthusiastic and humorous. Further tone-related behaviours were detected but are omitted here to save space; see online version for full results. The numbers shown are annotator strength values, see metrics section for a detailed discussion.
Third: More formatting and emojis
As shown in Figure 5, the results indicate that the arena model uses more Markdown formatting than the public model: more italics and bold style, more numbered lists, and more emojis. These results echo the online observations highlighted by the Chatbot Arena team.

Figure 5: The arena model uses more formatting than the public model. (online version)
Further differences: Clearer reasoning, more references etc.
As we can see in the overview results (Figure 1), there are quite a few other differences between the two models beyond the three categories already mentioned. For example, the arena model appears to provide clearer reasoning (perhaps through more elaborate responses), more creative responses and more references to other sources. See the interactive online results for the full list of model differences observed by Feedback Forensics.
Bonus 1: Things that stayed consistent
I also find that some behaviours are similar between the models: on the limited set of prompts used in my experiments, the public and arena models are similarly very unlikely to suggest illegal activities, be offensive or use inappropriate language. The numbers shown in Figure 6 are strength values, but when looking at the relevance of these principles we observe that each is only relevant for 3% or less of the dataset, i.e. neither model frequently produces such content on the tested prompt set.

Figure 6: The arena and public version are similarly unlikely to produce offensive or inappropriate language, or suggest illegal activities (on the tested prompt set). (online version)
Bonus 2: Human annotators like the arena model’s behaviours
Feedback Forensics can not only be used to test model behaviour, but also to see what model behaviour (human) annotators prefer. In this scenario, the strength metric measures how well the principle-following annotators are able to reconstruct the human annotations. A high strength value can be interpreted as meaning that the human annotations appear to prefer this model characteristic (or another characteristic correlated with it). For this analysis, I use the original non-tie human annotations from the Chatbot Arena dataset (1,774 data points), comparing the arena model to other non-Llama-4 models. The results shown below indicate that human annotators indeed like many of these changes in the arena version, potentially explaining the arena model’s strong performance.

Figure 7: Human annotations indeed appear to select for many of the behaviours that are stronger in the arena model: a friendlier tone, longer responses, and more formatting (online version).
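Concretely, the only thing that changes relative to the strength sketch in the metrics section is the target annotator: instead of an annotator that always picks the arena model, we use the human vote from each non-tie battle. A minimal, self-contained illustration with toy labels:

```python
from sklearn.metrics import cohen_kappa_score

# Which response the human preferred in each non-tie battle ("a" or "b"):
human_choice     = ["a", "a", "b", "a", "b"]
# Which response a principle-following annotator (e.g. "select the friendlier
# response") picked; None means it judged the principle not relevant here.
principle_choice = ["a", "a", "b", None, "a"]

relevant  = [c is not None for c in principle_choice]
relevance = sum(relevant) / len(relevant)
kappa = cohen_kappa_score(
    [c for c in principle_choice if c is not None],
    [h for h, r in zip(human_choice, relevant) if r],
)
strength = relevance * kappa  # high value: humans tend to prefer responses showing this behaviour
print(f"relevance={relevance:.2f}, kappa={kappa:.2f}, strength={strength:.2f}")
```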
4. Reproducing my analysis
✍︎ Note: You can also take a closer look at the results in the online version of the Feedback Forensics app – without running anything locally.
There are only three main steps involved in getting to these results locally (after installing ICAI and Feedback Forensics):
- Parsing the data into a suitable format: use my parsing notebook here.
- Running the relevant ICAI experiment with the standard principle annotators, using a config like this one:
icai-exp -cd exp/configs/322_arena_llama4_full_pairwise
- Launching the Feedback Forensics app to inspect the results (this command is also included at the end of the output of the command above):
feedback-forensics -d <PATH-TO-RESULTS>
5. Caveats
This post describes preliminary analysis. The usual caveats for AI annotators and potentially inconsistent sampling procedures apply.
- Public model sampling: I use OpenRouter/Lambda’s default parameters other than setting temperature to 0. The exact sampling procedure likely differs from the one originally used for the arena model. However, most of the relative behavioural observations discussed hold not only against the public Maverick model but also against all the opponent models included in the original dataset.
- AI annotators: All our annotations are created by LLM-as-a-Judge annotators, which can be noisy and have been shown to exhibit various biases in different situations. I would expect the exact numbers to vary between annotation runs but the general trends to stay the same (e.g. no sign flips in strength for currently high-strength results). Nevertheless: it’s always good advice to also look at (at least a random subset of) the data directly!
6. Conclusion
Llama 4 Maverick’s arena version is notably different from the public open-weights version. It shows many model behaviours that human annotators are known to disproportionately prefer, such as longer responses, more confident responses, and better-formatted responses. The strong difference between the two models highlights the importance of having a detailed understanding of preference data beyond a single aggregate number – our open-source Feedback Forensics app aims to help with that!
Further links
If you want to understand your own model and data better, try Feedback Forensics!
- Feedback Forensics:
- Online version: app.feedbackforensics.com
- Local version: pip install feedback-forensics
- GitHub & docs: github.com/rdnfn/feedback-forensics
- Resources for reproducing results:
- Chatbot Arena Llama 4 Maverick data: link to HuggingFace
- My data parsing notebook: link to file on GitHub
- My experiment configuration for running ICAI annotators: link to file on GitHub
🦝
Acknowledgements
I would like to thank Sofia Orellana for detailed feedback on earlier versions of this post. All mistakes remain my own of course.
Citation
If you found this post useful for your work, please consider citing it as:
Findeis, Arduin. (Apr 2025). What exactly was different about the Chatbot Arena version of Llama 4 Maverick?. Retrieved from https://arduin.io/blog/llama4-analysis/.
or
@article{Findeis2025WhatExactlyWasDifferentLlama4Maverick,
title = "What exactly was different about the Chatbot Arena version of Llama 4 Maverick?",
author = "Findeis, Arduin",
journal = "arduin.io",
year = "2025",
month = "April",
url = "https://arduin.io/blog/llama4-analysis/"
}