Model Card Explorer

89% of frontier benchmarks are reported by only one lab.

We extracted every benchmark from 78 public model cards across 6 frontier labs: 384 distinct benchmark names in total. Only 41 are reported by two or more labs. MMLU is the only benchmark reported by five labs.
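The headline figure follows from these counts: 384 - 41 = 343 benchmarks appear in only one lab's cards, and 343 / 384 is roughly 89%. Below is a minimal Python sketch of how such a tally could be computed. It assumes the extracted data is a mapping from lab name to the set of benchmark names that lab reports. The `reports` structure, the function names, and the toy data are illustrative assumptions, not the page's actual dataset or pipeline.

```python
from collections import Counter

def labs_per_benchmark(reports):
    """Map each benchmark name to the number of labs that report it."""
    counts = Counter()
    for benchmarks in reports.values():
        counts.update(set(benchmarks))  # each lab counts at most once per benchmark
    return counts

def single_lab_share(reports):
    """Fraction of distinct benchmarks reported by exactly one lab."""
    counts = labs_per_benchmark(reports)
    single = sum(1 for n in counts.values() if n == 1)
    return single / len(counts)

# Toy data (hypothetical, not the real extraction).
toy_reports = {
    "Lab A": {"MMLU", "BenchX", "BenchY"},
    "Lab B": {"MMLU", "BenchZ"},
}

# Histogram: number of labs -> number of benchmarks reported by that many labs.
histogram = Counter(labs_per_benchmark(toy_reports).values())

# Sanity check of the headline figure using the counts stated on the page.
headline = (384 - 41) / 384
print(f"{headline:.1%}")
```

The `histogram` value corresponds to the "how many labs report each benchmark" distribution: each key is a number of labs, each value the number of benchmarks in that group.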

Counted by benchmark family instead of by name, the figure is 90%.

How many labs report each benchmark?

[Bar chart: benchmarks grouped by number of labs reporting, from 1 lab to 6 labs.]

What each lab uniquely emphasizes

Benchmarks reported by only one lab, grouped by type. Bar length shows total count; segment colors show the category mix.

Categories: Safety, Agentic, Other, Multimodal, Coding, Reasoning, Knowledge, Multilingual, Arena / Preference, Long Context, Math, Instruction Following, Medical.

Unique benchmarks per lab, with the category share labeled on each bar:

OpenAI: 129 (Safety 61%)
Anthropic: 123 (Safety 43%)
Google DeepMind: 53 (Safety 34%)
Meta AI: 16 (Reasoning 25%)
xAI: 15 (Safety 93%)
Mistral AI: 7 (Arena / Preference 100%)

Anthropic and OpenAI concentrate their unique benchmarks on safety; Google spreads its unique benchmarks across multimodal and multilingual evaluations; Mistral's unique benchmarks are all human-preference comparisons. On these benchmarks, no lab is grading itself on a scoreboard another lab uses.
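The per-lab totals and category shares above could be recomputed from a list of single-lab benchmarks tagged with a category. The sketch below assumes records of the form (lab, benchmark, category). The record layout, the function name, and the toy rows are hypothetical and stand in for whatever format the underlying dataset actually uses.

```python
from collections import Counter, defaultdict

def category_mix(unique_rows):
    """Return {lab: (total, top_category, top_share_percent)} for single-lab benchmarks."""
    by_lab = defaultdict(Counter)
    for lab, _benchmark, category in unique_rows:
        by_lab[lab][category] += 1
    summary = {}
    for lab, cats in by_lab.items():
        total = sum(cats.values())
        top_category, top_count = cats.most_common(1)[0]
        summary[lab] = (total, top_category, round(100 * top_count / total))
    return summary

# Toy rows (hypothetical, not the real data).
toy_rows = [
    ("Lab A", "bench-1", "Safety"),
    ("Lab A", "bench-2", "Safety"),
    ("Lab A", "bench-3", "Coding"),
    ("Lab B", "bench-4", "Arena / Preference"),
]
summary = category_mix(toy_rows)

# Consistency check using the totals stated on the page.
page_totals = [129, 123, 53, 16, 15, 7]
assert sum(page_totals) == 384 - 41
```

As a consistency check, the six per-lab totals sum to 343, which matches the 384 distinct benchmarks minus the 41 reported by two or more labs.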

We measure what labs publicly report in their model cards, not what they privately evaluate. Fragmentation of reporting does not imply concealment. The analysis covers 6 Western frontier labs: Anthropic, Google, Meta, Mistral, OpenAI, and xAI. Last updated from 78 public documents. See methodology for the family canonicalization rule.
