Model Card Explorer

88% of frontier benchmarks are reported by only one lab.

We extracted every benchmark from 82 public model cards across 6 frontier labs, 495 distinct benchmark names in total. Only 57 are reported by two or more labs. MMLU is the only benchmark that five labs report.
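The headline figure follows from the two counts in the paragraph above. A minimal sketch of the arithmetic, with illustrative variable names that are not taken from the explorer itself:

```python
# Counts quoted in the text above.
total_benchmarks = 495   # distinct benchmark names across all model cards
shared_benchmarks = 57   # benchmarks reported by two or more labs

# Benchmarks reported by exactly one lab.
single_lab = total_benchmarks - shared_benchmarks

# Share of all benchmarks that only one lab reports, as a percentage.
single_lab_share = 100 * single_lab / total_benchmarks

print(f"{single_lab} of {total_benchmarks} benchmarks "
      f"({single_lab_share:.0f}%) are reported by a single lab")
```

This leaves 438 single-lab benchmarks, which rounds to the 88% in the headline.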

Counted by benchmark family rather than by name: 90%.

How many labs report each benchmark?

[Chart: benchmarks grouped by the number of labs reporting them, from 1 lab to 6 labs.]
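The grouping behind this chart can be sketched as follows. This assumes the extracted data is available as a mapping from benchmark name to the set of labs that report it. The entries below are made-up placeholders, not data from the explorer, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy input: benchmark name -> set of labs that report it.
# These entries are placeholders, not data from the explorer.
reports = {
    "bench_a": {"lab_1"},
    "bench_b": {"lab_1", "lab_2"},
    "bench_c": {"lab_3"},
    "bench_d": {"lab_1", "lab_2", "lab_3"},
    "bench_e": {"lab_2"},
}


def labs_per_benchmark_histogram(reports):
    """Return a Counter mapping number of labs -> number of benchmarks."""
    return Counter(len(labs) for labs in reports.values())


def benchmarks_in_column(reports, n_labs):
    """List the benchmarks reported by exactly n_labs labs (one chart column)."""
    return sorted(name for name, labs in reports.items() if len(labs) == n_labs)


histogram = labs_per_benchmark_histogram(reports)
for n_labs in sorted(histogram):
    print(f"{n_labs} lab(s): {histogram[n_labs]} benchmark(s)")
```

Calling `benchmarks_in_column(reports, 1)` returns the names in the first column, which is the set the headline percentage counts.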

What each lab uniquely emphasizes

Benchmarks reported by only one lab, grouped by type. Bar length shows total count; segment colors show the category mix.

Categories: Safety, Other, Agentic, Coding, Reasoning, Multimodal, Knowledge, Arena / Preference, Multilingual, Math, Long Context, Instruction Following, Medical

Anthropic: 184 benchmarks, Safety 35%
OpenAI: 140 benchmarks, Safety 47%
Google DeepMind: Safety 41%
Meta AI: Reasoning 27%
xAI: Safety 93%
Mistral AI: Arena / Preference 64%

Anthropic and OpenAI concentrate their unique benchmarks on safety; Google spreads theirs across multimodal and multilingual evaluations; Mistral's are mostly human-preference comparisons. On these benchmarks, no lab is grading itself on a scoreboard another lab uses.

We measure what labs publicly report in their model cards — not what they privately evaluate. Fragmentation of reporting does not imply concealment. Covers 6 Western frontier labs: Anthropic, Google, Meta, Mistral, OpenAI, xAI. Last updated from 82 public documents. See methodology for the family canonicalization rule.

Which safety topics each lab covers

A matrix across 6 labs and 15 categories, from child safety to political neutrality — how strongly each lab's public docs address each topic.