Model Card Explorer
89% of frontier benchmarks are reported by only one lab.
We extracted every benchmark from 78 public model cards across 6 frontier labs: 384 distinct benchmark names in total. Only 41 are reported by two or more labs, which leaves 343 (89%) reported by a single lab. MMLU is the only benchmark reported by five labs.
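As a rough illustration of how these counts fit together, the sketch below tallies how many labs report each benchmark and then rechecks the headline share from the totals stated above. The lab names, benchmark names, and function names are hypothetical placeholders, not the explorer's data or code, and the sketch assumes benchmark names have already been canonicalized.

```python
from collections import Counter

# Hypothetical toy input: lab -> set of canonical benchmark names.
# Illustrative placeholders only, not the explorer's dataset.
reports = {
    "LabA": {"shared_bench", "bench_a1", "bench_a2"},
    "LabB": {"shared_bench", "bench_b1"},
    "LabC": {"bench_c1"},
}


def labs_per_benchmark(reports):
    """Count, for each benchmark, how many labs report it."""
    counts = Counter()
    for benchmarks in reports.values():
        counts.update(benchmarks)  # each lab contributes at most 1 per benchmark
    return counts


def single_lab_share(counts):
    """Fraction of distinct benchmarks reported by exactly one lab."""
    single = sum(1 for n in counts.values() if n == 1)
    return single / len(counts)


counts = labs_per_benchmark(reports)

# Number of labs -> number of benchmarks reported by that many labs.
histogram = Counter(counts.values())

# Headline figure recomputed from the totals stated in the text:
# 384 distinct benchmarks, 41 reported by two or more labs.
total, shared = 384, 41
headline = (total - shared) / total
print(f"{headline:.0%} of benchmarks are reported by only one lab")
```

The `histogram` mapping corresponds to the grouping of benchmarks by number of reporting labs; the real figures depend on the canonicalization rule described in the methodology.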
How many labs report each benchmark?
Click a column to list the benchmarks in that group.
What each lab uniquely emphasizes
Benchmarks reported by only one lab, grouped by type. Bar length shows total count; segment colors show the category mix.
Anthropic and OpenAI concentrate their unique benchmarks on safety; Google spreads across multimodal and multilingual; Mistral reports only human-preference comparisons. Within this single-lab set, no lab is grading itself on a scoreboard another lab uses.
We measure what labs publicly report in their model cards, not what they privately evaluate. Fragmentation of reporting does not imply concealment. Coverage is limited to 6 Western frontier labs: Anthropic, Google, Meta, Mistral, OpenAI, xAI. Last updated from 78 public documents. See methodology for the family canonicalization rule.
Which safety topics each lab covers
A matrix across 6 labs and 15 categories, from child safety to political neutrality — how strongly each lab's public docs address each topic.