Model Card Explorer

Summary

Claude 3 Model Card

A 737-word brief of a 14,146-word document. Published by Anthropic. Version dated Mar 31, 2026.

What this is

Claude 3 is a family of three large multimodal models — Opus (most capable), Sonnet (balanced skills and speed), and Haiku (fastest and least expensive) — developed by Anthropic and announced in March 2024. The family supersedes Claude 2 and adds vision input capabilities to all three variants. Models are trained with unsupervised learning and Constitutional AI on AWS and GCP hardware. The knowledge cutoff for all three models is August 2023.

Capabilities

Claude 3 Opus scores 86.8% on MMLU (5-shot), 50.4% on GPQA Diamond (0-shot CoT; 59.5% with Maj@32 5-shot CoT), 84.9% on HumanEval (0-shot), and 90.7% on multilingual math MGSM (0-shot). All three models accept image input (JPEG, PNG, GIF, WebP; up to 10 MB and 8000×8000 px) alongside text prompts. The production context window is 200K tokens; internal testing reached at least 1M tokens, with Opus achieving 99.4% average recall across all context lengths on Needle In A Haystack. Opus achieves 90.5% on the QuALITY long-context reading benchmark (1-shot).

Evaluation methodology

Anthropic used industry-standard benchmarks (MMLU, GPQA, MATH, HumanEval, BIG-Bench-Hard, and others) with chain-of-thought prompting and majority voting (Maj@32) to reduce variance. GPQA Diamond scores were averaged over 10 evaluation rollouts with randomized multiple-choice option ordering to account for high variance. Human preference evaluations used head-to-head crowdworker and domain-expert comparisons, with binary win rates converted to approximate Elo deltas. Catastrophic-risk evaluations were conducted on a lower-refusal version of Opus in multiple iterative rounds, including a version close to the final release candidate with harmlessness training.

Safety testing

Anthropic evaluated Opus under its Responsible Scaling Policy (RSP) across three catastrophic-risk categories: autonomous replication and adaptation (ARA), biological uplift, and cyber capabilities; all three models are classified ASL-2. For ARA, the ASL-3 warning threshold was passing ≥50% of tasks at a ≥10% pass rate; the model failed at least 3 of 5 tasks and did not cross this threshold, though it passed a simplified copycat-API task. For biological risk, human uplift trials found "what we believe is a minor uplift in accuracy" without safeguards and no change with safeguards; the model did not cross either automated trigger threshold. For cyber, the model scored 30% on one vulnerability discovery task but required substantial hints, and expert reviewers judged the ASL-3 threshold not crossed. The card notes that "these results do not comprehensively rule out risk" and that elicitation methodology is still being improved.

Mitigations

Constitutional AI training uses a principles set derived from sources including the UN Declaration of Human Rights, updated for Claude 3 with a disability-rights principle from the Collective Constitutional AI public input process. Real-time AUP classifiers trigger prompt modification for flagged inputs; severe violations result in model blocking, and repeated violations may result in access termination. At ASL-2, Anthropic hardens security for all Claude 3 model weights and runs automated detection of CBRN- and cyber-risk-related prompts on all deployed models. Trust and Safety multimodal red-teaming covers topics including child safety, dangerous weapons, hate speech, violent extremism, fraud, and illegal substances; Opus responded harmlessly to 370 of 378 (97.9%) red-team prompts and Sonnet to 375 of 378 (99.2%).

Deployment and access

Claude 3 models are available via Claude.ai, Claude Pro, the Anthropic API, Amazon Bedrock, and Google Vertex AI. All users must read and affirmatively acknowledge the Acceptable Use Policy before access. The AUP prohibits uses including political campaigning, surveillance, social scoring, criminal justice decisions, law enforcement, and certain financing, employment, and housing decisions. Claude.ai is available in 95 countries; the Anthropic API has general availability in 159 countries.

Limitations

The card flags confabulations, bias, factual errors, and susceptibility to jailbreaks as known issues for all current LLMs including Claude 3. Models do not search the web, answer only from data through August 2023, and refuse to identify people in images. Multilingual performance degrades on low-resource languages. Multimodal outputs can contain inaccurate image descriptions and are not suitable for consequential use cases requiring high precision without human validation; performance is additionally lower for small or low-resolution images.

What's new

Claude 3 adds multimodal image input across all three model variants, a capability absent in the Claude 2 family. The production context window expanded to 200K tokens from the prior 100K maximum. Incorrect refusal rates on XSTest dropped from 35.1% with Claude 2.1 to 9% with Claude 3 Opus, and factual accuracy on the internal "100Q Hard" evaluation improved by nearly 2× (46.5% vs. prior Claude 2.1 level). The model constitution was updated with a new disability-rights principle derived from the Collective Constitutional AI public input process.

Category	State	Score	Setup	Source
coding	scored	86.4% pass at 1	missing: shot countmissing: methodmissing: languagemissing: training state	self-reported
knowledge	scored	79.1%	5-shotmissing: methodmissing: languagemissing: training state	self-reported
math	scored	73.7%	majority-votingmissing: shot countmissing: languagemissing: training state	self-reported
medical	scored	75.8	5-shotmissing: methodmissing: languagemissing: training state	self-reported
medical	scored	74.9	0-shotmissing: methodmissing: languagemissing: training state	self-reported
multilingual	scored	90.7%	0-shotmissing: methodmissing: languagemissing: training state	self-reported
multilingual	scored	90.5%	8-shotmissing: methodmissing: languagemissing: training state	self-reported
multimodal	scored	80.8	0-shotCoTmissing: languagemissing: training state	self-reported
multimodal	scored	50.5	0-shotCoTmissing: languagemissing: training state	self-reported
other	scored	161.0	5-shotCoTmissing: languagemissing: training state	self-reported
other	scored	92.9	5-shotmissing: methodmissing: languagemissing: training state	self-reported
other	scored	88.1	0-shotmissing: methodmissing: languagemissing: training state	self-reported
other	scored	85.0	0-shotCoTmissing: languagemissing: training state	self-reported
other	scored	70.2	0-shotmissing: methodmissing: languagemissing: training state	self-reported
other	scored	63.0	5-shotCoTmissing: languagemissing: training state	self-reported
reasoning	scored	96.4%	25-shotmissing: methodmissing: languagemissing: training state	self-reported
reasoning	scored	95.4%	10-shotmissing: methodmissing: languagemissing: training state	self-reported
reasoning	scored	88.5%	5-shotmissing: methodmissing: languagemissing: training state	self-reported
reasoning	scored	86.8	3-shotCoTmissing: languagemissing: training state	self-reported
reasoning	scored	59.5	5-shotCoTmissing: languagemissing: training state	self-reported
reasoning	scored	53.3	5-shotCoTmissing: languagemissing: training state	self-reported
reasoning	scored	50.4	0-shotCoTmissing: languagemissing: training state	self-reported

Claude 3 Model Card

Claude 3 Model Card

What this is

Capabilities

Evaluation methodology

Safety testing

Mitigations

Deployment and access

Limitations

What's new

Extracted Evaluations(22 results)