Claude 3 Model Card

Model card · 14,146 words · 62 min read · Mar 31, 2026
Summary

A 737-word brief of a 14,146-word document. Published by Anthropic. Version dated Mar 31, 2026.
01

What this is

Claude 3 is a family of three large multimodal models developed by Anthropic and announced in March 2024: Opus (most capable), Sonnet (balanced skills and speed), and Haiku (fastest and least expensive). The family supersedes Claude 2 and adds vision input to all three variants. The models are trained with unsupervised learning and Constitutional AI on AWS and GCP hardware. The knowledge cutoff for all three models is August 2023.

02

Capabilities

Claude 3 Opus scores 86.8% on MMLU (5-shot), 50.4% on GPQA Diamond (0-shot CoT; 59.5% with Maj@32 5-shot CoT), 84.9% on HumanEval (0-shot), and 90.7% on MGSM, a multilingual math benchmark (0-shot). All three models accept image input (JPEG, PNG, GIF, WebP; up to 10 MB and 8000×8000 px) alongside text prompts. The production context window is 200K tokens; internal testing reached at least 1M tokens, with Opus achieving 99.4% average recall across all context lengths on Needle In A Haystack. Opus also scores 90.5% on the QuALITY long-context reading benchmark (1-shot).
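
The image limits stated above (four formats, 10 MB, 8000×8000 px) can be checked before a request is sent. The following is a minimal sketch, not Anthropic's validation logic: the `ImageInfo` container and `check_image` function are hypothetical names, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
from dataclasses import dataclass

# Limits as stated in the summary; the byte interpretation of "10 MB" is an assumption.
ALLOWED_FORMATS = {"jpeg", "png", "gif", "webp"}
MAX_BYTES = 10 * 1024 * 1024
MAX_SIDE_PX = 8000


@dataclass
class ImageInfo:
    """Hypothetical container for image metadata (not an Anthropic type)."""
    fmt: str
    size_bytes: int
    width: int
    height: int


def check_image(info: ImageInfo) -> list:
    """Return a list of problems; an empty list means the image fits the stated limits."""
    problems = []
    if info.fmt.lower() not in ALLOWED_FORMATS:
        problems.append(f"unsupported format: {info.fmt}")
    if info.size_bytes > MAX_BYTES:
        problems.append("file larger than 10 MB")
    if info.width > MAX_SIDE_PX or info.height > MAX_SIDE_PX:
        problems.append("dimensions exceed 8000x8000 px")
    return problems
```

Returning a list of problems rather than a single boolean lets a caller report every violated limit at once.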

03

Evaluation methodology

Anthropic used industry-standard benchmarks (MMLU, GPQA, MATH, HumanEval, BIG-Bench-Hard, and others) with chain-of-thought prompting and majority voting (Maj@32) to reduce variance. GPQA Diamond scores were averaged over 10 evaluation rollouts with randomized multiple-choice option ordering to account for high variance. Human preference evaluations used head-to-head crowdworker and domain-expert comparisons, with binary win rates converted to approximate Elo deltas. Catastrophic-risk evaluations were conducted on a lower-refusal version of Opus in multiple iterative rounds, including a version close to the final release candidate with harmlessness training.
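
Two of the techniques named above are easy to illustrate. The sketch below shows majority voting over sampled answers (as in Maj@32) and a win-rate-to-Elo conversion. The function names are hypothetical, and the Elo formula is the standard logistic one; the summary does not say which exact conversion Anthropic used, so treat it as an assumption.

```python
import math
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer among k sampled answers (Maj@k)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Counter.most_common breaks ties in favor of the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def elo_delta(win_rate):
    """Elo gap implied by a head-to-head win rate under the standard logistic Elo model."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))
```

Under this model a 50% win rate maps to a delta of 0, and a win rate of 10/11 (about 90.9%) maps to a delta of 400.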

04

Safety testing

Anthropic evaluated Opus under its Responsible Scaling Policy (RSP) across three catastrophic-risk categories: autonomous replication and adaptation (ARA), biological uplift, and cyber capabilities; all three models are classified ASL-2. For ARA, the ASL-3 warning threshold was passing ≥50% of tasks at a ≥10% pass rate; the model failed at least 3 of 5 tasks and did not cross this threshold, though it passed a simplified copycat-API task. For biological risk, human uplift trials found "what we believe is a minor uplift in accuracy" without safeguards and no change with safeguards; the model did not cross either automated trigger threshold. For cyber, the model scored 30% on one vulnerability discovery task but required substantial hints, and expert reviewers judged the ASL-3 threshold not crossed. The card notes that "these results do not comprehensively rule out risk" and that elicitation methodology is still being improved.
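
The ARA warning threshold above combines two cutoffs. One reading, which is an assumption here, is that a task counts as passed when the model succeeds in at least 10% of attempts, and the threshold is crossed when at least 50% of tasks are passed. A minimal sketch under that reading, with a hypothetical function name and illustrative inputs:

```python
def crosses_ara_warning(task_pass_rates, per_task_min=0.10, task_fraction_min=0.50):
    """True if at least task_fraction_min of the tasks reach a per_task_min pass rate.

    task_pass_rates maps a task name to the fraction of attempts that succeeded.
    """
    if not task_pass_rates:
        raise ValueError("need at least one task")
    passed = sum(1 for rate in task_pass_rates.values() if rate >= per_task_min)
    return passed / len(task_pass_rates) >= task_fraction_min
```

With five tasks, failing three leaves at most 2 of 5 (40%) passed, which is below the 50% cutoff. That is consistent with the summary's statement that the model failed at least 3 of 5 tasks and did not cross the threshold.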

05

Mitigations

Constitutional AI training uses a principles set derived from sources including the UN Declaration of Human Rights, updated for Claude 3 with a disability-rights principle from the Collective Constitutional AI public input process. Real-time AUP classifiers trigger prompt modification for flagged inputs; severe violations result in model blocking, and repeated violations may result in access termination. At ASL-2, Anthropic hardens security for all Claude 3 model weights and runs automated detection of CBRN- and cyber-risk-related prompts on all deployed models. Trust and Safety multimodal red-teaming covers topics including child safety, dangerous weapons, hate speech, violent extremism, fraud, and illegal substances; Opus responded harmlessly to 370 of 378 (97.9%) red-team prompts and Sonnet to 375 of 378 (99.2%).
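
The red-team percentages above follow directly from the counts. A short arithmetic check, where the helper name is hypothetical:

```python
def harmless_rate(harmless, total):
    """Percentage of prompts answered harmlessly, rounded to one decimal place."""
    return round(100.0 * harmless / total, 1)


opus = harmless_rate(370, 378)    # 97.9
sonnet = harmless_rate(375, 378)  # 99.2
```

In absolute terms, 8 of the 378 prompts for Opus and 3 of the 378 for Sonnet fall outside the harmless count.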

06

Deployment and access

Claude 3 models are available via Claude.ai, Claude Pro, the Anthropic API, Amazon Bedrock, and Google Vertex AI. All users must read and affirmatively acknowledge the Acceptable Use Policy before access. The AUP prohibits uses including political campaigning, surveillance, social scoring, criminal justice decisions, law enforcement, and certain financing, employment, and housing decisions. Claude.ai is available in 95 countries; the Anthropic API has general availability in 159 countries.

07

Limitations

The card flags confabulations, bias, factual errors, and susceptibility to jailbreaks as known issues for all current LLMs including Claude 3. Models do not search the web, answer only from data through August 2023, and refuse to identify people in images. Multilingual performance degrades on low-resource languages. Multimodal outputs can contain inaccurate image descriptions and are not suitable for consequential use cases requiring high precision without human validation; performance is additionally lower for small or low-resolution images.

08

What's new

Claude 3 adds multimodal image input across all three model variants, a capability absent from the Claude 2 family. The production context window expanded to 200K tokens from the prior 100K maximum. Incorrect refusal rates on XSTest dropped from 35.1% with Claude 2.1 to 9% with Claude 3 Opus, and factual accuracy on the internal "100Q Hard" evaluation improved by nearly 2× over Claude 2.1, reaching 46.5%. The model constitution was updated with a new disability-rights principle derived from the Collective Constitutional AI public input process.

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 18c02c0de9cc).

Extracted Evaluations (23 results)

Benchmark       Category           State   Score  Variant             Source
MBPP            coding             scored  86.4   Pass@1              self-reported
ARC-Challenge   general_knowledge  scored  96.4   25-shot             self-reported
MMLU            general_knowledge  scored  79.1   5-shot              self-reported
MATH            math               scored  73.7   Maj@32              self-reported
PubMedQA        medical            scored  75.8   5-shot              self-reported
PubMedQA        medical            scored  74.9   0-shot              self-reported
MGSM            multilingual       scored  90.7   0-shot              self-reported
MGSM            multilingual       scored  90.5   8-shot              self-reported
ChartQA         multimodal         scored  80.8   0-shot CoT          self-reported
MathVista       multimodal         scored  50.5   0-shot CoT          self-reported
MathVista       multimodal         scored  49.9   -                   self-reported
LSAT            other              scored  161.0  5-shot CoT          self-reported
RACE-H          other              scored  92.9   5-shot              self-reported
AI2D            other              scored  88.1   0-shot              self-reported
MBE             other              scored  85.0   0-shot CoT          self-reported
APPS            other              scored  70.2   0-shot              self-reported
AMC             other              scored  63.0   5-shot CoT          self-reported
HellaSwag       reasoning          scored  95.4   10-shot             self-reported
WinoGrande      reasoning          scored  88.5   5-shot              self-reported
BIG-Bench Hard  reasoning          scored  86.8   3-shot CoT          self-reported
GPQA-Diamond    reasoning          scored  59.5   Maj@32, 5-shot CoT  self-reported
GPQA-Diamond    reasoning          scored  53.3   5-shot CoT          self-reported
GPQA-Diamond    reasoning          scored  50.4   0-shot CoT          self-reported