Model Card Explorer

Summary

Claude 3.5 Model Card Addendum

A 622-word brief of a 2,958-word document. Published by Anthropic. Version dated Mar 31, 2026.

What this is

Claude 3.5 Sonnet is a large language model released by Anthropic, documented in an addendum to the Claude 3 Model Card rather than a new standalone card. It supersedes Claude 3 Opus as Anthropic's most capable publicly released model while operating faster and at lower cost. The document version date is 2026-03-31.

Capabilities

Claude 3.5 Sonnet scores 59.4% on GPQA Diamond (0-shot CoT), 92.0% on HumanEval (0-shot), and 90.4% on MMLU (5-shot CoT), outperforming Claude 3 Opus on all benchmarks tested. Across the five vision benchmarks tested, its reported scores include 95.2% on DocVQA, 90.8% on ChartQA, and 94.7% on AI2D. On an internal agentic coding evaluation it solves 64% of problems versus 38% for Claude 3 Opus, with tasks drawn from real pull requests to open source codebases. The model supports a 200k token context window and achieves 99.7% average recall on the Needle in a Haystack evaluation at that length.

Evaluation methodology

Anthropic tested the model on industry-standard benchmarks covering reasoning, math, coding, reading comprehension, and vision, including GPQA, MMLU, HumanEval, MATH, MathVista, ChartQA, DocVQA, and BIG-Bench Hard. Human preference win rates were gathered by asking raters to chat with models and score them on task-specific criteria, using Claude 3 Opus as the 50% baseline. Refusal calibration was measured using the Wildchat and XSTest datasets. The agentic coding evaluation ran in a secure sandboxed environment without internet access, with acceptance tests not visible to the model.
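As a rough illustration of what a win rate against a 50% baseline means, the sketch below computes a pairwise win rate from rater outcomes. The outcome labels, the `win_rate` helper, and the convention of counting ties as half a win are assumptions made for this example; the card does not describe how Anthropic aggregated rater scores.

```python
from collections import Counter


def win_rate(outcomes):
    """Share of comparisons won against the baseline model.

    Ties count as half a win (an assumed convention), so a model that is
    indistinguishable from the baseline lands at 0.5.
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    counts = Counter(outcomes)
    unknown = set(counts) - {"win", "loss", "tie"}
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    return (counts["win"] + 0.5 * counts["tie"]) / len(outcomes)


# Toy data invented for illustration; not taken from the card.
toy_outcomes = ["win"] * 6 + ["loss"] * 3 + ["tie"] * 1
toy_rate = win_rate(toy_outcomes)
print(f"{toy_rate:.0%}")
```

Under this convention, a rate above 0.5 means raters preferred the candidate model more often than the baseline.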

Safety testing

Anthropic conducted evaluations in CBRN, cybersecurity, and autonomous capabilities, with quantitative thresholds of concern defined for each domain that, if exceeded, would have triggered a council review and potential ASL escalation. The UK AISI conducted pre-deployment testing on a near-final model and shared results with the US AI Safety Institute under a bilateral Memorandum of Understanding. METR performed "an initial exploration of the model's autonomy-relevant capabilities." The card states "we observed an increase in capabilities in risk-relevant areas compared to Claude 3 Opus" but that Claude 3.5 Sonnet did not exceed any preset thresholds. The model does not trigger the 4x effective compute threshold that would require Anthropic's full RSP evaluation protocol, though Anthropic states it conducts safety testing voluntarily regardless.

Mitigations

Claude 3.5 Sonnet is trained with Helpful, Honest, and Harmless (HHH) alignment, yielding a 96.4% correct refusal rate on toxic Wildchat prompts and only a 1.7% incorrect refusal rate on XSTest. The card states the model "refused ASL-3 harmful queries adequately to substantially reduce its usefulness for clearly harmful queries." Anthropic classifies the model at ASL-2 under its Responsible Scaling Policy, indicating that it does not pose a risk of catastrophic harm. No additional deployment-time classifiers or tiered access controls are described in this document.
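The two refusal figures measure different things: a correct refusal rate is computed over prompts labeled harmful, and an incorrect refusal rate over prompts labeled benign. A minimal sketch of that bookkeeping follows. The `PromptResult` record and `refusal_rates` helper are hypothetical names, the data is invented, and the single mixed list is a simplification, since the card draws the two rates from separate datasets (Wildchat and XSTest).

```python
from dataclasses import dataclass


@dataclass
class PromptResult:
    is_harmful: bool  # dataset label for the prompt
    refused: bool     # whether the model declined to answer


def refusal_rates(results):
    """Return (correct_refusal_rate, incorrect_refusal_rate).

    Correct refusals are refusals of harmful prompts; incorrect refusals
    are refusals of benign prompts.
    """
    harmful = [r for r in results if r.is_harmful]
    benign = [r for r in results if not r.is_harmful]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign prompt")
    correct = sum(r.refused for r in harmful) / len(harmful)
    incorrect = sum(r.refused for r in benign) / len(benign)
    return correct, incorrect


# Toy data invented for illustration; not taken from the card.
toy_results = (
    [PromptResult(True, True)] * 3
    + [PromptResult(True, False)] * 1
    + [PromptResult(False, True)] * 1
    + [PromptResult(False, False)] * 4
)
correct_rate, incorrect_rate = refusal_rates(toy_results)
print(correct_rate, incorrect_rate)
```

A well-calibrated model pushes the first number up and the second down at the same time, which is why the summary reports them together.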

Deployment and access

The card does not disclose licensing terms, API pricing structure, or specific access restrictions. It notes the model is available at faster speed and lower cost than Claude 3 Opus. No further deployment details are provided in this document.

Limitations

The card acknowledges that HHH alignment training "may cause it to hedge concerning answers," meaning capability evaluations could underestimate an equivalent helpful-only model's performance. Anthropic states "we still have a lot to learn about the science of evaluations and ideal cadences for performing these tests." No other performance limitations or failure modes are flagged by the lab in this document.

What's new

This document is an addendum to the Claude 3 Model Card and does not contain a versioned changelog. Relative to Claude 3 Opus, the key reported deltas are a 26-percentage-point gain on the internal agentic coding evaluation (64% vs. 38%), improved scores across all five vision benchmarks tested, and substantially better refusal calibration on XSTest (1.7% incorrect refusals vs. 8.3% for Opus). Human raters preferred Claude 3.5 Sonnet over Opus across all evaluated task categories, with domain-expert win rates reaching 82% in Law and 73% in Finance and Philosophy.
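The deltas quoted above mix absolute differences (percentage points) with what could also be read as relative changes. The short sketch below recomputes both from the figures in this summary; the helper names are illustrative and the relative-change figures are derived here, not reported in the card.

```python
def percentage_point_delta(new_pct, old_pct):
    """Absolute difference between two percentages, in percentage points."""
    return round(new_pct - old_pct, 1)


def relative_change(new_pct, old_pct):
    """Change expressed as a percentage of the old value."""
    return round((new_pct - old_pct) / old_pct * 100, 1)


# Figures quoted in the summary: agentic coding (64% vs. 38%) and
# XSTest incorrect refusals (1.7% vs. 8.3%).
agentic_gain_pp = percentage_point_delta(64.0, 38.0)
xstest_delta_pp = percentage_point_delta(1.7, 8.3)
xstest_relative = relative_change(1.7, 8.3)

print(agentic_gain_pp)   # 26.0 percentage points
print(xstest_delta_pp)   # -6.6 percentage points
print(xstest_relative)   # -79.5 percent relative to Opus
```

Reading the XSTest figures this way, the incorrect refusal rate falls by 6.6 percentage points, which is roughly a four-fifths reduction relative to Claude 3 Opus.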

Extracted Evaluations (21 results)

Category	State	Score	Setup	Source
coding	scored	92.0%	0-shot; missing: method, language, training state	self-reported
coding	scored	64.0	missing: shot count, method, language, training state	self-reported
knowledge	scored	90.4%	5-shot CoT; missing: language, training state	self-reported
knowledge	scored	88.7%	5-shot; missing: method, language, training state	self-reported
knowledge	scored	88.3%	0-shot CoT; missing: language, training state	self-reported
math	scored	96.4%	0-shot CoT; missing: language, training state	self-reported
math	scored	71.1%	0-shot CoT; missing: language, training state	self-reported
multilingual	scored	91.6%	0-shot CoT; missing: language, training state	self-reported
multimodal	scored	95.2	0-shot; missing: method, language, training state	self-reported
multimodal	scored	90.8	0-shot; missing: method, language, training state	self-reported
multimodal	scored	67.7	0-shot; missing: method, language, training state	self-reported
other	scored	99.7	missing: shot count, method, language, training state	self-reported
other	scored	96.4	missing: shot count, method, language, training state	self-reported
other	scored	94.7	0-shot; missing: method, language, training state	self-reported
other	scored	11.0	missing: shot count, method, language, training state	self-reported
reasoning	scored	93.1	3-shot CoT; missing: language, training state	self-reported
reasoning	scored	87.1%	3-shot; missing: method, language, training state	self-reported
reasoning	scored	67.2	5-shot CoT; missing: language, training state	self-reported
reasoning	scored	59.4	0-shot CoT; missing: language, training state	self-reported
safety	scored	1.7	missing: shot count, method, language, training state	self-reported
vision	scored	68.3%	0-shot; missing: method, language, training state	self-reported
