Claude 3.5 Model Card Addendum

Model card · 2,958 words · 13 min read · Mar 31, 2026
Summary

A 622-word brief of a 2,958-word document. Published by Anthropic. Version dated Mar 31, 2026.
01

What this is

Claude 3.5 Sonnet is a large language model released by Anthropic, documented in an addendum to the Claude 3 Model Card rather than a new standalone card. It supersedes Claude 3 Opus as Anthropic's most capable publicly released model while operating faster and at lower cost. The document version date is 2026-03-31.

02

Capabilities

Claude 3.5 Sonnet scores 59.4% on GPQA Diamond (0-shot CoT), 92.0% on HumanEval (0-shot), and 90.4% on MMLU (5-shot CoT), outperforming Claude 3 Opus on all benchmarks tested. Across the five vision benchmarks tested, its reported scores include 95.2% on DocVQA, 94.7% on AI2D, and 90.8% on ChartQA. On an internal agentic coding evaluation, with tasks drawn from real pull requests to open source codebases, it solves 64% of problems versus 38% for Claude 3 Opus. The model supports a 200k token context window and achieves 99.7% average recall on the Needle in a Haystack evaluation at that length.

03

Evaluation methodology

Anthropic tested the model on industry-standard benchmarks covering reasoning, math, coding, reading comprehension, and vision, including GPQA, MMLU, HumanEval, MATH, MathVista, ChartQA, DocVQA, and BIG-Bench Hard. Human preference win rates were gathered by asking raters to chat with models and score them on task-specific criteria, using Claude 3 Opus as the 50% baseline. Refusal calibration was measured using the Wildchat and XSTest datasets. The agentic coding evaluation ran in a secure sandboxed environment without internet access, with acceptance tests not visible to the model.
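
The win-rate framing can be illustrated with a minimal sketch. It assumes each head-to-head comparison against the baseline is recorded as "win", "loss", or "tie", and that a tie counts as half a win; the card does not describe its tie handling, so both the labels and that rule are assumptions. Under this scoring, a model that is indistinguishable from the baseline lands at 50%. The ratings below are hypothetical, not data from the card.

```python
from collections import Counter


def win_rate(outcomes):
    """Return the candidate's win rate against the baseline, in percent.

    `outcomes` holds one label per comparison: "win", "loss", or "tie".
    Ties count as half a win (an assumption, not the card's stated rule).
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    counts = Counter(outcomes)
    unknown = set(counts) - {"win", "loss", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / len(outcomes)


# Hypothetical ratings for illustration only.
ratings = ["win"] * 6 + ["loss"] * 3 + ["tie"] * 1
print(f"win rate vs. baseline: {win_rate(ratings):.1f}%")
```

Reporting against a fixed 50% baseline makes scores comparable across task categories, since every category shares the same reference model.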

04

Safety testing

Anthropic conducted evaluations in CBRN, cybersecurity, and autonomous capabilities, with quantitative thresholds of concern defined for each domain that, if exceeded, would have triggered a council review and potential ASL escalation. The UK AISI conducted pre-deployment testing on a near-final model and shared results with the US AI Safety Institute under a bilateral Memorandum of Understanding. METR performed "an initial exploration of the model's autonomy-relevant capabilities." The card states "we observed an increase in capabilities in risk-relevant areas compared to Claude 3 Opus" but that Claude 3.5 Sonnet did not exceed any preset thresholds. The model does not trigger the 4x effective compute threshold that would require Anthropic's full RSP evaluation protocol, though Anthropic states it conducts safety testing voluntarily regardless.
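
The threshold logic described above amounts to a per-domain comparison. The sketch below is only an illustration of that decision rule: the `DomainResult` type, the score scale, and every number are hypothetical, since the card summary gives no threshold values, and "exceeded" is read here as strictly greater than.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainResult:
    domain: str
    score: float      # hypothetical evaluation score
    threshold: float  # hypothetical preset threshold of concern


def domains_exceeding(results):
    """Return the names of domains whose score is strictly above its threshold."""
    return [r.domain for r in results if r.score > r.threshold]


# Placeholder values for illustration only; none come from the card.
results = [
    DomainResult("CBRN", score=0.20, threshold=0.50),
    DomainResult("cybersecurity", score=0.35, threshold=0.50),
    DomainResult("autonomy", score=0.10, threshold=0.50),
]

flagged = domains_exceeding(results)
if flagged:
    print("Escalate for review:", ", ".join(flagged))
else:
    print("No preset threshold exceeded")
```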

05

Mitigations

Claude 3.5 Sonnet is trained with Helpful, Honest, and Harmless (HHH) alignment, yielding a 96.4% correct refusal rate on toxic Wildchat prompts and only a 1.7% incorrect refusal rate on XSTest. The card states the model "refused ASL-3 harmful queries adequately to substantially reduce its usefulness for clearly harmful queries." Anthropic classifies the model at ASL-2 under its Responsible Scaling Policy, indicating it does not pose risk of catastrophic harm. No additional deployment-time classifiers or tiered access controls are described in this document.
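
The two refusal metrics are the same calculation applied to different prompt sets: on harmful prompts a refusal is the correct outcome, while on benign prompts a refusal is an error. A minimal sketch, assuming each prompt has already been labeled as refused or answered (how the card's graders made that call is not described here, and the outcome lists are hypothetical):

```python
def refusal_rate(refused):
    """Return the percentage of prompts refused.

    `refused` is a list of booleans, one per prompt (True means refused).
    """
    if not refused:
        raise ValueError("need at least one prompt")
    return 100.0 * sum(refused) / len(refused)


# Hypothetical outcomes for illustration only.
harmful_prompts = [True] * 9 + [False] * 1   # refusing is correct here
benign_prompts = [True] * 1 + [False] * 19   # refusing is incorrect here

correct_refusal = refusal_rate(harmful_prompts)
incorrect_refusal = refusal_rate(benign_prompts)
print(f"correct refusals: {correct_refusal:.1f}%")
print(f"incorrect refusals: {incorrect_refusal:.1f}%")
```

A well-calibrated model pushes the first number up and the second down at the same time, which is why the card reports both.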

06

Deployment and access

The card does not disclose licensing terms, API pricing structure, or specific access restrictions. It notes the model is available at faster speed and lower cost than Claude 3 Opus. No further deployment details are provided in this document.

07

Limitations

The card acknowledges that HHH alignment training "may cause it to hedge concerning answers," meaning capability evaluations could underestimate an equivalent helpful-only model's performance. Anthropic states "we still have a lot to learn about the science of evaluations and ideal cadences for performing these tests." No other performance limitations or failure modes are flagged by the lab in this document.

08

What's new

This document is an addendum to the Claude 3 Model Card and does not contain a versioned changelog. Relative to Claude 3 Opus, the key reported deltas are a 26-percentage-point gain on the internal agentic coding evaluation (64% vs. 38%), improved scores across all five vision benchmarks tested, and substantially better refusal calibration on XSTest (1.7% incorrect refusals vs. 8.3% for Opus). Human raters preferred Claude 3.5 Sonnet over Opus across all evaluated task categories, with domain-expert win rates reaching 82% in Law and 73% in Finance and Philosophy.
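
The deltas quoted above are percentage-point differences, which are distinct from relative changes. The following sketch recomputes both from the figures in this section; the relative-change values are derived here and are not stated in the card.

```python
def point_change(new, old):
    """Absolute difference in percentage points."""
    return new - old


def relative_change(new, old):
    """Change relative to the old value, in percent."""
    return 100.0 * (new - old) / old


# Scores as reported in this section (Claude 3.5 Sonnet vs. Claude 3 Opus).
agentic_new, agentic_old = 64.0, 38.0
xstest_new, xstest_old = 1.7, 8.3

print(f"agentic coding: {point_change(agentic_new, agentic_old):+.1f} points, "
      f"{relative_change(agentic_new, agentic_old):+.1f}% relative")
print(f"XSTest incorrect refusals: {point_change(xstest_new, xstest_old):+.1f} points, "
      f"{relative_change(xstest_new, xstest_old):+.1f}% relative")
```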

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 0cd0978a898c).

Extracted Evaluations (27 results)

| Benchmark | Category | State | Score (%) | Variant | Source |
|---|---|---|---|---|---|
| HumanEval | coding | scored | 92.0 | 0-shot | self-reported |
| Agentic Coding | coding | scored | 64.0 | - | self-reported |
| MMLU | general_knowledge | scored | 90.4 | 5-shot CoT | self-reported |
| MMLU | general_knowledge | scored | 88.7 | 5-shot | self-reported |
| MMLU | general_knowledge | scored | 88.3 | 0-shot CoT | self-reported |
| GSM8K | math | scored | 96.4 | 0-shot CoT | self-reported |
| GSM8K | math | scored | 94.1 | 8-shot CoT | self-reported |
| GSM8K | math | scored | 90.8 | 11-shot | self-reported |
| MATH | math | scored | 71.1 | 0-shot CoT | self-reported |
| MATH | math | scored | 67.7 | 4-shot | self-reported |
| MATH | math | scored | 57.8 | 4-shot CoT | self-reported |
| MGSM | multilingual | scored | 91.6 | 0-shot CoT | self-reported |
| MGSM | multilingual | scored | 87.5 | 8-shot | self-reported |
| DocVQA | multimodal | scored | 95.2 | 0-shot | self-reported |
| ChartQA | multimodal | scored | 90.8 | 0-shot | self-reported |
| MMMU | multimodal | scored | 68.3 | 0-shot | self-reported |
| MathVista | multimodal | scored | 67.7 | 0-shot | self-reported |
| Needle In A Haystack | other | scored | 99.7 | 200k context | self-reported |
| Wildchat Toxic | other | scored | 96.4 | - | self-reported |
| AI2D | other | scored | 94.7 | 0-shot | self-reported |
| Wildchat Non-toxic | other | scored | 11.0 | - | self-reported |
| BIG-Bench Hard | reasoning | scored | 93.1 | 3-shot CoT | self-reported |
| DROP | reasoning | scored | 87.1 | 3-shot | self-reported |
| DROP | reasoning | scored | 74.9 | variable shots | self-reported |
| GPQA-Diamond | reasoning | scored | 67.2 | Maj@32 5-shot CoT | self-reported |
| GPQA-Diamond | reasoning | scored | 59.4 | 0-shot CoT | self-reported |
| XSTest | safety | scored | 1.7 | - | self-reported |