Model Card Explorer

Summary

Claude 3.5 Haiku Addendum

A 564-word brief of a 5,818-word document. Published by Anthropic. Version dated Mar 31, 2026.

What this is

This addendum, version-dated 2026-03-31, covers two new Anthropic models: an upgraded Claude 3.5 Sonnet and the new Claude 3.5 Haiku. The upgraded Sonnet supersedes the original Claude 3.5 Sonnet and introduces computer use via screenshot interpretation; Claude 3.5 Haiku is a new, smaller addition to the Claude 3 family. Both models advance performance in reasoning, coding, and visual processing.

Capabilities

The upgraded Claude 3.5 Sonnet scores 49.0% on SWE-bench Verified (pass@1), 14.9% on OSWorld screenshot-only tasks at 15 steps (rising to 22% with 50 interaction steps), 65.0% on GPQA Diamond, 78.3% on MATH, and 93.7% on HumanEval. It supports text and vision modalities and adds computer use. Claude 3.5 Haiku launches as text-only, scoring 40.6% on SWE-bench Verified, 88.1% on HumanEval, and 69.2% on MATH; its SWE-bench Verified score exceeds the original Claude 3.5 Sonnet's 33.4%. Context window size is not disclosed in this document.

Evaluation methodology

Evaluations span standard benchmarks (GPQA, MMLU, MATH, HumanEval, BIG-Bench Hard), two new-to-suite benchmarks (AIME 2024 and IFEval), and agentic task benchmarks (SWE-bench Verified, TAU-bench, OSWorld). Human preference win rates were collected by asking raters to compare the new models against a Claude 3.5 Sonnet baseline across coding, document analysis, creative writing, and instruction following. The report excludes comparisons with OpenAI o1 on the grounds that its pre-response computation places such comparisons "outside the scope of this report." Contamination controls are not discussed.

Safety testing

Trust and Safety red-teaming covered 14 policy areas in 6 languages (English, Arabic, Spanish, Hindi, Tagalog, and Chinese), with attention to Elections Integrity, Child Safety, Cyber Attacks, Hate and Discrimination, and Violent Extremism. Frontier risk evaluations under the Responsible Scaling Policy assessed CBRN knowledge and skills, cybersecurity via capture-the-flag challenges (pwn, reverse engineering, cryptography, web, and network), and autonomous software engineering. The upgraded Sonnet "showed increased capabilities in risk-relevant areas" across all domains but "did not demonstrate capabilities requiring ASL-3 safeguards." Pre-deployment testing was conducted jointly by the US and UK AI Safety Institutes and independently by METR. Computer use red-teaming identified risks including scaled account creation, age assurance bypass, and abusive form filling, though "none were deemed imminent."

Mitigations

Computer use is classified under ASL-2, meaning Anthropic found no indicators of catastrophic risk at this capability level. Both models received targeted training to resist prompt injection, and new classifiers were developed to identify potential computer use misuse. Anthropic recommends deployment-side precautions for computer use: dedicated virtual machines, limited access to sensitive data, restricted internet access, and keeping a human in the loop for sensitive tasks.

Deployment and access

Claude 3.5 Haiku launches initially as a text-only model. The upgraded Claude 3.5 Sonnet's computer use capability is available via Anthropic's API. License terms, pricing, and specific access tiers are not disclosed in this document.

Limitations

OSWorld performance of 14.9% (screenshot-only, 15 steps) remains "well below human performance of 72.36%," and computer use is described as "still in its early stages of development." Both models "struggled with nuanced requests or those framed as fiction, roleplaying, or artistic content." The document does not discuss unknown risks from agentic deployment at scale.

What's new

Relative to the original Claude 3.5 Sonnet, the upgraded version adds computer use via screenshot interpretation, raises SWE-bench Verified from 33.4% to 49.0%, and improves TAU-bench retail from 62.6% to 69.2% and airline from 36.0% to 46.0%. Claude 3.5 Haiku is an entirely new model with a knowledge cutoff of July 2024, versus April 2024 for the upgraded Sonnet. AIME 2024 and IFEval are new additions to Anthropic's evaluation suite introduced with this release.
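
The gains quoted above can be restated as absolute and relative improvements. The following is a minimal sketch that uses only the three score pairs given in this section; the names `SCORES`, `gains`, and `summarize` are illustrative and do not come from the source document.

```python
from __future__ import annotations

# (original Claude 3.5 Sonnet, upgraded Claude 3.5 Sonnet), in percent,
# exactly as quoted in the "What's new" paragraph.
SCORES = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


def summarize(scores: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Apply gains() to every benchmark in the mapping."""
    return {name: gains(before, after) for name, (before, after) in scores.items()}


if __name__ == "__main__":
    for name, (points, percent) in summarize(SCORES).items():
        print(f"{name}: +{points} points ({percent}% relative)")
```

Reporting both figures keeps the comparison honest: a 10.0-point gain on TAU-bench airline is a larger relative change than the 6.6-point gain on TAU-bench retail because the airline baseline is lower.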

Category	State	Score	Setup	Source
agent	scored	54.2	missing: shot count; missing: method; missing: language; missing: training state	self-reported
agent	scored	41.7	missing: shot count; missing: method; missing: language; missing: training state	self-reported
agent	scored	22.0	missing: shot count; missing: method; missing: language; missing: training state	self-reported
agent	scored	14.9	missing: shot count; missing: method; missing: language; missing: training state	self-reported
coding	scored	93.7%	0-shot; missing: method; missing: language; missing: training state	self-reported
coding	scored	78.0	missing: shot count; missing: method; missing: language; missing: training state	self-reported
coding	scored	49.0%	missing: shot count; missing: method; missing: language; missing: training state	self-reported
instruction_following	scored	90.2%	missing: shot count; missing: method; missing: language; missing: training state	self-reported
knowledge	scored	90.5%	5-shot; CoT; missing: language; missing: training state	self-reported
knowledge	scored	89.3%	0-shot; CoT; missing: language; missing: training state	self-reported
knowledge	scored	88.7%	5-shot; missing: method; missing: language; missing: training state	self-reported
knowledge	scored	78.0%	0-shot; CoT; missing: language; missing: training state	self-reported
math	scored	78.3%	0-shot; CoT; missing: language; missing: training state	self-reported
math	scored	27.6	0-shot; CoT; missing: language; missing: training state	self-reported
math	scored	16.0	0-shot; CoT; missing: language; missing: training state	self-reported
multilingual	scored	92.5%	0-shot; CoT; missing: language; missing: training state	self-reported
multimodal	scored	94.2	0-shot; missing: method; missing: language; missing: training state	self-reported
multimodal	scored	90.8	0-shot; CoT; missing: language; missing: training state	self-reported
multimodal	scored	70.7	0-shot; CoT; missing: language; missing: training state	self-reported
other	scored	95.3	0-shot; missing: method; missing: language; missing: training state	self-reported
other	scored	89.2	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	69.2	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	61.0 win rate	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	58.0 win rate	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	57.0 win rate	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	52.0 win rate	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	51.0 win rate	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	46.0	missing: shot count; missing: method; missing: language; missing: training state	self-reported
other	scored	5.3	missing: shot count; missing: method; missing: language; missing: training state	self-reported
reasoning	scored	93.2	3-shot; CoT; missing: language; missing: training state	self-reported
reasoning	scored	88.3%	3-shot; missing: method; missing: language; missing: training state	self-reported
reasoning	scored	65.0	0-shot; CoT; missing: language; missing: training state	self-reported
safety	scored	4.3	missing: shot count; missing: method; missing: language; missing: training state	self-reported
vision	scored	70.4%	0-shot; missing: method; missing: language; missing: training state	self-reported
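
Each Setup cell in the table above packs a shot count, a prompting method, and one or more "missing:" flags into a single string. The following is a minimal sketch of how such a cell might be split into structured fields. It assumes the only tokens that occur are the ones visible in this table (an `N-shot` count, `CoT`, and the four "missing:" labels); the function name `parse_setup` and the keys `shot_count`, `method`, and `missing` are hypothetical, not part of the source page.

```python
import re
from typing import List, Optional, TypedDict


class Setup(TypedDict):
    shot_count: Optional[int]  # None when the cell reports "missing: shot count"
    method: Optional[str]      # "CoT" or None; no other method appears in the table
    missing: List[str]         # fields the page flags as not reported


# Token patterns taken from the table text; both are assumptions about its format.
_SHOT = re.compile(r"(\d+)-shot")
_MISSING = re.compile(r"missing: (shot count|method|language|training state)")


def parse_setup(cell: str) -> Setup:
    """Split one Setup cell into shot count, method, and missing-field flags.

    Works whether the tokens are run together or separated by punctuation,
    because each token is located by pattern rather than by position.
    """
    shot = _SHOT.search(cell)
    return {
        "shot_count": int(shot.group(1)) if shot else None,
        "method": "CoT" if "CoT" in cell else None,
        "missing": _MISSING.findall(cell),
    }


if __name__ == "__main__":
    print(parse_setup("5-shot; CoT; missing: language; missing: training state"))
```

Matching by pattern rather than splitting on a delimiter is a deliberate choice here: it tolerates cells where the tokens have been fused together during extraction.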

Extracted Evaluations (34 results)