Claude 3.5 Haiku Addendum

Model card · 5,818 words · 25 min read · Mar 31, 2026
Summary

A 564-word brief of a 5,818-word document. Published by Anthropic. Version dated Mar 31, 2026.

01

What this is

This addendum covers two new Anthropic models: an upgraded Claude 3.5 Sonnet and the new Claude 3.5 Haiku, version-dated 2026-03-31. The upgraded Sonnet supersedes the original Claude 3.5 Sonnet and introduces computer use via screenshot interpretation; Claude 3.5 Haiku is a new, smaller addition to the Claude 3 family. Both models advance performance in reasoning, coding, and visual processing.

02

Capabilities

The upgraded Claude 3.5 Sonnet scores 49.0% on SWE-bench Verified (pass@1), 14.9% on OSWorld screenshot-only tasks with 15 steps (rising to 22.0% with 50 steps), 65.0% on GPQA Diamond, 78.3% on MATH, and 93.7% on HumanEval. It supports text and vision modalities and adds computer use. Claude 3.5 Haiku launches as text-only, scoring 40.6% on SWE-bench Verified, 88.1% on HumanEval, and 69.2% on MATH; its SWE-bench Verified score exceeds the original Claude 3.5 Sonnet's 33.4%. Context window size is not disclosed in this document.
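The pass@1 figure quoted for SWE-bench Verified is the k = 1 case of the pass@k family of metrics. This document does not say how its numbers were computed, so the following is only a minimal sketch, assuming the commonly used unbiased estimator in which n samples are drawn per problem and c of them pass. The function names are illustrative, not taken from the source.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample. math.comb returns 0 when
    # k > n - c, so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem (n = 1, k = 1), the estimate reduces to the plain fraction of problems solved on the first attempt.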

03

Evaluation methodology

Evaluations span standard benchmarks (GPQA, MMLU, MATH, HumanEval, BIG-Bench Hard), two new-to-suite benchmarks (AIME 2024 and IFEval), and agentic task benchmarks (SWE-bench Verified, TAU-bench, OSWorld). Human preference win rates were collected by asking raters to compare the new models against Claude 3.5 Sonnet as the baseline across coding, document analysis, creative writing, and instruction following. The report excludes OpenAI o1 from its comparisons, stating that the o1 models' pre-response computation makes such comparisons "outside the scope of this report." Contamination controls are not discussed.
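To make the win-rate metric concrete, here is a small sketch of how a pairwise preference win rate and its sampling uncertainty could be computed. The source gives neither rater counts nor its handling of ties, so the counts below are hypothetical and ties are assumed to be excluded before counting.

```python
from math import sqrt

def win_rate(wins: int, losses: int) -> tuple[float, float]:
    """Return (win rate, standard error) for pairwise comparisons, ties excluded."""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one decided comparison")
    p = wins / n
    # Standard error of a binomial proportion (normal approximation).
    se = sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical tally: 61 preferences for the new model, 39 for the baseline.
rate, se = win_rate(61, 39)
print(f"win rate {rate:.1%} +/- {se:.1%}")
```

A win rate near 50% with a standard error of several points would be hard to distinguish from parity, which is why the number of comparisons matters when reading such figures.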

04

Safety testing

Trust and Safety red-teaming covered 14 policy areas in 6 languages (English, Arabic, Spanish, Hindi, Tagalog, and Chinese), with attention to Elections Integrity, Child Safety, Cyber Attacks, Hate and Discrimination, and Violent Extremism. Frontier risk evaluations under the Responsible Scaling Policy assessed CBRN knowledge and skills, cybersecurity via capture-the-flag challenges (pwn, reverse engineering, cryptography, web, and network), and autonomous software engineering. The upgraded Sonnet "showed increased capabilities in risk-relevant areas" across all domains but "did not demonstrate capabilities requiring ASL-3 safeguards." Pre-deployment testing was conducted jointly by the US and UK AI Safety Institutes and independently by METR. Computer use red-teaming identified risks including scaled account creation, age assurance bypass, and abusive form filling, though "none were deemed imminent."

05

Mitigations

Computer use is classified under ASL-2, indicating Anthropic does not find indicators of catastrophic risk at this capability level. Both models received targeted training to resist prompt injection, and new classifiers were developed to identify potential computer use misuse. Anthropic recommends deployment-side precautions for computer use: dedicated virtual machines, limited access to sensitive data, restricted internet access, and keeping a human in the loop for sensitive tasks.
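The human-in-the-loop recommendation can be illustrated with a small gating sketch. Everything here is hypothetical: the source does not define which tasks count as sensitive, and the `Action` type, the `SENSITIVE_KINDS` set, and the `run_with_oversight` function are invented for illustration and are not part of any Anthropic API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action categories; the source does not list sensitive task types.
SENSITIVE_KINDS = {"submit_form", "send_message", "make_purchase", "delete_file"}

@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click", "type", "make_purchase"
    target: str  # free-form description of what the action touches

def run_with_oversight(
    actions: list[Action],
    execute: Callable[[Action], None],
    approve: Callable[[Action], bool],
) -> list[Action]:
    """Execute actions in order, asking a human approver before sensitive ones.

    Returns the actions that were skipped because approval was denied.
    """
    skipped: list[Action] = []
    for action in actions:
        if action.kind in SENSITIVE_KINDS and not approve(action):
            skipped.append(action)
            continue
        execute(action)
    return skipped
```

In a real deployment the `approve` callback would prompt an operator, and the whole loop would run inside the dedicated virtual machine with restricted network and data access that the paragraph describes.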

06

Deployment and access

Claude 3.5 Haiku launches initially as a text-only model. The upgraded Claude 3.5 Sonnet's computer use capability is available via Anthropic's API. License terms, pricing, and specific access tiers are not disclosed in this document.

07

Limitations

OSWorld performance of 14.9% (screenshot-only, 15 steps) remains "well below human performance of 72.36%," and computer use is described as "still in its early stages of development." Both models "struggled with nuanced requests or those framed as fiction, roleplaying, or artistic content." The document does not discuss unknown risks from agentic deployment at scale.

08

What's new

Relative to the original Claude 3.5 Sonnet, the upgraded version adds computer use via screenshot interpretation, raises SWE-bench Verified from 33.4% to 49.0%, and improves TAU-bench retail from 62.6% to 69.2% and airline from 36.0% to 46.0%. Claude 3.5 Haiku is an entirely new model with a knowledge cutoff of July 2024, versus April 2024 for the upgraded Sonnet. AIME 2024 and IFEval are new additions to Anthropic's evaluation suite introduced with this release.
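The size of these improvements is easier to compare when expressed both in percentage points and as relative gains. The short sketch below does that arithmetic using only the scores quoted in the paragraph above; the helper name `gains` is illustrative.

```python
# (original, upgraded) scores in percent, taken from the paragraph above.
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}

def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    return after - before, (after - before) / before * 100

for name, (before, after) in scores.items():
    points, relative = gains(before, after)
    print(f"{name}: +{points:.1f} points ({relative:.1f}% relative)")
```

This works out to gains of 15.6, 6.6, and 10.0 percentage points, or roughly 46.7%, 10.5%, and 27.8% relative to the original scores.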

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 5186639daada).

Extracted Evaluations (36 results)

| Benchmark | Category | State | Score | Variant | Source |
|---|---|---|---|---|---|
| OSWorld | agent | scored | 75.0 | human | self-reported |
| OSWorld | agent | scored | 54.2 | 15 steps | self-reported |
| OSWorld | agent | scored | 41.7 | 50 steps | self-reported |
| OSWorld | agent | scored | 22.0 | 50 steps, screenshot only | self-reported |
| OSWorld | agent | scored | 14.9 | 15 steps, screenshot only | self-reported |
| HumanEval | coding | scored | 93.7 | 0-shot | self-reported |
| Agentic Coding | coding | scored | 78.0 | - | self-reported |
| SWE-bench | coding | scored | 49.0 | - | self-reported |
| MMLU | general_knowledge | scored | 90.5 | 5-shot CoT | self-reported |
| MMLU | general_knowledge | scored | 89.3 | 0-shot CoT | self-reported |
| MMLU | general_knowledge | scored | 88.7 | 5-shot | self-reported |
| MMLU-Pro | general_knowledge | scored | 78.0 | 0-shot CoT | self-reported |
| MATH | math | scored | 86.5 | 4-shot CoT | self-reported |
| MATH | math | scored | 78.3 | 0-shot CoT | self-reported |
| AIME 2024 | math | scored | 27.6 | Maj@64, 0-shot CoT | self-reported |
| AIME 2024 | math | scored | 16.0 | 0-shot CoT | self-reported |
| MGSM | multilingual | scored | 92.5 | 0-shot CoT | self-reported |
| DocVQA | multimodal | scored | 94.2 | 0-shot, test | self-reported |
| ChartQA | multimodal | scored | 90.8 | 0-shot CoT, test, relaxed accuracy | self-reported |
| MathVista | multimodal | scored | 70.7 | 0-shot CoT, testmini | self-reported |
| MMMU | multimodal | scored | 70.4 | 0-shot, validation | self-reported |
| AI2D | other | scored | 95.3 | 0-shot, test | self-reported |
| Wildchat Toxic | other | scored | 89.2 | - | self-reported |
| TAU-bench (Retail) | other | scored | 69.2 | pass^1 | self-reported |
| Human Feedback - Document Analysis | other | scored | 61.0 | win rate vs Claude 3.5 Sonnet | self-reported |
| Human Feedback - Creative Writing | other | scored | 58.0 | win rate vs Claude 3.5 Sonnet | self-reported |
| Human Feedback - Visual Understanding | other | scored | 57.0 | win rate vs Claude 3.5 Sonnet | self-reported |
| Human Feedback - Coding | other | scored | 52.0 | win rate vs Claude 3.5 Sonnet | self-reported |
| Human Feedback - Instruction Following | other | scored | 51.0 | win rate vs Claude 3.5 Sonnet | self-reported |
| TAU-bench (Airline) | other | scored | 46.0 | pass^1 | self-reported |
| Wildchat Non-toxic | other | scored | 5.3 | - | self-reported |
| BIG-Bench Hard | reasoning | scored | 93.2 | 3-shot CoT | self-reported |
| IFEval | reasoning | scored | 90.2 | - | self-reported |
| DROP | reasoning | scored | 88.3 | 3-shot | self-reported |
| GPQA-Diamond | reasoning | scored | 65.0 | 0-shot CoT | self-reported |
| XSTest | safety | scored | 4.3 | - | self-reported |
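The "Maj@64" variant listed for AIME 2024 refers to majority voting over multiple sampled answers per problem. The source does not describe the voting procedure, so the sketch below assumes a simple plurality vote over final answer strings, with ties going to the answer that appeared first. The function names and the sample data in the usage lines are hypothetical.

```python
from collections import Counter

def majority_answer(samples: list[str]) -> str:
    """Return the most frequent final answer; ties go to the answer seen first."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(samples).most_common(1)[0][0]

def maj_at_k_accuracy(problems: list[tuple[list[str], str]]) -> float:
    """Fraction of problems whose majority-voted answer matches the reference."""
    correct = sum(majority_answer(samples) == truth for samples, truth in problems)
    return correct / len(problems)

# Hypothetical data: two problems with three sampled answers each.
problems = [
    (["204", "204", "73"], "204"),  # majority is correct
    (["1", "2", "2"], "1"),         # majority is wrong
]
print(maj_at_k_accuracy(problems))
```

Majority voting can raise accuracy above the single-sample rate when correct answers are sampled more often than any one wrong answer, which is consistent with the Maj@64 row scoring higher than the plain 0-shot CoT row.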