Claude 3.5 Haiku Addendum
What this is
This addendum covers two new Anthropic models: an upgraded Claude 3.5 Sonnet and the new Claude 3.5 Haiku, version-dated 2026-03-31. The upgraded Sonnet supersedes the original Claude 3.5 Sonnet and introduces computer use via screenshot interpretation; Claude 3.5 Haiku is a new, smaller addition to the Claude 3 family. Both models advance performance in reasoning, coding, and visual processing.
Capabilities
The upgraded Claude 3.5 Sonnet scores 49.0% on SWE-bench Verified (pass@1), 65.0% on GPQA Diamond, 78.3% on MATH, and 93.7% on HumanEval. On OSWorld screenshot-only tasks it scores 14.9% with 15 interaction steps, rising to 22% with 50 steps. It supports text and vision modalities and adds computer use. Claude 3.5 Haiku launches as text-only, scoring 40.6% on SWE-bench Verified, 88.1% on HumanEval, and 69.2% on MATH; its SWE-bench Verified score exceeds the original Claude 3.5 Sonnet's 33.4%. Context window size is not disclosed in this document.
Evaluation methodology
Evaluations span standard benchmarks (GPQA, MMLU, MATH, HumanEval, BIG-Bench Hard), two new-to-suite benchmarks (AIME 2024 and IFEval), and agentic task benchmarks (SWE-bench Verified, TAU-bench, OSWorld). Human preference win rates were collected by asking raters to compare the new models against Claude 3.5 Sonnet as the baseline across coding, document analysis, creative writing, and instruction following. The report excludes comparisons with OpenAI o1 models, stating that their pre-response computation places such comparisons "outside the scope of this report." Contamination controls are not discussed.
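As a rough illustration of how a win rate could be derived from pairwise rater judgments, the sketch below tallies comparisons against a baseline. The label schema, the function name win_rate, the sample data, and the rule that a tie counts as half a win are all assumptions made for illustration; the source does not describe how ties were handled or how rates were aggregated.

```python
from collections import Counter


def win_rate(judgments, tie_weight=0.5):
    """Return the share of pairwise comparisons credited to the new model.

    judgments: iterable of "new", "baseline", or "tie" labels (hypothetical schema).
    tie_weight: credit given to the new model for a tie (an assumption,
                not something the source specifies).
    """
    counts = Counter(judgments)
    total = counts["new"] + counts["baseline"] + counts["tie"]
    if total == 0:
        raise ValueError("no judgments supplied")
    return (counts["new"] + tie_weight * counts["tie"]) / total


# Made-up labels for a single task category; not data from the report.
sample = ["new", "new", "baseline", "tie", "new", "baseline", "new", "tie"]
print(f"win rate vs. baseline: {win_rate(sample):.1%}")
```

Setting tie_weight to 0.0 would instead treat ties as losses for the new model, which is why the tie rule matters when comparing reported win rates.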
Safety testing
Trust and Safety red-teaming covered 14 policy areas in 6 languages (English, Arabic, Spanish, Hindi, Tagalog, and Chinese), with attention to Elections Integrity, Child Safety, Cyber Attacks, Hate and Discrimination, and Violent Extremism. Frontier risk evaluations under the Responsible Scaling Policy assessed CBRN knowledge and skills, cybersecurity via capture-the-flag challenges (pwn, reverse engineering, cryptography, web, and network), and autonomous software engineering. The upgraded Sonnet "showed increased capabilities in risk-relevant areas" across all domains but "did not demonstrate capabilities requiring ASL-3 safeguards." Pre-deployment testing was conducted jointly by the US and UK AI Safety Institutes and independently by METR. Computer use red-teaming identified risks including scaled account creation, age assurance bypass, and abusive form filling, though "none were deemed imminent."
Mitigations
Computer use is classified under ASL-2, meaning Anthropic found no indicators of catastrophic risk at this capability level. Both models received targeted training to resist prompt injection, and new classifiers were developed to identify potential computer use misuse. Anthropic recommends deployment-side precautions for computer use: dedicated virtual machines, limited access to sensitive data, restricted internet access, and keeping a human in the loop for sensitive tasks.
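The human-in-the-loop recommendation can be pictured as a small approval gate placed in front of whatever executes an agent's actions. The sketch below is a deployer-side illustration only: ProposedAction, SENSITIVE_KINDS, requires_human_approval, and run_action are hypothetical names, the list of sensitive action kinds is an assumption, and none of this is part of Anthropic's API.

```python
from dataclasses import dataclass

# Hypothetical action kinds a deployer might treat as sensitive (assumption).
SENSITIVE_KINDS = {"purchase", "send_email", "delete_file", "submit_form"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str         # illustrative label such as "click" or "purchase"
    description: str  # human-readable summary shown to the reviewer


def requires_human_approval(action):
    """Return True when the action kind is on the sensitive list."""
    return action.kind in SENSITIVE_KINDS


def run_action(action, approve, execute):
    """Execute the action unless it is sensitive and the reviewer declines.

    approve and execute are callables supplied by the deployer; approve
    stands in for whatever review step (prompt, ticket, UI) is used.
    """
    if requires_human_approval(action) and not approve(action):
        return "blocked"
    execute(action)
    return "executed"


# Example: a reviewer who declines everything, and a list acting as the executor.
executed = []
print(run_action(ProposedAction("click", "Open the settings menu"),
                 approve=lambda a: False, execute=executed.append))
print(run_action(ProposedAction("purchase", "Buy the item in the cart"),
                 approve=lambda a: False, execute=executed.append))
```

A gate like this complements, rather than replaces, the other listed precautions such as running the agent in a dedicated virtual machine with restricted network access.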
Deployment and access
Claude 3.5 Haiku launches initially as a text-only model. The upgraded Claude 3.5 Sonnet's computer use capability is available via Anthropic's API. License terms, pricing, and specific access tiers are not disclosed in this document.
Limitations
OSWorld performance of 14.9% (screenshot-only, 15 steps) remains "well below human performance of 72.36%," and computer use is described as "still in its early stages of development." Both models "struggled with nuanced requests or those framed as fiction, roleplaying, or artistic content." The document does not discuss unknown risks from agentic deployment at scale.
What's new
Relative to the original Claude 3.5 Sonnet, the upgraded version adds computer use via screenshot interpretation, raises SWE-bench Verified from 33.4% to 49.0%, and improves TAU-bench retail from 62.6% to 69.2% and airline from 36.0% to 46.0%. Claude 3.5 Haiku is an entirely new model with a knowledge cutoff of July 2024, versus April 2024 for the upgraded Sonnet. AIME 2024 and IFEval are new additions to Anthropic's evaluation suite introduced with this release.
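To make the size of these improvements easier to compare, the short sketch below computes the absolute gain in percentage points and the relative gain for each benchmark, using only the scores quoted above. The dictionary layout and the helper name gains are illustrative choices, not anything defined by the source.

```python
# Scores in percent as quoted in the text: (original Sonnet, upgraded Sonnet).
SCORES = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in SCORES.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative to the original)")
```

The relative figure divides the gain by the original score, so a benchmark with a low starting point (such as TAU-bench airline) shows a larger relative change than its point gain alone suggests.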