Claude Opus 4.7 System Card

Model card · 63,024 words · 274 min read · May 15, 2026
Summary
63,024-word document condensed to 242 words. Anthropic · May 15, 2026
TL;DR

This system card describes Claude Opus 4.7, a large language model from Anthropic. Overall, the model shows superior capabilities to those of its predecessor, Claude Opus 4.6, but weaker capabilities than those of our most powerful model, Claude Mythos Preview.

Top benchmarks
Benchmark | Variant | Score
BBQ (Ambiguous Accuracy) | - | 100.0%
BBQ (Ambiguous Accuracy) | - | 99.9%
BBQ (Ambiguous Accuracy) | - | 99.7%
BBQ (Ambiguous Accuracy) | - | 97.5%
DeepSearchQA | extended-thinking, rlhf | 95.1%
GPQA-Diamond | extended-thinking, rlhf | 94.2%
GPQA-Diamond | - | 94.2%
ARC-AGI-1 | - | 93.0%

Showing top 8 of 77. See full list below.

Capability claim
  • We are releasing Opus 4.7 with a new set of cybersecurity safeguards.
Safety findings
  • We believe our risk mitigations are sufficient to make catastrophic risk from non-novel chemical/biological weapons production very low but not negligible.
  • We believe that catastrophic risk from novel chemical/biological weapons remains low (with substantial uncertainty). The overall picture is similar to the one from our most recent Risk Report.
  • We believe that the overall risk is very low, and that this model in particular adds little to the risk picture we previously laid out for Claude Mythos Preview.
Deployment scope
  • accessible to experts, we interpret a model’s performance on this task primarily based on the expert’s assessment of uplift.
Limitations the lab flags
  • limitations included sycophantic agreement under pushback, verbose responses that buried actionable content, degraded reference accuracy, and overconfidence in the feasibility of synthesis steps.
  • open questions, particularly around fully explaining the evaluation-awareness results, that they would have preferred more time to resolve; and that the internal-usage evidence base for this model was thinner than for some prior releases.

Every italicized passage is a verbatim substring of the source document (checked deterministically after extraction). Field selection is heuristic — some quotes may lack surrounding context and some claims may be absent if no matching pattern appeared. For citation, open the source: original model card · source SHA f055e7ef9acc · version dated May 15, 2026.
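
The deterministic check described above can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names, the optional whitespace normalization, and the choice of SHA-256 truncated to 12 hex characters for the short source identifier are all assumptions.

```python
import hashlib


def collapse_ws(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def is_verbatim_substring(quote: str, source: str, normalize: bool = False) -> bool:
    """Return True if `quote` appears character-for-character in `source`.

    With normalize=True (an assumption, not stated by the page), whitespace
    is collapsed on both sides first, so line breaks introduced during
    extraction do not cause false negatives.
    """
    if normalize:
        quote, source = collapse_ws(quote), collapse_ws(source)
    # An empty quote is trivially a substring of anything, so reject it.
    return bool(quote) and quote in source


def short_digest(source: str, length: int = 12) -> str:
    """Hypothetical short identifier: first `length` hex chars of SHA-256."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:length]
```

A quote that fails the strict check but passes with `normalize=True` most likely differs from the source only in line breaks or spacing, which is worth flagging rather than silently accepting.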

Extracted Evaluations (77 results)

Benchmark | Category | State | Score | Variant | Source
BrowseComp | agent | scored | 83.7 | rlhf | self-reported
BrowseComp | agent | scored | 79.3 | rlhf | self-reported
OSWorld | agent | scored | 78.0 | - | self-reported
OSWorld | agent | scored | 72.7 | - | self-reported
SWE-bench Verified | coding | scored | 87.6 | extended-thinking, rlhf | self-reported
SWE-bench Verified | coding | scored | 87.6 | - | self-reported
MMLU | general_knowledge | cited |  | - | self-reported
AIME | math | mentioned |  | - | self-reported
Multilingual MMLU | multilingual | scored | 91.5 | extended-thinking, Average, rlhf | self-reported
BBQ (Ambiguous Accuracy) | other | scored | 100.0 | - | self-reported
BBQ (Ambiguous Accuracy) | other | scored | 99.9 | - | self-reported
BBQ (Ambiguous Accuracy) | other | scored | 99.7 | - | self-reported
BBQ (Ambiguous Accuracy) | other | scored | 97.5 | - | self-reported
DeepSearchQA | other | scored | 95.1 | extended-thinking, rlhf | self-reported
ARC-AGI-1 | other | scored | 93.0 | - | self-reported
ARC-AGI-1 | other | scored | 92.0 | - | self-reported
DeepSearchQA | other | scored | 91.3 | extended-thinking, rlhf | self-reported
CharXiv Reasoning (With Tools) | other | scored | 91.0 | - | self-reported
BBQ (Disambiguated Accuracy) | other | scored | 90.9 | - | self-reported
DeepSearchQA | other | scored | 90.5 | extended-thinking, rlhf | self-reported
DeepSearchQA | other | scored | 89.1 | extended-thinking, rlhf | self-reported
BBQ (Disambiguated Accuracy) | other | scored | 88.1 | - | self-reported
ScreenSpot-Pro (With Tools) | other | scored | 87.6 | - | self-reported
OfficeQA | other | scored | 86.3 | - | self-reported
CharXiv Reasoning (With Tools) | other | scored | 84.7 | - | self-reported
BBQ (Disambiguated Accuracy) | other | scored | 84.6 | - | self-reported
ScreenSpot-Pro (With Tools) | other | scored | 83.1 | - | self-reported
CharXiv Reasoning (No Tools) | other | scored | 82.1 | - | self-reported
BBQ (Disambiguated Accuracy) | other | scored | 81.3 | - | self-reported
OfficeQA Pro | other | scored | 80.6 | - | self-reported
SWE-bench Multilingual | other | scored | 80.5 | extended-thinking, rlhf | self-reported
SWE-bench Multilingual | other | scored | 80.5 | Average | self-reported
ScreenSpot-Pro (No Tools) | other | scored | 79.5 | - | self-reported
DRACO | other | scored | 77.7 | extended-thinking, rlhf | self-reported
MCP Atlas | other | scored | 77.3 | - | self-reported
MCP Atlas | other | scored | 75.8 | - | self-reported
GraphWalks Parents | other | scored | 75.1 | extended-thinking, rlhf | self-reported
OfficeQA | other | scored | 73.5 | - | self-reported
Terminal-bench 2.0 | other | scored | 69.4 | rlhf | self-reported
USAMO 2026 | other | scored | 69.3 | extended-thinking, rlhf | self-reported
CharXiv Reasoning (No Tools) | other | scored | 69.1 | - | self-reported
Finance Agent | other | scored | 64.4 | - | self-reported
SWE-bench Pro | other | scored | 64.3 | - | self-reported
SWE-bench Pro | other | scored | 64.3 | extended-thinking, rlhf | self-reported
Finance Agent | other | scored | 60.1 | - | self-reported
GraphWalks BFS | other | scored | 58.6 | extended-thinking, rlhf | self-reported
ScreenSpot-Pro (No Tools) | other | scored | 57.7 | - | self-reported
OfficeQA Pro | other | scored | 57.1 | - | self-reported
USAMO 2026 | other | scored | 47.0 | extended-thinking, rlhf | self-reported
Humanity's Last Exam | other | scored | 46.9 | extended-thinking, rlhf | self-reported
SWE-bench Multimodal | other | scored | 34.5 | - | self-reported
SWE-bench Multimodal | other | scored | 34.5 | extended-thinking, rlhf | self-reported
DNA Synthesis Screening Evasion | other | scored | 8.0 | - | self-reported
BBQ (Ambiguous Bias) | other | scored | 1.4 | - | self-reported
Long-Form Virology Task 2 | other | scored | 0.9 | - | self-reported
Long-form Virology Task 1 | other | scored | 0.8 | - | self-reported
VCT (Multimodal Virology Evaluation) | other | scored | 0.6 | - | self-reported
VCT (Multimodal Virology Evaluation) | other | scored | 0.5 | - | self-reported
VCT (Multimodal Virology Evaluation) | other | scored | 0.5 | - | self-reported
BBQ (Ambiguous Bias) | other | scored | 0.1 | - | self-reported
BBQ (Ambiguous Bias) | other | scored | 0.0 | - | self-reported
BBQ (Ambiguous Bias) | other | scored | 0.0 | - | self-reported
BBQ (Disambiguated Bias) | other | scored | -0.7 | - | self-reported
BBQ (Disambiguated Bias) | other | scored | -0.7 | - | self-reported
BBQ (Disambiguated Bias) | other | scored | -1.6 | - | self-reported
BBQ (Disambiguated Bias) | other | scored | -1.7 | - | self-reported
Terminal-bench 2.0 | other | mentioned |  | - | self-reported
OpenAI MRCR v2 | other | mentioned |  | extended-thinking, rlhf | self-reported
Michelangelo | other | cited |  | - | self-reported
MathArena | other | cited |  | - | self-reported
Election Integrity Benchmark | other | mentioned |  | - | self-reported
Dyno Therapeutics Sequence Design Challenge | other | mentioned |  | - | self-reported
GPQA-Diamond | reasoning | scored | 94.2 | extended-thinking, rlhf | self-reported
GPQA-Diamond | reasoning | scored | 94.2 | - | self-reported
GPQA-Diamond | reasoning | scored | 91.3 | - | self-reported
ARC-AGI-2 | reasoning | scored | 75.8 | - | self-reported
ARC-AGI-2 | reasoning | scored | 68.8 | - | self-reported