Summary
63,024-word document condensed to 242 words. Anthropic · May 15, 2026
TL;DR
“This system card describes Claude Opus 4.7, a large language model from Anthropic. Overall, the model shows superior capabilities to those of its predecessor, Claude Opus 4.6, but weaker capabilities than those of our most powerful model, Claude Mythos Preview.”
Top benchmarks
| Benchmark | Variant | Score |
|---|---|---|
| BBQ (Ambiguous Accuracy) | — | 100.0% |
| BBQ (Ambiguous Accuracy) | — | 99.9% |
| BBQ (Ambiguous Accuracy) | — | 99.7% |
| BBQ (Ambiguous Accuracy) | — | 97.5% |
| DeepSearchQA | extended-thinking, rlhf | 95.1% |
| GPQA-Diamond | extended-thinking, rlhf | 94.2% |
| GPQA-Diamond | — | 94.2% |
| ARC-AGI-1 | — | 93.0% |
Showing top 8 of 77. See full list below.
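The page does not state how the top 8 rows were chosen, but the scores shown are consistent with sorting the scored rows by score in descending order and keeping the first eight. A minimal sketch of that selection follows, assuming a hypothetical `Eval` record and `top_n` helper (neither name comes from the source) and using a handful of rows copied from the full list:

```python
from typing import NamedTuple, Optional


class Eval(NamedTuple):
    # Hypothetical record type; the field names are assumptions for illustration.
    benchmark: str
    score: Optional[float]  # None for rows that are only "cited" or "mentioned"


def top_n(evals, n=8):
    """Return the n highest-scoring rows, skipping rows without a score."""
    scored = [e for e in evals if e.score is not None]
    # sorted() is stable, so tied scores keep their original relative order.
    return sorted(scored, key=lambda e: e.score, reverse=True)[:n]


rows = [
    Eval("BrowseComp", 83.7),
    Eval("MMLU", None),  # cited only, no score
    Eval("BBQ (Ambiguous Accuracy)", 100.0),
    Eval("GPQA-Diamond", 94.2),
    Eval("ARC-AGI-1", 93.0),
]

for e in top_n(rows, n=3):
    print(e.benchmark, e.score)
```

Note that a raw sort like this mixes metrics with different scales and directions (for example, accuracy percentages and bias scores near zero), so a high rank here does not by itself indicate a stronger result.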
Capability claim
- “We are releasing Opus 4.7 with a new set of cybersecurity safeguards.”
Safety findings
- “We believe our risk mitigations are sufficient to make catastrophic risk from non-novel chemical/biological weapons production very low but not negligible.”
- “We believe that catastrophic risk from novel chemical/biological weapons remains low (with substantial uncertainty). The overall picture is similar to the one from our most recent Risk Report.”
- “We believe that the overall risk is very low, and that this model in particular adds little to the risk picture we previously laid out for Claude Mythos Preview.”
Deployment scope
- “accessible to experts, we interpret a model’s performance on this task primarily based on the expert’s assessment of uplift.”
Limitations the lab flags
- “limitations included sycophantic agreement under pushback, verbose responses that buried actionable content, degraded reference accuracy, and overconfidence in the feasibility of synthesis steps.”
- “open questions — particularly around fully explaining the evaluation-awareness results — that they would have preferred more time to resolve; and that the internal-usage evidence base for this model was thinner than for some prior releases.”
Every quoted passage is a verbatim substring of the source document (checked deterministically after extraction). Field selection is heuristic: some quotes may lack surrounding context, and some claims may be absent if no matching pattern appeared. For citation, open the source: original model card · source SHA f055e7ef9acc · version dated May 15, 2026.
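A deterministic verbatim check of this kind can be sketched as a plain substring test. The version below is only an illustration, not the page's actual implementation: the `normalize` and `is_verbatim` helpers are hypothetical, the whitespace-collapsing step is an assumption (added so that line breaks in the extracted text do not cause false mismatches), and the `source` string is a stand-in for the full document text.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace to one space (an assumed normalization step)."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(quote: str, source: str) -> bool:
    """Return True if the quote occurs in the source, ignoring whitespace differences."""
    return normalize(quote) in normalize(source)


# Stand-in for the source document; the real text is far longer.
source = (
    "We are releasing Opus 4.7 with a new set of\n"
    "cybersecurity safeguards. Further text follows here."
)
quote = "We are releasing Opus 4.7 with a new set of cybersecurity safeguards."

print(is_verbatim(quote, source))
print(is_verbatim("This sentence is not in the source.", source))
```

Because the test is a pure string comparison, it confirms only that a quote exists in the source; it says nothing about whether the quote was filed under the right heading.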
Extracted Evaluations (77 results)
| Benchmark | Category | State | Score | Variant | Source |
|---|---|---|---|---|---|
| BrowseComp | agent | scored | 83.7 | rlhf | self-reported |
| BrowseComp | agent | scored | 79.3 | rlhf | self-reported |
| OSWorld | agent | scored | 78.0 | - | self-reported |
| OSWorld | agent | scored | 72.7 | - | self-reported |
| SWE-bench Verified | coding | scored | 87.6 | extended-thinking, rlhf | self-reported |
| SWE-bench Verified | coding | scored | 87.6 | - | self-reported |
| MMLU | general_knowledge | cited | — | - | self-reported |
| AIME | math | mentioned | — | - | self-reported |
| Multilingual MMLU | multilingual | scored | 91.5 | extended-thinking, Average, rlhf | self-reported |
| BBQ (Ambiguous Accuracy) | other | scored | 100.0 | - | self-reported |
| BBQ (Ambiguous Accuracy) | other | scored | 99.9 | - | self-reported |
| BBQ (Ambiguous Accuracy) | other | scored | 99.7 | - | self-reported |
| BBQ (Ambiguous Accuracy) | other | scored | 97.5 | - | self-reported |
| DeepSearchQA | other | scored | 95.1 | extended-thinking, rlhf | self-reported |
| ARC-AGI-1 | other | scored | 93.0 | - | self-reported |
| ARC-AGI-1 | other | scored | 92.0 | - | self-reported |
| DeepSearchQA | other | scored | 91.3 | extended-thinking, rlhf | self-reported |
| CharXiv Reasoning (With Tools) | other | scored | 91.0 | - | self-reported |
| BBQ (Disambiguated Accuracy) | other | scored | 90.9 | - | self-reported |
| DeepSearchQA | other | scored | 90.5 | extended-thinking, rlhf | self-reported |
| DeepSearchQA | other | scored | 89.1 | extended-thinking, rlhf | self-reported |
| BBQ (Disambiguated Accuracy) | other | scored | 88.1 | - | self-reported |
| ScreenSpot-Pro (With Tools) | other | scored | 87.6 | - | self-reported |
| OfficeQA | other | scored | 86.3 | - | self-reported |
| CharXiv Reasoning (With Tools) | other | scored | 84.7 | - | self-reported |
| BBQ (Disambiguated Accuracy) | other | scored | 84.6 | - | self-reported |
| ScreenSpot-Pro (With Tools) | other | scored | 83.1 | - | self-reported |
| CharXiv Reasoning (No Tools) | other | scored | 82.1 | - | self-reported |
| BBQ (Disambiguated Accuracy) | other | scored | 81.3 | - | self-reported |
| OfficeQA Pro | other | scored | 80.6 | - | self-reported |
| SWE-bench Multilingual | other | scored | 80.5 | extended-thinking, rlhf | self-reported |
| SWE-bench Multilingual | other | scored | 80.5 | Average | self-reported |
| ScreenSpot-Pro (No Tools) | other | scored | 79.5 | - | self-reported |
| DRACO | other | scored | 77.7 | extended-thinking, rlhf | self-reported |
| MCP Atlas | other | scored | 77.3 | - | self-reported |
| MCP Atlas | other | scored | 75.8 | - | self-reported |
| GraphWalks Parents | other | scored | 75.1 | extended-thinking, rlhf | self-reported |
| OfficeQA | other | scored | 73.5 | - | self-reported |
| Terminal-bench 2.0 | other | scored | 69.4 | rlhf | self-reported |
| USAMO 2026 | other | scored | 69.3 | extended-thinking, rlhf | self-reported |
| CharXiv Reasoning (No Tools) | other | scored | 69.1 | - | self-reported |
| Finance Agent | other | scored | 64.4 | - | self-reported |
| SWE-bench Pro | other | scored | 64.3 | - | self-reported |
| SWE-bench Pro | other | scored | 64.3 | extended-thinking, rlhf | self-reported |
| Finance Agent | other | scored | 60.1 | - | self-reported |
| GraphWalks BFS | other | scored | 58.6 | extended-thinking, rlhf | self-reported |
| ScreenSpot-Pro (No Tools) | other | scored | 57.7 | - | self-reported |
| OfficeQA Pro | other | scored | 57.1 | - | self-reported |
| USAMO 2026 | other | scored | 47.0 | extended-thinking, rlhf | self-reported |
| Humanity's Last Exam | other | scored | 46.9 | extended-thinking, rlhf | self-reported |
| SWE-bench Multimodal | other | scored | 34.5 | - | self-reported |
| SWE-bench Multimodal | other | scored | 34.5 | extended-thinking, rlhf | self-reported |
| DNA Synthesis Screening Evasion | other | scored | 8.0 | - | self-reported |
| BBQ (Ambiguous Bias) | other | scored | 1.4 | - | self-reported |
| Long-Form Virology Task 2 | other | scored | 0.9 | - | self-reported |
| Long-form Virology Task 1 | other | scored | 0.8 | - | self-reported |
| VCT (Multimodal Virology Evaluation) | other | scored | 0.6 | - | self-reported |
| VCT (Multimodal Virology Evaluation) | other | scored | 0.5 | - | self-reported |
| VCT (Multimodal Virology Evaluation) | other | scored | 0.5 | - | self-reported |
| BBQ (Ambiguous Bias) | other | scored | 0.1 | - | self-reported |
| BBQ (Ambiguous Bias) | other | scored | 0.0 | - | self-reported |
| BBQ (Ambiguous Bias) | other | scored | 0.0 | - | self-reported |
| BBQ (Disambiguated Bias) | other | scored | -0.7 | - | self-reported |
| BBQ (Disambiguated Bias) | other | scored | -0.7 | - | self-reported |
| BBQ (Disambiguated Bias) | other | scored | -1.6 | - | self-reported |
| BBQ (Disambiguated Bias) | other | scored | -1.7 | - | self-reported |
| Terminal-bench 2.0 | other | mentioned | — | - | self-reported |
| OpenAI MRCR v2 | other | mentioned | — | extended-thinking, rlhf | self-reported |
| Michelangelo | other | cited | — | - | self-reported |
| MathArena | other | cited | — | - | self-reported |
| Election Integrity Benchmark | other | mentioned | — | - | self-reported |
| Dyno Therapeutics Sequence Design Challenge | other | mentioned | — | - | self-reported |
| GPQA-Diamond | reasoning | scored | 94.2 | extended-thinking, rlhf | self-reported |
| GPQA-Diamond | reasoning | scored | 94.2 | - | self-reported |
| GPQA-Diamond | reasoning | scored | 91.3 | - | self-reported |
| ARC-AGI-2 | reasoning | scored | 75.8 | - | self-reported |
| ARC-AGI-2 | reasoning | scored | 68.8 | - | self-reported |