“GPT-5.5 is a new model designed for complex, real-world work, including writing code, researching online, analyzing information, creating documents and spreadsheets, and moving across tools to get things done. Relative to earlier models, GPT-5.5 understands the task earlier, asks for less guidance, uses tools more effectively, checks it work and keeps going until it’s done.”
| Benchmark | Variant | Score |
|---|---|---|
| Apollo Sandbagging QA | — | 100.0% |
| UK AISI Lower-Difficulty Cyber Tasks | — | 100.0% |
| Atomic Challenges (Irregular) | — | 100.0% |
| Apollo Strategic Deception Capability Sandbagging | — | 99.6% |
| UK AISI Expert Narrow Cyber Tasks | — | 90.5% |
| UK AISI Expert Narrow Cyber Tasks | — | 71.4% |
| Evasion Challenges (Irregular) | — | 54.0% |
| Biochemistry Knowledge | — | 39.3% |
Showing top 8 of 59. See full list below.
- “We are releasing GPT-5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving legitimate, beneficial uses of advanced capabilities.”
- “we have deployed an expanded set of safeguards to restrict the ability of malicious actors to benefit from increased capabilities in cybersecurity performance (section link to Cyber Safeguards section).”
- “we trained GPT-5.5 to refuse requests that clearly enable unauthorized, destructive, or harmful actions, including areas such as malware deployment, credential theft, and exfiltration.”
- “available on the internet, information that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate.”
Every italicized passage is a verbatim substring of the source document (checked deterministically after extraction). Field selection is heuristic — some quotes may lack surrounding context and some claims may be absent if no matching pattern appeared. For citation, open the source: original model card · source SHA d047a83321e0 · version dated May 3, 2026.
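The deterministic check mentioned above can be sketched as a plain substring test. This is a minimal illustration, not the page's actual pipeline: the function names `is_verbatim` and `check_quotes` are hypothetical, the `source_text` here is a short stand-in for the full model card, and the sketch assumes exact character-for-character matching with no whitespace or quote normalization.

```python
def is_verbatim(quote: str, source_text: str) -> bool:
    """Return True if the quote occurs in the source exactly as written."""
    return quote in source_text


def check_quotes(quotes, source_text):
    """Map each extracted quote to whether it is a verbatim substring."""
    return {quote: is_verbatim(quote, source_text) for quote in quotes}


# Stand-in for the full source document (assumption: the real check
# runs against the complete model card text).
source_text = (
    "We are releasing GPT-5.5 with our strongest set of safeguards to date, "
    "designed to reduce misuse while preserving legitimate, beneficial uses "
    "of advanced capabilities."
)

quotes = [
    "our strongest set of safeguards to date",  # exact substring
    "our strongest safeguards to date",         # paraphrase, not verbatim
]

results = check_quotes(quotes, source_text)
for quote, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}: {quote}")
```

A check of this kind confirms only that the wording was not altered during extraction. It says nothing about whether a quote is representative of its surrounding context.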
Extracted Evaluations (59 results)
| Benchmark | Category | State | Score | Variant | Source |
|---|---|---|---|---|---|
| BFCL | coding | cited | — | - | self-reported |
| SWE-bench Verified | coding | cited | — | - | self-reported |
| MMLU-Pro | general_knowledge | cited | — | - | self-reported |
| HealthBench | medical | mentioned | — | - | self-reported |
| Apollo Sandbagging QA | other | scored | 100.0 | - | self-reported |
| UK AISI Lower-Difficulty Cyber Tasks | other | scored | 100.0 | - | self-reported |
| Atomic Challenges (Irregular) | other | scored | 100.0 | - | self-reported |
| Apollo Strategic Deception Capability Sandbagging | other | scored | 99.6 | - | self-reported |
| UK AISI Expert Narrow Cyber Tasks | other | scored | 90.5 | - | self-reported |
| UK AISI Expert Narrow Cyber Tasks | other | scored | 71.4 | - | self-reported |
| Evasion Challenges (Irregular) | other | scored | 54.0 | - | self-reported |
| Biochemistry Knowledge | other | scored | 39.3 | - | self-reported |
| Biochemistry Knowledge | other | scored | 32.3 | - | self-reported |
| Biochemistry Knowledge | other | scored | 31.0 | - | self-reported |
| Apollo Impossible Coding Task | other | scored | 29.0 | - | self-reported |
| CyScenarioBench | other | scored | 26.0 | - | self-reported |
| Apollo Verbalized Alignment Evaluation Awareness | other | scored | 22.1 | - | self-reported |
| Apollo Verbalized Alignment Evaluation Awareness | other | scored | 17.3 | - | self-reported |
| DNA Sequence Design for Transcription Factor Binding | other | scored | 16.5 | - | self-reported |
| DNA Sequence Design for Transcription Factor Binding | other | scored | 13.8 | - | self-reported |
| DNA Sequence Design for Transcription Factor Binding | other | scored | 12.8 | - | self-reported |
| Apollo Verbalized Alignment Evaluation Awareness | other | scored | 11.7 | - | self-reported |
| Apollo Impossible Coding Task | other | scored | 10.0 | - | self-reported |
| CyScenarioBench | other | scored | 9.0 | - | self-reported |
| Apollo Impossible Coding Task | other | scored | 7.0 | - | self-reported |
| ML Training Bug Diagnosis | other | scored | 5.8 | - | self-reported |
| Hard Negative Protein Binding Prediction | other | scored | 3.5 | - | self-reported |
| Prompt Injection Attacks in Connectors | other | scored | 1.0 | - | self-reported |
| Cyber Safety Training - Synthetic Data | other | scored | 1.0 | - | self-reported |
| Cyber Safety Training - Synthetic Data | other | scored | 1.0 | - | self-reported |
| Prompt Injection Attacks in Connectors | other | scored | 1.0 | - | self-reported |
| Cyber Safety Training - Synthetic Data | other | scored | 1.0 | - | self-reported |
| Cyber Safety Training - Production Data | other | scored | 1.0 | - | self-reported |
| Prompt Injection Attacks in Connectors | other | scored | 1.0 | - | self-reported |
| Cyber Safety Training - Production Data | other | scored | 1.0 | - | self-reported |
| Cyber Safety Training - Production Data | other | scored | 0.9 | - | self-reported |
| Apollo Sabotage Tasks | other | scored | 0.7 | - | self-reported |
| Prompt Injection Attacks in Connectors | other | scored | 0.6 | - | self-reported |
| CoT-Control | other | scored | 0.5 | - | self-reported |
| Hard Negative Protein Binding Prediction | other | scored | 0.4 | - | self-reported |
| CoT-Control | other | scored | 0.3 | - | self-reported |
| CoT-Control | other | scored | 0.2 | - | self-reported |
| First Person Fairness Evaluation | other | scored | 0.0 | - | self-reported |
| First Person Fairness Evaluation | other | scored | 0.0 | - | self-reported |
| First Person Fairness Evaluation | other | scored | 0.0 | - | self-reported |
| First Person Fairness Evaluation | other | scored | 0.0 | - | self-reported |
| Hard Negative Protein Binding Prediction | other | scored | 0.0 | - | self-reported |
| Nucleobench | other | cited | — | - | self-reported |
| StrongReject | other | mentioned | — | - | self-reported |
| HLE | other | cited | — | - | self-reported |
| TroubleshootingBench | other | mentioned | — | - | self-reported |
| ProtocolQA Open-Ended | other | cited | — | - | self-reported |
| Vulnerability Discovery Benchmark (CAISI) | other | mentioned | — | - | self-reported |
| CTF Challenges (CAISI) | other | mentioned | — | - | self-reported |
| HealthBench Professional | other | mentioned | — | - | self-reported |
| Multiturn Jailbreak Evaluation | other | mentioned | — | - | self-reported |
| ABLE | other | mentioned | — | - | self-reported |
| ABC-Bench | other | mentioned | — | - | self-reported |
| GPQA | reasoning | cited | — | - | self-reported |
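
The summary table near the top of the page matches the eight highest-scoring rows of this full list. One way such a top-N view could be derived is sketched below. The `EvalRow` type and `top_scored` function are hypothetical names, the sample rows are a small subset copied from the table above, and the selection rule (keep only `scored` rows, sort by score descending) is an assumption inferred from the page rather than a documented method.

```python
from typing import List, NamedTuple, Optional


class EvalRow(NamedTuple):
    benchmark: str
    state: str              # "scored", "cited", or "mentioned"
    score: Optional[float]  # None when the table shows no score


def top_scored(rows: List[EvalRow], n: int) -> List[EvalRow]:
    """Return the n highest-scoring rows, skipping rows without a score."""
    scored = [r for r in rows if r.state == "scored" and r.score is not None]
    # sorted() is stable, so rows with equal scores keep their input order.
    return sorted(scored, key=lambda r: r.score, reverse=True)[:n]


# A small subset of the rows listed in the table above.
rows = [
    EvalRow("SWE-bench Verified", "cited", None),
    EvalRow("Biochemistry Knowledge", "scored", 39.3),
    EvalRow("Apollo Sandbagging QA", "scored", 100.0),
    EvalRow("CoT-Control", "scored", 0.5),
    EvalRow("UK AISI Expert Narrow Cyber Tasks", "scored", 90.5),
]

top = top_scored(rows, 2)
for row in top:
    print(f"{row.benchmark}: {row.score}")
```

The sketch compares raw scores as they appear in the table and makes no attempt to normalize metrics that may use different scales.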