Gemini 2.5 Deep Think Card

Model card · 6,329 words · 28 min read · Mar 31, 2026
Summary

A 629-word brief of a 6,329-word document. Published by Google DeepMind. Version dated Mar 31, 2026.
01

What this is

Gemini 2.5 Deep Think is an enhanced reasoning model from Google DeepMind, published August 1, 2025, positioned within the Gemini 2.5 family alongside Gemini 2.5 Pro. It uses parallel thinking and reinforcement learning to test multiple hypotheses simultaneously, targeting problems requiring creativity, strategic planning, and iterative improvement. The underlying architecture is a sparse mixture-of-experts transformer with native multimodal support for text, vision, and audio.

02

Capabilities

The model accepts text, image, audio, and video inputs with a 1M-token context window and produces up to 192K tokens of text output. As of July 2025, it scores 34.8% on Humanity's Last Exam (no tools), 60.7% on IMO 2025 (Bronze medal grade, pass@1), 99.2% on AIME 2025, and 87.6% on LiveCodeBench v6, outperforming Gemini 2.5 Pro, OpenAI o3, and Grok 4 on all four benchmarks. Intended use cases include iterative development, scientific and mathematical discovery, and algorithmic development.

03

Evaluation methodology

Benchmarks compare models without tool calls enabled; Gemini scores are sampled from the Gemini App and averaged over multiple trials for smaller benchmarks. IMO 2025 results use pass@1; other matharena.ai results are best of 32. Results for Humanity's Last Exam, LiveCodeBench, and IMO 2025 are sourced from external leaderboards. Safety evaluations combined automated content scoring, specialist human red teaming, automated red teaming at scale, independent assurance evaluations, and review by Google DeepMind's Responsibility and Safety Council (RSC).
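The difference between pass@1 and best-of-n scoring can be illustrated with a minimal sketch. It assumes one common reading of the two terms: pass@1 as the expected success rate of a single sampled attempt, and best of n as credit for a problem if any of the n attempts succeeds. The function names and the outcome data are hypothetical and are not taken from the card or from matharena.ai.

```python
from typing import Sequence


def pass_at_1(attempts: Sequence[Sequence[bool]]) -> float:
    """Mean per-problem success rate of one sampled attempt (assumed definition)."""
    return sum(sum(a) / len(a) for a in attempts) / len(attempts)


def best_of_n(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved by at least one attempt (assumed definition)."""
    return sum(any(a) for a in attempts) / len(attempts)


# Made-up outcomes: three problems, four attempts each (illustrative only).
outcomes = [
    [True, False, False, False],
    [True, True, True, True],
    [False, False, False, False],
]

print(f"pass@1:    {pass_at_1(outcomes):.3f}")
print(f"best of 4: {best_of_n(outcomes):.3f}")
```

Under these assumptions best-of-n can never be lower than pass@1 on the same attempts, which is why the two kinds of result are not directly comparable.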

04

Safety testing

Google DeepMind ran full Frontier Safety Framework (FSF) evaluations across four risk domains: CBRN (chemical, biological, radiological, and nuclear), cybersecurity, ML R&D, and deceptive alignment. For CBRN Uplift Level 1, the assessment is that the model "has enough technical knowledge in certain CBRN scenarios and stages to be considered at early alert threshold"; further study is required to determine whether the Critical Capability Level (CCL) is formally reached, and proactive mitigations are in place. For Cyber Uplift Level 1, the early warning alert threshold "was originally reached by Gemini 2.5 Pro and continues to be met by Gemini 2.5 Deep Think," though neither Cyber CCL was reached. ML R&D CCLs were not reached (RE-Bench normalized score 0.96), and Deceptive Alignment CCLs were not reached (3/11 situational awareness, 1/4 stealth challenges solved).
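The deceptive alignment counts above correspond to the percentages listed under Extracted Evaluations (27.3 and 25.0). A small sketch of that arithmetic follows; the helper name is hypothetical, and rounding to one decimal place is an assumption about how the listed scores were produced.

```python
def solved_percent(solved: int, total: int) -> float:
    """Convert a solved/total challenge count into a percentage, rounded to 0.1."""
    return round(100 * solved / total, 1)


situational_awareness = solved_percent(3, 11)  # 3 of 11 challenges solved
stealth = solved_percent(1, 4)                 # 1 of 4 challenges solved

print(situational_awareness, stealth)
```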

05

Mitigations

Training-time mitigations include dataset filtering, conditional pre-training, supervised fine-tuning, reinforcement learning from human and critic feedback, and safety policies. For CBRN risk specifically, deployed mitigations include model-level and system-level interventions to block dangerous responses, a multi-tier offline usage monitoring system with automated flagging followed by human review from CBRN subject matter experts, account enforcement for confirmed misuse, and ongoing red teaming of the mitigation suite against jailbreaks. Model security mitigations are aligned with RAND SL2, the level required for CBRN Uplift Level 1 per the FSF. An internal safety case covering these mitigations was reviewed by the RSC before launch.

06

Deployment and access

Gemini 2.5 Deep Think is accessible via the Gemini App. Product-level safety filtering is applied at the serving layer. The card does not disclose a public API tier, pricing, or formal license terms.

07

Limitations

The model may exhibit hallucinations common to foundation models and can experience occasional slowness or timeout issues. Its knowledge cutoff is January 2025. The primary content safety limitation is over-refusal: an automated instruction-following evaluation shows a -9.9% regression versus Gemini 2.5 Pro, meaning the model sometimes declines benign requests it should fulfill. CBRN capability assessments are not yet final, and external safety testing for deceptive alignment is described as still ongoing.

08

What's new

Gemini 2.5 Deep Think is a new variant within the Gemini 2.5 family, differentiated from Gemini 2.5 Pro by the addition of parallel thinking and novel reinforcement learning techniques enabling multi-step reasoning, problem-solving, and theorem-proving on curated high-quality mathematics data. The card does not reference a prior version of Deep Think; Gemini 2.5 Pro serves as the primary comparison baseline. Full FSF evaluations were triggered by what the card describes as "exceptional differences" in capability between this model and previously evaluated Gemini 2.5 Pro models.

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 8511d71d4c80).

Extracted Evaluations (19 results)

Benchmark | Category | State | Score | Variant | Source
AIME | math | scored | 88.0 | best of 32 | self-reported
Autonomous Cyber Offense Suite | other | scored | 100.0 | medium | self-reported
Autonomous Cyber Offense Suite | other | scored | 96.0 | easy | self-reported
Cyber Key Skills Benchmark | other | scored | 75.0 | easy | self-reported
LiveCodeBench v6 | other | scored | 74.2 | UI: 1/1/2025-5/1/2025 | self-reported
Cyber Key Skills Benchmark | other | scored | 60.7 | medium | self-reported
IMO 2025 | other | scored | 60.7 | pass@1 | self-reported
Cyber Key Skills Benchmark | other | scored | 33.3 | hard | self-reported
IMO 2025 | other | scored | 31.6 | - | self-reported
Deceptive Alignment - Situational Awareness | other | scored | 27.3 | - | self-reported
Deceptive Alignment - Stealth | other | scored | 25.0 | - | self-reported
Autonomous Cyber Offense Suite | other | scored | 23.1 | hard | self-reported
Humanity's Last Exam | other | scored | 21.6 | no tools | self-reported
Tone | other | scored | 16.3 | relative vs Gemini 2.5 Pro | self-reported
Image to Text Safety | other | scored | 2.1 | non egregious, relative vs Gemini 2.5 Pro | self-reported
RE-Bench | other | scored | 1.0 | 32-hour budget | self-reported
Multilingual Safety | other | scored | -1.0 | relative vs Gemini 2.5 Pro | self-reported
Instruction Following (Safety) | other | scored | -9.9 | relative vs Gemini 2.5 Pro | self-reported
Text to Text Safety | other | scored | -16.3 | relative vs Gemini 2.5 Pro | self-reported