Gemini 2.5 Deep Think Card
What this is
Gemini 2.5 Deep Think is an enhanced reasoning model from Google DeepMind, published August 1, 2025, positioned within the Gemini 2.5 family alongside Gemini 2.5 Pro. It uses parallel thinking and reinforcement learning to test multiple hypotheses simultaneously, targeting problems requiring creativity, strategic planning, and iterative improvement. The underlying architecture is a sparse mixture-of-experts transformer with native multimodal support for text, vision, and audio.
Capabilities
The model accepts text, image, audio, and video inputs with a 1M token context window and produces up to 192K tokens of text output. As of July 2025, it scores 34.8% on Humanity's Last Exam (no tools), 60.7% on IMO 2025 (Bronze medal grade, pass@1), 99.2% on AIME 2025, and 87.6% on LiveCodeBench v6, outperforming Gemini 2.5 Pro, OpenAI o3, and Grok 4 on all four benchmarks. Intended use cases include iterative development, scientific and mathematical discovery, and algorithmic development.
Evaluation methodology
Benchmarks compare models without tool calls enabled; Gemini scores are sampled from the Gemini App and averaged over multiple trials for smaller benchmarks. IMO 2025 results use pass@1; other matharena.ai results are best of 32. Results for Humanity's Last Exam, LiveCodeBench, and IMO 2025 are sourced from external leaderboards. Safety evaluations combined automated content scoring, specialist human red teaming, automated red teaming at scale, independent assurance evaluations, and review by Google DeepMind's Responsibility and Safety Council (RSC).
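To make the difference between pass@1 and a 32-sample metric concrete, here is a minimal sketch using the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The card does not say how "best of 32" is scored, so reading it as pass@32 (at least one correct answer among 32 samples) is an assumption, and the function names and sample counts below are hypothetical, not figures from the card.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (samples drawn, samples correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for three problems, 32 samples each (illustrative only).
hypothetical = [(32, 32), (32, 8), (32, 0)]
print(f"pass@1:  {benchmark_score(hypothetical, 1):.3f}")
print(f"pass@32: {benchmark_score(hypothetical, 32):.3f}")
```

On this made-up data the same samples give roughly 0.417 under pass@1 and 0.667 under pass@32, which is why scores reported under the two protocols should not be compared directly.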
Safety testing
Google DeepMind ran full Frontier Safety Framework (FSF) evaluations across four risk domains: chemical, biological, radiological, and nuclear (CBRN); cybersecurity; ML R&D; and deceptive alignment. For CBRN Uplift Level 1, the assessment is that the model "has enough technical knowledge in certain CBRN scenarios and stages to be considered at early alert threshold"; further study is required to determine whether the Critical Capability Level (CCL) is formally reached, and proactive mitigations are in place. For Cyber Uplift Level 1, the early warning alert threshold "was originally reached by Gemini 2.5 Pro and continues to be met by Gemini 2.5 Deep Think," though neither Cyber CCL was reached. ML R&D CCLs were not reached (RE-Bench normalized score 0.96), and Deceptive Alignment CCLs were not reached (3 of 11 situational awareness challenges and 1 of 4 stealth challenges solved).
Mitigations
Training-time mitigations include dataset filtering, conditional pre-training, supervised fine-tuning, reinforcement learning from human and critic feedback, and safety policies. For CBRN risk specifically, deployed mitigations include model-level and system-level interventions to block dangerous responses, a multi-tier offline usage monitoring system with automated flagging followed by human review from CBRN subject matter experts, account enforcement for confirmed misuse, and ongoing red teaming of the mitigation suite against jailbreaks. Model security mitigations are aligned with RAND SL2, the level required for CBRN Uplift Level 1 per the FSF. An internal safety case covering these mitigations was reviewed by the RSC before launch.
Deployment and access
Gemini 2.5 Deep Think is accessible via the Gemini App. Product-level safety filtering is applied at the serving layer. The card does not disclose a public API tier, pricing, or formal license terms.
Limitations
The model may exhibit hallucinations common to foundation models and can experience occasional slowness or timeout issues. Its knowledge cutoff is January 2025. The primary content safety limitation is over-refusal: an automated instruction-following evaluation shows a -9.9% regression versus Gemini 2.5 Pro, meaning the model sometimes declines benign requests it should fulfill. CBRN capability assessments are not yet final, and external safety testing for deceptive alignment is described as still ongoing.
What's new
Gemini 2.5 Deep Think is a new variant within the Gemini 2.5 family, differentiated from Gemini 2.5 Pro by the addition of parallel thinking and novel reinforcement learning techniques enabling multi-step reasoning, problem-solving, and theorem-proving on curated high-quality mathematics data. The card does not reference a prior version of Deep Think; Gemini 2.5 Pro serves as the primary comparison baseline. Full FSF evaluations were triggered by what the card describes as "exceptional differences" in capability between this model and previously evaluated Gemini 2.5 Pro models.