Model Card Explorer

Summary

Gemini 2.5 Deep Think Card

A 629-word brief of a 6,329-word document. Published by Google DeepMind. Version dated Mar 31, 2026.

What this is

Gemini 2.5 Deep Think is an enhanced reasoning model from Google DeepMind, published August 1, 2025, positioned within the Gemini 2.5 family alongside Gemini 2.5 Pro. It uses parallel thinking and reinforcement learning to test multiple hypotheses simultaneously, targeting problems requiring creativity, strategic planning, and iterative improvement. The underlying architecture is a sparse mixture-of-experts transformer with native multimodal support for text, vision, and audio.

Capabilities

The model accepts text, image, audio, and video inputs within a 1M-token context window and produces up to 192K tokens of text output. As of July 2025, it scores 34.8% on Humanity's Last Exam (no tools), 60.7% on IMO 2025 (bronze-medal grade, pass@1), 99.2% on AIME 2025, and 87.6% on LiveCodeBench v6, outperforming Gemini 2.5 Pro, OpenAI o3, and Grok 4 on all four benchmarks. Intended use cases include iterative development, scientific and mathematical discovery, and algorithmic development.

Evaluation methodology

Benchmarks compare models with tool calls disabled. Gemini scores are sampled from the Gemini App and, for smaller benchmarks, averaged over multiple trials. IMO 2025 results use pass@1; other matharena.ai results are best of 32. Results for Humanity's Last Exam, LiveCodeBench, and IMO 2025 are sourced from external leaderboards. Safety evaluations combined automated content scoring, specialist human red teaming, automated red teaming at scale, independent assurance evaluations, and review by Google DeepMind's Responsibility and Safety Council (RSC).
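The two scoring conventions named above can give very different numbers on the same attempts. The following is a minimal sketch, not the card's evaluation code: the attempt data is invented, the function names are hypothetical, and "best of n" is read here as "any of the first n attempts is correct", which is an assumption because this summary does not state how the best attempt is selected.

```python
# Hypothetical per-problem attempt outcomes (True = correct).
# Illustrative data only; not taken from the model card.
results = [
    [True, True, False, True],     # problem 1: 3 of 4 attempts correct
    [False, False, False, True],   # problem 2: 1 of 4 attempts correct
    [False, False, False, False],  # problem 3: never solved
]


def pass_at_1(results):
    """Estimate pass@1 as the mean single-attempt success rate across problems."""
    per_problem = [sum(attempts) / len(attempts) for attempts in results]
    return sum(per_problem) / len(per_problem)


def best_of_n(results, n):
    """Fraction of problems with at least one correct answer in the first n attempts.

    Assumption: "best of n" is treated as an any-correct criterion.
    """
    solved = [any(attempts[:n]) for attempts in results]
    return sum(solved) / len(solved)


print(f"pass@1    = {pass_at_1(results):.3f}")
print(f"best of 4 = {best_of_n(results, 4):.3f}")
```

Under this reading, best-of-n can only match or exceed pass@1 on the same attempts, so scores reported under the two conventions are not directly comparable.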

Safety testing

Google DeepMind ran full Frontier Safety Framework (FSF) evaluations across four risk domains: CBRN (chemical, biological, radiological, and nuclear), cybersecurity, ML R&D, and deceptive alignment. For CBRN Uplift Level 1, the assessment is that the model "has enough technical knowledge in certain CBRN scenarios and stages to be considered at early alert threshold"; further study is required to determine whether the Critical Capability Level (CCL) is formally reached, and proactive mitigations are in place. For Cyber Uplift Level 1, the early warning alert threshold "was originally reached by Gemini 2.5 Pro and continues to be met by Gemini 2.5 Deep Think," though neither Cyber CCL was reached. ML R&D CCLs were not reached (RE-Bench normalized score 0.96), and Deceptive Alignment CCLs were not reached (3 of 11 situational awareness challenges and 1 of 4 stealth challenges solved).
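The deceptive alignment challenge counts above can be converted to rates and compared with the extracted evaluations table, which lists both rows at 0.3. The sketch below assumes half-up rounding to one decimal place; the rounding rule is not stated in the source, and the helper name `one_decimal` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction


def one_decimal(rate: Fraction) -> str:
    """Round a solve rate to one decimal place using half-up rounding (assumed)."""
    value = Decimal(rate.numerator) / Decimal(rate.denominator)
    return str(value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


# Counts as reported in the summary text.
situational_awareness = Fraction(3, 11)  # about 0.273
stealth = Fraction(1, 4)                 # exactly 0.25

print(one_decimal(situational_awareness))
print(one_decimal(stealth))
```

Under that assumption both rates round to 0.3, so the one-decimal figures hide the difference between 3 of 11 and 1 of 4.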

Mitigations

Training-time mitigations include dataset filtering, conditional pre-training, supervised fine-tuning, reinforcement learning from human and critic feedback, and safety policies. For CBRN risk specifically, deployed mitigations include model-level and system-level interventions to block dangerous responses, a multi-tier offline usage monitoring system with automated flagging followed by human review from CBRN subject matter experts, account enforcement for confirmed misuse, and ongoing red teaming of the mitigation suite against jailbreaks. Model security mitigations are aligned with RAND SL2, the level required for CBRN Uplift Level 1 per the FSF. An internal safety case covering these mitigations was reviewed by the RSC before launch.

Deployment and access

Gemini 2.5 Deep Think is accessible via the Gemini App. Product-level safety filtering is applied at the serving layer. The card does not disclose a public API tier, pricing, or formal license terms.

Limitations

The model may exhibit hallucinations common to foundation models and can experience occasional slowness or timeout issues. Its knowledge cutoff is January 2025. The primary content safety limitation is over-refusal: an automated instruction-following evaluation shows a -9.9% regression versus Gemini 2.5 Pro, meaning the model sometimes declines benign requests it should fulfill. CBRN capability assessments are not yet final, and external safety testing for deceptive alignment is described as still ongoing.

What's new

Gemini 2.5 Deep Think is a new variant within the Gemini 2.5 family, differentiated from Gemini 2.5 Pro by the addition of parallel thinking and novel reinforcement learning techniques enabling multi-step reasoning, problem-solving, and theorem-proving on curated high-quality mathematics data. The card does not reference a prior version of Deep Think; Gemini 2.5 Pro serves as the primary comparison baseline. Full FSF evaluations were triggered by what the card describes as "exceptional differences" in capability between this model and previously evaluated Gemini 2.5 Pro models.

Extracted Evaluations (26 results)

Benchmark	Category	State	Score	Setup	Source
LiveCodeBench/ v6	coding	scored	87.6% accuracy	rlhf; missing: shot count; missing: method; missing: language	self-reported
AIME/ 2025	math	scored	99.2% accuracy	majority-voting; rlhf; missing: shot count; missing: language	self-reported
IMO/ 2025	other	scored	60.7% pass@1	rlhf; missing: shot count; missing: method; missing: language	self-reported
Humanity's Last Exam	other	scored	34.8% accuracy	no-tools; rlhf; missing: shot count; missing: language	self-reported
	other	scored	16.3	rlhf; missing: shot count; missing: method; missing: language	self-reported
	other	scored	2.1	rlhf; missing: shot count; missing: method; missing: language	self-reported
/ medium	other	scored	1.0 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
/ easy	other	scored	1.0 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
	other	scored	1.0 accuracy	rlhf; missing: shot count; missing: method; missing: language	self-reported
key skills benchmark/ easy	other	scored	0.8 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
key skills benchmark/ medium	other	scored	0.6 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
key skills benchmark/ hard	other	scored	0.3 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
Deceptive Alignment Evaluation/ situational_awareness	other	scored	0.3 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
Deceptive Alignment Evaluation/ stealth	other	scored	0.3 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
/ hard	other	scored	0.2 resolve rate	rlhf; missing: shot count; missing: method; missing: language	self-reported
	other	scored	-1.0	Average; rlhf; missing: shot count; missing: method	self-reported
Instruction Following	other	scored	-9.9	rlhf; missing: shot count; missing: method; missing: language	self-reported
	other	scored	-16.3	EN; rlhf; missing: shot count; missing: method	self-reported
InterCode-CTF	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
Lab-Bench/ cloning_scenarios	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
Lab-Bench/ seq_qa	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
WMDP/ biology	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
WMDP/ chemistry	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
Hack the Box	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
SecureBio VMQA	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
Lab-Bench/ protocol_qa	other	mentioned	—	rlhf; missing: shot count; missing: method; missing: language	self-reported
