Model Card Explorer

Summary

Gemini 2.5 Pro Model Card

A 588-word brief of a 7,198-word document. Published by Google DeepMind. Version dated Mar 31, 2026.

What this is

Gemini 2.5 Pro is a natively multimodal reasoning model from Google DeepMind, described as the next iteration in the Gemini 2.0 series and Google's most advanced model for complex tasks. The model card was last updated June 27, 2025, reflecting the model's transition to general availability (GA). It supersedes Gemini 2.5 Pro Experimental (03-25) and Gemini 2.5 Pro Preview (05-06), which are also documented in the same card.

Capabilities

The model accepts text, image, audio, and video inputs with a 1M-token context window and produces up to 64K tokens of text output. Reported scores are 21.6% on Humanity's Last Exam, 86.4% on GPQA Diamond (pass@1), 88.0% on AIME 2025 (pass@1), and 82.2% on Aider Polyglot (diff-fenced edit format). On SWE-bench Verified it scores 59.6% with a single attempt and 67.2% with multiple attempts. The architecture is a sparse mixture-of-experts transformer: each token is routed to only a subset of experts, which decouples total model capacity from per-token computation cost.
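To make the capacity-versus-compute point concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy illustration of the general sparse mixture-of-experts idea, not Gemini's actual architecture: the expert count, dimensions, router weights, and the names `moe_layer` and `make_expert` are all assumptions made up for this example.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts.

    Returns the gated mix of expert outputs and the indices of the
    experts that actually ran for this token.
    """
    # Router: one logit per expert (dot product of the token with a router row).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep only the top_k highest-scoring experts; the rest are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)
        output = [o + gate * yi for o, yi in zip(output, y)]
    return output, chosen

def make_expert(scale):
    """Hypothetical stand-in for an expert network: scales its input."""
    return lambda x: [scale * xi for xi in x]

random.seed(0)
NUM_EXPERTS, DIM, TOP_K = 8, 4, 2  # illustrative sizes only
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_layer(token, router_weights, experts, top_k=TOP_K)
print(f"experts run for this token: {len(chosen)} of {NUM_EXPERTS}")
```

Adding more experts increases the total parameter count, but each token still pays for only `top_k` expert evaluations, which is the decoupling described above.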

Evaluation methodology

Gemini evaluations were run via the AI Studio API with default sampling settings, and results from multiple trials were averaged on smaller benchmarks to reduce variance. Non-Gemini results were sourced from providers' self-reported numbers unless otherwise noted; Vibe-Eval used Gemini as judge. The MRCR v2 methodology changed to a harder 8-needle evaluation going forward, so those results are not directly comparable to previously published figures. Benchmark contamination controls are not explicitly described in the card.
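As a rough illustration of why averaging trials helps on small benchmarks, the sketch below reports the mean of several runs together with its standard error. The trial scores are invented for the example and are not taken from the card, and `summarize_trials` is a hypothetical helper; the card does not say how many trials were run or how they were aggregated beyond averaging.

```python
from statistics import mean, stdev

def summarize_trials(trial_scores):
    """Return (mean, standard error of the mean) for repeated benchmark runs."""
    avg = mean(trial_scores)
    if len(trial_scores) > 1:
        # Standard error shrinks with the square root of the number of trials.
        sem = stdev(trial_scores) / len(trial_scores) ** 0.5
    else:
        sem = 0.0  # a single run gives no spread estimate
    return avg, sem

# Made-up scores from three hypothetical runs of the same benchmark.
trials = [70.0, 72.0, 74.0]
avg, sem = summarize_trials(trials)
print(f"mean = {avg:.1f}, standard error = {sem:.2f}")
```

With more trials, the standard error of the reported mean falls, which is the variance reduction the methodology refers to.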

Safety testing

Google DeepMind conducted human red teaming by specialist teams, automated red teaming at scale, continuous training-phase evaluations, and independent assurance evaluations held out from the model team to prevent overfitting. The internal Responsibility and Safety Council reviewed capability assessments and made release decisions. FSF evaluations covered four domains—CBRN, cybersecurity, ML R&D, and deceptive alignment—and no Critical Capability Levels (CCLs) were reached across any domain for any version. The Cyber Uplift Level 1 alert threshold was reached in earlier versions, triggering a response plan with more frequent testing and accelerated mitigations; the GA version did not reach this threshold. Deceptive Alignment evaluations for Gemini 2.5 Pro GA are stated as "not yet completed."

Mitigations

Safety mitigations span the full training and deployment lifecycle: dataset filtering, conditional pre-training, supervised fine-tuning, reinforcement learning from human and critic feedback, safety policies and desiderata, and product-level safety filtering. In response to the Cyber Uplift Level 1 alert threshold being reached in earlier versions, an active response plan involving higher-frequency testing and accelerated mitigations is in place.

Deployment and access

Gemini 2.5 Pro reached general availability as of June 2025 and is accessible via the AI Studio API. Three versioned model IDs are documented: gemini-2.5-pro-preview-06-05 (GA), gemini-2.5-pro-preview-05-06, and gemini-2.5-pro-exp-03-25. The card does not disclose specific license terms, pricing tiers, or detailed access restrictions.
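For readers who pin model versions in code, a small lookup like the following keeps the three documented IDs in one place. The helper name `version_label` is hypothetical. Only the first ID is explicitly labelled (as GA) in this summary; the labels for the other two are inferred from the matching date suffixes, which is an assumption. No API client call is shown, since the card does not specify one.

```python
# Versioned model IDs listed in the summary. The label for the first ID is
# stated in the text; the other two are inferred from the date suffixes.
MODEL_VERSIONS = {
    "gemini-2.5-pro-preview-06-05": "GA",
    "gemini-2.5-pro-preview-05-06": "Preview (05-06)",
    "gemini-2.5-pro-exp-03-25": "Experimental (03-25)",
}

def version_label(model_id: str) -> str:
    """Return the documented label for a model ID, or raise ValueError."""
    try:
        return MODEL_VERSIONS[model_id]
    except KeyError:
        raise ValueError(f"unknown model ID: {model_id!r}") from None

print(version_label("gemini-2.5-pro-preview-06-05"))
```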

Limitations

The lab flags hallucinations, weak causal understanding, failures in complex logical deduction, and limited counterfactual reasoning as known limitations of the model. The knowledge cutoff is January 2025. Known safety limitations include over-refusals and responses that can still come across as "preachy," though tone and instruction following have improved over Gemini 1.5. A slight increase in image-to-text safety violation rates (+1.8% versus Gemini 1.5 Pro 002) was observed for the GA version; manual review found these losses to be "overwhelmingly either false positives or not egregious."

What's new

The June 27, 2025 update added GA evaluation results alongside existing Experimental (03-25) and Preview (05-06) data, and changed the model's deployment status to general availability. The benchmark comparison table was expanded to include Claude 4 Opus, Claude 4 Sonnet, and DeepSeek R1 05-28. The MRCR v2 benchmark was updated to a harder 8-needle variant, breaking comparability with prior results. No new CCL alert thresholds were reached in GA evaluations, though Deceptive Alignment testing for the GA version remains incomplete.

Category	State	Score	Setup	Source
coding	scored	67.2%	missing: shot count, method, language, training state	self-reported
coding	scored	59.6%	missing: shot count, method, language, training state	self-reported
knowledge	scored	89.2%	missing: shot count, method, language, training state	self-reported
math	scored	88.0% pass@1	missing: shot count, method, language, training state	self-reported
other	scored	87.8	missing: shot count, method, language, training state	self-reported
other	scored	86.9	missing: shot count, method, language, training state	self-reported
other	scored	82.2	missing: shot count, method, language, training state	self-reported
other	scored	69.0	missing: shot count, method, language, training state	self-reported
other	scored	67.2	missing: shot count, method, language, training state	self-reported
other	scored	54.0	missing: shot count, method, language, training state	self-reported
other	scored	21.6	missing: shot count, method, language, training state	self-reported
other	scored	18.4	missing: shot count, method, language, training state	self-reported
other	scored	14.8	missing: shot count, method, language, training state	self-reported
other	scored	1.8	missing: shot count, method, language, training state	self-reported
other	scored	-0.9	missing: shot count, method, language, training state	self-reported
other	scored	-3.5	missing: shot count, method, language, training state	self-reported
reasoning	scored	86.4 pass@1	missing: shot count, method, language, training state	self-reported
vision	scored	83.6%	missing: shot count, method, language, training state	self-reported
vision	scored	82.0% pass@1	missing: shot count, method, language, training state	self-reported

Extracted Evaluations (19 results)