Gemini 2.5 Pro Model Card

Model card · 7,198 words · 31 min read · Mar 31, 2026
Summary

A 588-word brief of a 7,198-word document. Published by Google DeepMind. Version dated Mar 31, 2026.
01

What this is

Gemini 2.5 Pro is a natively multimodal reasoning model from Google DeepMind, described as the next iteration in the Gemini 2.0 series and Google's most advanced model for complex tasks. The model card was last updated June 27, 2025, reflecting the model's transition to general availability (GA). It supersedes Gemini 2.5 Pro Experimental (03-25) and Gemini 2.5 Pro Preview (05-06), which are also documented in the same card.

02

Capabilities

The model accepts text, image, audio, and video inputs within a 1M-token context window and produces up to 64K tokens of text output. It scores 21.6% on Humanity's Last Exam, 86.4% on GPQA Diamond (pass@1), 88.0% on AIME 2025 (pass@1), and 82.2% on Aider Polyglot (diff-fenced edit format). On SWE-bench Verified it scores 59.6% with a single attempt and 67.2% with multiple attempts. The architecture is a sparse mixture-of-experts transformer, which decouples total model capacity from per-token computation cost.
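
To make the capacity-versus-compute point concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration of sparse mixture-of-experts, not Gemini's actual architecture: the expert count, the value of k, the toy gate, and every function name are assumptions chosen for this example.

```python
import math

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, gate, k):
    """Apply a sparse MoE layer to one token vector; only k experts are evaluated."""
    routes = top_k_route(gate(x), k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        expert_out = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, routes

def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert network: multiplies the input by a constant."""
    return lambda x: [scale * v for v in x]

NUM_EXPERTS = 8  # assumed for illustration; the card gives no expert count
TOP_K = 2        # assumed for illustration
experts = [make_scaling_expert(float(i + 1)) for i in range(NUM_EXPERTS)]

def toy_gate(x):
    """Hypothetical gate: the logit for expert i grows with i times the first feature."""
    return [i * x[0] for i in range(NUM_EXPERTS)]

output, routes = moe_layer([1.0, 2.0], experts, toy_gate, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS
print(f"experts used: {[i for i, _ in routes]}, active fraction: {active_fraction:.2f}")
```

Adding more experts raises total capacity, while the work done per token stays tied to k, which is the decoupling the card describes.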

03

Evaluation methodology

Gemini results were run via the AI Studio API with default sampling settings, and multiple trials were averaged for smaller benchmarks to reduce variance. Non-Gemini results were sourced from providers' self-reported numbers unless otherwise noted; Vibe-Eval used Gemini as judge. The MRCR v2 methodology changed to a harder 8-needle evaluation, so those results are not directly comparable to previously published figures. The card does not explicitly describe benchmark contamination controls.
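
The effect of averaging trials on a small benchmark can be sketched as follows. This is a simulation under stated assumptions, not the card's evaluation harness: the 30-question size, the 0.85 underlying accuracy, the 8 trials, and the helper names are all made up for illustration.

```python
import math
import random
import statistics

def run_trial(num_questions, true_accuracy, rng):
    """Simulate one benchmark run and return its score as a percentage."""
    correct = sum(rng.random() < true_accuracy for _ in range(num_questions))
    return 100.0 * correct / num_questions

def summarize(scores):
    """Return (mean, standard error of the mean) for a list of trial scores."""
    mean = statistics.mean(scores)
    if len(scores) < 2:
        return mean, float("nan")  # a single trial gives no spread estimate
    return mean, statistics.stdev(scores) / math.sqrt(len(scores))

rng = random.Random(0)  # fixed seed so the simulation is repeatable
trial_scores = [run_trial(30, 0.85, rng) for _ in range(8)]
mean_score, std_error = summarize(trial_scores)
print(f"per-trial scores: {[round(s, 1) for s in trial_scores]}")
print(f"averaged score: {mean_score:.1f} +/- {std_error:.1f}")
```

On a benchmark this small a single question moves the score by more than three points, so a single run is noisy; the standard error of the mean shrinks roughly with the square root of the number of trials.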

04

Safety testing

Google DeepMind conducted human red teaming by specialist teams, automated red teaming at scale, continuous training-phase evaluations, and independent assurance evaluations held out from the model team to prevent overfitting. The internal Responsibility and Safety Council reviewed capability assessments and made release decisions. Frontier Safety Framework (FSF) evaluations covered four domains (CBRN, cybersecurity, ML R&D, and deceptive alignment), and no Critical Capability Level (CCL) was reached in any domain for any version. The Cyber Uplift Level 1 alert threshold was reached in earlier versions, triggering a response plan with more frequent testing and accelerated mitigations; the GA version did not reach this threshold. Deceptive Alignment evaluations for Gemini 2.5 Pro GA are stated as "not yet completed."

05

Mitigations

Safety mitigations span the full training and deployment lifecycle: dataset filtering, conditional pre-training, supervised fine-tuning, reinforcement learning from human and critic feedback, safety policies and desiderata, and product-level safety filtering. In response to the Cyber Uplift Level 1 alert threshold being reached in earlier versions, an active response plan involving higher-frequency testing and accelerated mitigations is in place.

06

Deployment and access

Gemini 2.5 Pro reached general availability as of June 2025 and is accessible via the AI Studio API. Three versioned model IDs are documented: gemini-2.5-pro-preview-06-05 (GA), gemini-2.5-pro-preview-05-06, and gemini-2.5-pro-exp-03-25. The card does not disclose specific license terms, pricing tiers, or detailed access restrictions.
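
As a rough illustration of API access, the sketch below builds (but does not send) a text-generation request for one of the model IDs listed above. The endpoint path, the `x-goog-api-key` header, and the request body shape are assumptions based on the public Gemini REST API rather than anything stated in the card, and the helper name is hypothetical. Whether a given versioned model ID is still served is not something the card confirms.

```python
import json
import urllib.request

# Assumed base URL of the public Gemini REST API; not taken from the model card.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta"

def build_generate_request(model_id, prompt, api_key):
    """Build a POST request for a single-turn text prompt (hypothetical helper)."""
    url = f"{BASE_URL}/models/{model_id}:generateContent"
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )

request = build_generate_request(
    "gemini-2.5-pro-preview-06-05",  # model ID as listed in the card
    "Summarize the limitations section of a model card.",
    "YOUR_API_KEY",  # placeholder; supply a real key before sending
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then parse the JSON response body.
```

Keeping request construction separate from sending makes it easy to inspect the payload or swap the model ID without making a network call.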

07

Limitations

The lab flags hallucinations, weak causal understanding, complex logical deduction failures, and limited counterfactual reasoning as known limitations of the model. The knowledge cutoff is January 2025. Known safety limitations include over-refusals and responses that can still come across as "preachy," though tone and instruction following have improved over Gemini 1.5. A slight increase in image-to-text safety violation rates (+1.8% versus Gemini 1.5 Pro 002) was observed for the GA version; manual review found losses "overwhelmingly either false positives or not egregious."

08

What's new

The June 27, 2025 update added GA evaluation results alongside existing Experimental (03-25) and Preview (05-06) data, and changed the model's deployment status to general availability. The benchmark comparison table was expanded to include Claude 4 Opus, Claude 4 Sonnet, and DeepSeek R1 05-28. The MRCR v2 benchmark was updated to a harder 8-needle variant, breaking comparability with prior results. No new CCL alert thresholds were reached in GA evaluations, though Deceptive Alignment testing for the GA version remains incomplete.

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA b45372cc97c4).

Extracted Evaluations (31 results)

| Benchmark | Category | State | Score | Variant | Source |
|---|---|---|---|---|---|
| SWE-bench | coding | scored | 67.2 | multiple attempts | self-reported |
| SWE-bench | coding | scored | 59.6 | single attempt | self-reported |
| MMLU | general_knowledge | scored | 89.2 | - | self-reported |
| AIME | math | scored | 88.0 | single attempt (pass@1) | self-reported |
| MMMU | multimodal | scored | 83.6 | - | self-reported |
| MMMU | multimodal | scored | 82.0 | single attempt (pass@1) | self-reported |
| MMMU | multimodal | scored | 78.0 | multiple attempts | self-reported |
| MRCR v2 | other | scored | 93.0 | 128k (average) | self-reported |
| FACTS Grounding | other | scored | 87.8 | - | self-reported |
| Video-MME | other | scored | 86.9 | audio, visual, subtitles | self-reported |
| MRCR v2 | other | scored | 82.9 | 1M (pointwise) | self-reported |
| Aider Polyglot | other | scored | 82.2 | diff-fenced | self-reported |
| LiveCodeBench (10/1/2024-2/1/2025) | other | scored | 79.4 | multiple attempts | self-reported |
| Aider Polyglot | other | scored | 76.5 | whole | self-reported |
| LiveCodeBench (10/1/2024-2/1/2025) | other | scored | 75.6 | single attempt (pass@1) | self-reported |
| Aider Polyglot | other | scored | 72.7 | diff | self-reported |
| Aider Polyglot | other | scored | 71.6 | - | self-reported |
| LiveCodeBench (1/1/2025-5/1/2025) | other | scored | 69.0 | single attempt | self-reported |
| Vibe-Eval (Reka) | other | scored | 67.2 | - | self-reported |
| SimpleQA | other | scored | 54.0 | - | self-reported |
| Humanity's Last Exam | other | scored | 21.6 | - | self-reported |
| Tone | other | scored | 18.4 | relative to Gemini 1.5 Pro 002 | self-reported |
| MRCR v2 | other | scored | 16.1 | 128k (average), no thinking | self-reported |
| Instruction Following (Safety) | other | scored | 14.8 | relative to Gemini 1.5 Pro 002 | self-reported |
| Humanity's Last Exam | other | scored | 14.0 | text-only | self-reported |
| Image to Text Safety | other | scored | 1.8 | relative to Gemini 1.5 Pro 002, non-egregious | self-reported |
| RE-Bench | other | scored | 0.7 | average normalised score | self-reported |
| Text to Text Safety | other | scored | -0.9 | relative to Gemini 1.5 Pro 002 | self-reported |
| Image to Text Safety | other | scored | -2.8 | relative to Gemini 1.5 Pro 002 | self-reported |
| Multilingual Safety | other | scored | -3.5 | relative to Gemini 1.5 Pro 002 | self-reported |
| GPQA-Diamond | reasoning | scored | 86.4 | single attempt (pass@1) | self-reported |