Gemini 2.5 Pro Model Card
What this is
Gemini 2.5 Pro is a natively multimodal reasoning model from Google DeepMind, described as the next iteration in the Gemini 2.0 series and Google's most advanced model for complex tasks. The model card was last updated June 27, 2025, reflecting the model's transition to general availability (GA). It supersedes Gemini 2.5 Pro Experimental (03-25) and Gemini 2.5 Pro Preview (05-06), which are also documented in the same card.
Capabilities
The model accepts text, image, audio, and video inputs with a 1M-token context window and produces up to 64K tokens of text output. It scores 21.6% on Humanity's Last Exam, 86.4% on GPQA Diamond (pass@1), 88.0% on AIME 2025 (pass@1), and 82.2% on Aider Polyglot (diff-fenced edit format). On SWE-bench Verified it scores 59.6% with a single attempt and 67.2% with multiple attempts. The architecture is a sparse mixture-of-experts transformer, which decouples total model capacity from per-token computation cost because only a subset of experts is activated for each token.
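To illustrate how sparse routing separates capacity from per-token compute, here is a minimal sketch of top-k expert routing. It is not Gemini's implementation, which the card does not describe. The expert functions, the expert count, and the value of k are hypothetical, and the gate logits are passed in directly, whereas a real router would compute them from the token with learned weights.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, gate_logits, k=2):
    """Run only the k routed experts on token x and mix their outputs."""
    routed = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for idx, weight in routed:
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routed]

# Hypothetical toy setup: 8 experts, each a simple scaling function.
NUM_EXPERTS = 8
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
gate_logits = [0.1, -0.3, 2.0, 0.0, 1.5, -1.0, 0.2, 0.4]
output, used = moe_layer([1.0, 2.0], experts, gate_logits, k=2)
```

In this toy setup, adding more experts increases the total parameter count, but each token still pays for only k expert evaluations.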
Evaluation methodology
Gemini results were run via the AI Studio API with default sampling settings, and multiple trials were averaged for smaller benchmarks to reduce variance. Non-Gemini results were sourced from providers' self-reported numbers unless otherwise noted; Vibe-Eval used Gemini as the judge. The MRCR v2 methodology now uses a harder 8-needle evaluation, so those results are not directly comparable to previously published figures. The card does not explicitly describe benchmark contamination controls.
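The averaging step can be sketched as follows. The card does not specify the number of trials or the aggregation details, so the trial data and function names here are hypothetical. The sketch only shows why a mean over repeated runs is more stable than a single run on a small benchmark.

```python
from statistics import mean, stdev

def pass_at_1(trial_results):
    """Fraction of problems solved on a single attempt within one trial."""
    return sum(trial_results) / len(trial_results)

def average_over_trials(trials):
    """Mean pass@1 across trials, plus the sample standard deviation between trials."""
    scores = [pass_at_1(t) for t in trials]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread

# Hypothetical data: 3 trials over the same 4 problems (True = solved).
trials = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
avg_score, trial_spread = average_over_trials(trials)
```

With only four problems, individual trial scores here range from 0.5 to 1.0, which is the kind of run-to-run variance that averaging is meant to dampen.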
Safety testing
Google DeepMind conducted human red teaming by specialist teams, automated red teaming at scale, continuous training-phase evaluations, and independent assurance evaluations held out from the model team to prevent overfitting. The internal Responsibility and Safety Council reviewed capability assessments and made release decisions. Frontier Safety Framework (FSF) evaluations covered four domains (CBRN, cybersecurity, ML R&D, and deceptive alignment), and no Critical Capability Level (CCL) was reached in any domain for any version. The Cyber Uplift Level 1 alert threshold was reached in earlier versions, triggering a response plan with more frequent testing and accelerated mitigations; the GA version did not reach this threshold. Deceptive Alignment evaluations for Gemini 2.5 Pro GA are stated as "not yet completed."
Mitigations
Safety mitigations span the full training and deployment lifecycle: dataset filtering, conditional pre-training, supervised fine-tuning, reinforcement learning from human and critic feedback, safety policies and desiderata, and product-level safety filtering. In response to the Cyber Uplift Level 1 alert threshold being reached in earlier versions, an active response plan involving higher-frequency testing and accelerated mitigations is in place.
Deployment and access
Gemini 2.5 Pro reached general availability in June 2025 and is accessible via the AI Studio API. Three versioned model IDs are documented: gemini-2.5-pro-preview-06-05 (GA), gemini-2.5-pro-preview-05-06, and gemini-2.5-pro-exp-03-25. The card does not disclose specific license terms, pricing tiers, or detailed access restrictions.
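As a rough illustration of how one of these model IDs would be used, the sketch below builds (but does not send) a text-only request. The card itself does not document the request format. The endpoint URL, the x-goog-api-key header, and the body shape are assumptions based on the public Gemini API REST interface and should be checked against current documentation. The build_request helper and the placeholder key are hypothetical.

```python
import json
import urllib.request

# Assumption: base URL of the public Gemini API (not stated in the model card).
API_BASE = "https://generativelanguage.googleapis.com/v1beta"

def build_request(model_id, prompt, api_key):
    """Construct a POST request for a single text prompt without sending it."""
    url = f"{API_BASE}/models/{model_id}:generateContent"
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )

# Model ID taken from the card; the API key is a placeholder.
req = build_request("gemini-2.5-pro-preview-06-05", "Hello", api_key="YOUR_API_KEY")
# Sending would be: urllib.request.urlopen(req), which needs a valid key and network access.
```

Pinning a versioned ID in this way keeps results tied to one documented model version, not to whatever an alias currently points to.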
Limitations
Google DeepMind flags hallucinations, weak causal understanding, failures in complex logical deduction, and limited counterfactual reasoning as known limitations of the model. The knowledge cutoff is January 2025. Known safety limitations include over-refusals and responses that can still come across as "preachy," though tone and instruction following have improved over Gemini 1.5. A slight increase in image-to-text safety violation rates (+1.8% versus Gemini 1.5 Pro 002) was observed for the GA version; manual review found the losses to be "overwhelmingly either false positives or not egregious."
What's new
The June 27, 2025 update added GA evaluation results alongside existing Experimental (03-25) and Preview (05-06) data, and changed the model's deployment status to general availability. The benchmark comparison table was expanded to include Claude 4 Opus, Claude 4 Sonnet, and DeepSeek R1 05-28. The MRCR v2 benchmark was updated to a harder 8-needle variant, breaking comparability with prior results. No new CCL alert thresholds were reached in GA evaluations, though Deceptive Alignment testing for the GA version remains incomplete.