Model Cards / Google DeepMind

Gemini 2.5 Flash Model Card

model card2,895 words·13 min read·Mar 31, 2026·Source
Summary

Gemini 2.5 Flash Model Card

A 676-word brief of a 2,895-word document. Published by Google DeepMind. Version dated Mar 31, 2026.
01

What this is

Gemini 2.5 Flash is a natively multimodal reasoning model from Google DeepMind, positioned as the next iteration in the Gemini 2.0 series. The model card was last updated December 2025 and covers three output variants: Gemini 2.5 Flash (text), Gemini 2.5 Flash Image, and Gemini 2.5 Flash Audio. Google describes it as their "first fully hybrid reasoning model," allowing developers to toggle thinking on or off and set thinking budgets to balance quality, cost, and latency.

02

Capabilities

The model accepts text, images, audio, and video inputs with a 1M token context window, and produces text (64K token output), images (32K), or audio (32K) depending on variant. Key benchmark scores for the Preview (09-2025) thinking variant include: GPQA Diamond 80.8%, AIME 2025 75.6%, LiveCodeBench v5 71.7%, MMMU 80.3%, Humanity's Last Exam 13.2%, and FACTS Grounding 87.5%. The architecture is a sparse mixture-of-experts (MoE) transformer with native multimodal support; the MoE design decouples total model capacity from per-token compute cost. Gemini 2.5 Flash Image ranked first on LMArena for both text-to-image and image editing as of August 25, 2025.

03

Evaluation methodology

Gemini results use pass@1 with no majority voting unless stated, run via the AI Studio API across model IDs including gemini-2.5-flash-preview-09-2025, gemini-2.5-flash-preview-05-20, gemini-2.5-flash-preview-04-17, and gemini-2.0-flash with default sampling; smaller benchmarks were averaged over multiple trials to reduce variance. Non-Gemini comparison numbers are sourced from providers' self-reported figures unless otherwise noted; SWE-bench Verified numbers use each provider's own scaffolding. Contamination controls are not explicitly described. Gemini 2.5 Flash Image evaluations used human preference ratings via GenAI-Bench and LMArena alongside automatic prompt-alignment and image-quality metrics.

04

Safety testing

Safety evaluation types included continuous automated and human training-phase evaluations, human red teaming by specialist teams, automated red teaming at scale, independent assurance evaluations, and ethics and safety reviews prior to release. Testing also followed Google DeepMind's Frontier Safety Framework (FSF); however, rather than running a full frontier safety assessment on Flash directly, the card states that "as Gemini 2.5 Flash is less capable than Gemini 2.5 Pro Preview" the FSF results reported in the Pro Preview card provide sufficient confidence that Flash does not reach critical capability levels. Automated safety evaluations relative to Gemini 2.0 Flash show violation rate reductions: text-to-text safety improved by +9.1%, multilingual safety by +12.0%, and image-to-text safety by +6.0% (all labeled non-egregious). Manual review of flagged losses confirmed they were "overwhelmingly either false positives or not egregious," concentrated around creative-use requests for sexually suggestive or hateful content.

05

Mitigations

Safety and responsibility measures span the full training and deployment lifecycle. Mitigations include dataset filtering, conditional pre-training, supervised fine-tuning, reinforcement learning from human and critic feedback, safety policies and desiderata, and product-level safety filtering. The card notes that improved instruction-following training makes the model "more willing to engage with prompts that previous models may have incorrectly refused," with ongoing refinement of automated evaluations to reduce false positives and negatives.

06

Deployment and access

Deployment status is listed as "general availability." The model is accessible via the AI Studio API. No information on licensing terms, pricing tiers, geographic restrictions, or enterprise access controls is disclosed in this document.

07

Limitations

The card flags general foundation-model limitations including hallucinations, weak causal understanding, complex logical deduction, and counterfactual reasoning. Adherence to thinking budgets "may not be consistent." Gemini 2.5 Flash Image may struggle with long-form text rendering and factual representation of fine image details; Gemini 2.5 Flash Audio may exhibit pronunciation errors and voice drift on long multi-turn conversations. The knowledge cutoff for all three variants is January 2025. The main identified safety limitation is tone: the model "will sometimes respond in a way which can come across as 'preachy.'"

08

What's new

Gemini 2.5 Flash is presented as a successor to Gemini 2.0 Flash, adding hybrid reasoning (toggleable thinking and configurable thinking budgets) as a core differentiator. Native image generation and audio output are new modalities described in this card version. The card notes that evaluation processes and benchmark sets have been updated, making results "not directly comparable with performance results found in previous Gemini model cards"; specifically, the MRCR v2 evaluation shifted to a harder 8-needle version and SimpleQA was replaced with SimpleQA Verified.

Generated by Claude sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original: original document · source SHA 001fc65ad2de.

Extracted Evaluations(41 results)

Sort by:41 evals
BenchmarkCategoryStateScoreVariantSource
SWE-benchcodingscored54.0-self-reported
SWE-benchcodingscored49.2extended reasoningself-reported
MMLUgeneral_knowledgescored87.9-self-reported
AIMEmathscored93.3multiple attempts, extended reasoningself-reported
AIMEmathscored77.3single attempt, pass@1, extended reasoningself-reported
AIMEmathscored75.6single attempt, pass@1self-reported
AIMEmathscored49.5single attempt, pass@1, 64k extended thinkingself-reported
MMMUmultimodalscored80.3single attempt, pass@1self-reported
MMMUmultimodalscored78.0multiple attempts, extended reasoningself-reported
MMMUmultimodalscored76.0single attempt, pass@1, extended reasoningself-reported
MMMUmultimodalscored75.0single attempt, pass@1, 64k extended thinkingself-reported
Overall Preference (LMArena) Image Editingotherscored1362.0-self-reported
Image Editing Character (GenAI-Bench)otherscored1170.0-self-reported
Overall Preference (LMArena) Text-to-Imageotherscored1147.0-self-reported
Image Editing Product Recontextualization (GenAI-Bench)otherscored1128.0-self-reported
Image Editing Creative (GenAI-Bench)otherscored1112.0-self-reported
Image Editing Infographics (GenAI-Bench)otherscored1067.0-self-reported
Image Editing Object / Environment (GenAI-Bench)otherscored1064.0-self-reported
Image Editing Stylization (GenAI-Bench)otherscored1062.0-self-reported
Text-to-Image Alignment (GenAI-Bench)otherscored1042.0-self-reported
FACTS Groundingotherscored87.5-self-reported
LiveCodeBench v5otherscored79.4multiple attempts, extended reasoningself-reported
FACTS Groundingotherscored74.8extended reasoningself-reported
MRCR v2otherscored74.0128k, averageself-reported
LiveCodeBench v5otherscored71.7single attempt, pass@1self-reported
LiveCodeBench v5otherscored70.6single attempt, pass@1, extended reasoningself-reported
Aider Polyglototherscored64.9diff, 32k extended thinkingself-reported
Vibe-Eval (Reka)otherscored62.0-self-reported
Aider Polyglototherscored61.9wholeself-reported
Aider Polyglototherscored56.7diffself-reported
MRCR v2otherscored54.0128k, average, extended reasoningself-reported
Aider Polyglototherscored53.3diff, extended reasoningself-reported
SimpleQAotherscored43.6extended reasoningself-reported
MRCR v2otherscored32.01M, pointwiseself-reported
SimpleQAotherscored25.3-self-reported
Humanity's Last Examotherscored13.2no tools, pass@1self-reported
Humanity's Last Examotherscored8.6no tools, text onlyself-reported
GPQA-Diamondreasoningscored84.8multiple attemptsself-reported
GPQA-Diamondreasoningscored80.8single attempt, pass@1self-reported
GPQA-Diamondreasoningscored80.2single attempt, pass@1, extended reasoningself-reported
GPQA-Diamondreasoningscored78.2single attempt, pass@1, 64k extended thinkingself-reported