Gemini Technical Report

Model card · 32,297 words · 140 min read · Mar 31, 2026
Summary

A 702-word brief of a 32,297-word document. Published by Google DeepMind. Version dated Mar 31, 2026.
01

What this is

Gemini is a family of multimodal models from Google DeepMind, released as version 1.0, comprising three sizes — Ultra, Pro, and Nano — designed for tasks ranging from complex reasoning to on-device deployment. The family supersedes PaLM 2 and is trained jointly on text, image, audio, and video. Two post-trained variants are produced: Gemini Apps models for conversational services (Gemini and Gemini Advanced) and Gemini API models for developer access via Google AI Studio and Cloud Vertex AI.

02

Capabilities

Gemini Ultra achieves 90.04% on MMLU with CoT@32, making it the first model to exceed the human-expert threshold of 89.8% and surpassing the prior state of the art of 86.4%. It scores 94.4% on GSM8K, 53.2% on MATH (4-shot), 74.4% on HumanEval (0-shot), and 62.4% (Maj1@32) on the multimodal benchmark MMMU, more than 5 percentage points above the prior best. The model supports a 32,768-token context window and natively processes interleaved text, image, audio, and video inputs. The Nano variants, at 1.8B (Nano-1) and 3.25B (Nano-2) parameters, target on-device deployment and are 4-bit quantized.
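
To give a rough sense of why 4-bit quantization matters for on-device use, the sketch below estimates weight storage from the parameter counts quoted above. It is back-of-the-envelope arithmetic only: the helper name `weight_footprint_gb` is hypothetical, the 16-bit baseline is an assumed point of comparison that does not come from the report, and the estimate ignores quantization scales, activations, and runtime overhead.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Counts only the raw weights; quantization metadata, activations,
    and any runtime overhead are deliberately ignored.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts are the ones quoted in the summary above.
NANO_PARAMS = {"Nano-1": 1.8e9, "Nano-2": 3.25e9}

for name, params in NANO_PARAMS.items():
    four_bit = weight_footprint_gb(params, 4)
    # Assumption: 16-bit weights as an unquantized baseline for comparison.
    sixteen_bit = weight_footprint_gb(params, 16)
    print(f"{name}: ~{four_bit:.2f} GB at 4-bit vs ~{sixteen_bit:.2f} GB at 16-bit")
```

Under these assumptions the 4-bit weights take one quarter of the space of a 16-bit copy, which is the kind of reduction that makes a multi-billion-parameter model plausible on a phone.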

03

Evaluation methodology

Benchmarks span text, code, image, audio, and video, with comparisons to GPT-4, GPT-3.5, PaLM 2-L, Claude 2, and others. For MMLU, the team samples k=8 or 32 chain-of-thought responses; if agreement on one answer exceeds a threshold chosen on the validation split, that consensus answer is used, and otherwise the evaluation falls back to the greedy-sampled answer. The team conducted an "extensive leaked data analysis after training" and excluded benchmarks with detected contamination (LAMBADA results were dropped entirely), while held-out sets such as Natural2Code and WMT23 reduce leakage risk. Human side-by-side blind evaluations and automated reward-model-based evaluators supplement academic benchmark scores.
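
The MMLU routing rule described above can be sketched in a few lines. This is a minimal illustration, not the team's implementation: the function name `uncertainty_routed_answer` and its parameters are hypothetical, the answers are assumed to be already extracted from each chain-of-thought sample, and treating a vote share exactly equal to the threshold as passing is an assumption.

```python
from collections import Counter
from typing import Sequence


def uncertainty_routed_answer(
    sampled_answers: Sequence[str],
    greedy_answer: str,
    threshold: float,
) -> str:
    """Pick the consensus answer if it is confident enough, else the greedy one.

    sampled_answers: final answers extracted from k chain-of-thought samples.
    greedy_answer:   answer from a single greedy decode, used as the fallback.
    threshold:       minimum vote share in [0, 1], tuned on a validation split.
    """
    if not sampled_answers:
        return greedy_answer

    # Majority vote over the sampled answers.
    top_answer, votes = Counter(sampled_answers).most_common(1)[0]
    vote_share = votes / len(sampled_answers)

    # Assumption: a share equal to the threshold counts as consensus.
    return top_answer if vote_share >= threshold else greedy_answer


# Usage: 6 of 8 samples agree on "B" (vote share 0.75).
samples = ["B"] * 6 + ["A"] * 2
confident = uncertainty_routed_answer(samples, greedy_answer="A", threshold=0.7)
fallback = uncertainty_routed_answer(samples, greedy_answer="A", threshold=0.8)
```

In the usage lines, `confident` takes the majority answer because 0.75 clears the 0.7 threshold, while `fallback` reverts to the greedy answer because 0.75 falls short of 0.8.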

04

Safety testing

Dangerous-capability evaluations cover offensive cybersecurity (CTF challenges via Bash shell access), persuasion and deception (1-on-1 human participant studies), self-proliferation (resource acquisition and self-improvement tasks), situational awareness, and CBRN risks. On CTF challenges, models solved "various entry-level, tactical challenges" but "all models struggled with challenges involving longer-range exploration and planning." CBRN evaluation using domain-expert human raters across 50 adversarial questions per category found that "the models are unlikely to provide CBRN information that would lead to catastrophic harm." Red teaming on a December 2023 Gemini API Ultra checkpoint found early model versions "vulnerable to simple jailbreak and prompt injection attacks that produce affirmative responses to requests that include promoting violence, self-harm, and dangerous substances."

05

Mitigations

Safety post-training applies SFT and RLHF targeting approximately 20 harm categories, using a data generation recipe "loosely inspired from Constitutional AI" that injects policy language as constitutions to guide response revision. Pre-training data is filtered for harmful content, and safety-specific preference data feeds the RL reward model separately from general quality data. Multimodal safety SFT datasets were specifically created after finding that text-only safety data was "ineffective for harm-inducing queries containing text and images." Safety filters are deployed as default behavior in Cloud Vertex AI and Gemini Advanced, with developer documentation supporting responsible use.
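
The summary says only that policy language is injected as constitutions to guide response revision; it gives no further detail. The sketch below shows one way such a revision loop could be structured, purely as an illustration. Everything in it is an assumption: `revise_with_policies`, the `revise` callable (a stand-in for a model call), the policy strings, and the output record shape are all hypothetical and are not taken from the report.

```python
from typing import Callable, Dict, Sequence

# Hypothetical signature for a model-backed reviser:
# (prompt, current_response, policy_text) -> revised_response
Reviser = Callable[[str, str, str], str]


def revise_with_policies(
    prompt: str,
    draft: str,
    policies: Sequence[str],
    revise: Reviser,
) -> Dict[str, str]:
    """Revise a draft response once per policy and return an SFT-style record."""
    response = draft
    for policy in policies:
        # Each pass asks the reviser to bring the response in line with one policy.
        response = revise(prompt, response, policy)
    return {"prompt": prompt, "response": response}


def toy_revise(prompt: str, response: str, policy: str) -> str:
    """Stand-in reviser that only annotates the text; a real one would call a model."""
    return f"{response} [revised per: {policy}]"


record = revise_with_policies(
    prompt="example prompt",
    draft="example draft",
    policies=["policy A", "policy B"],
    revise=toy_revise,
)
```

Passing the reviser in as a callable keeps the loop testable without a model and makes explicit that the revision step itself is where the unspecified detail lives.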

06

Deployment and access

Gemini 1.0 is available through the Gemini consumer service (Pro 1.0) and Gemini Advanced (Ultra 1.0), as well as through Google AI Studio (free, API-key access) and Cloud Vertex AI (enterprise, with built-in security and privacy settings). Access is governed by the Generative AI Prohibited Use Policy, Google Terms of Service, and Generative AI Terms of Service. Internal model cards are created for each approved version; external model and system cards are released in technical report updates and enterprise documentation.

07

Limitations

The team acknowledges that benchmark results "are susceptible to the pretraining dataset composition," and contamination concerns led to dropping LAMBADA results entirely. Image and video models "can make ungrounded inferences" when prompted, though no consistent group-based patterns were observed. Performance is worse on images from lower socioeconomic regions and from outside North America and Europe, which the team flags as an area requiring further research. Gemini Ultra was not evaluated on audio tasks at the time of publication; the team expects increased scale to improve performance, but this remains untested.

08

What's new

This report describes Gemini 1.0, the first version of the Gemini family, and does not enumerate deltas from a prior Gemini release. The arXiv submission carries version identifier v5 dated May 2025, but no changelog entries appear in the source text. AlphaCode 2, a Gemini Pro-powered competitive programming agent, is introduced alongside this report and reaches an estimated 85th-percentile ranking on Codeforces, compared to the 50th percentile of its predecessor AlphaCode.

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 71d29329e2b6).

Extracted Evaluations (29 results)

| Benchmark | Category | State | Score | Variant | Source |
|---|---|---|---|---|---|
| HumanEval | coding | scored | 74.4 | instruction-tuned | self-reported |
| MBPP | coding | scored | 20.0 | - | self-reported |
| MMLU | general_knowledge | scored | 90.0 | CoT with uncertainty routing | self-reported |
| MMLU | general_knowledge | scored | 86.4 | - | self-reported |
| MMLU | general_knowledge | scored | 45.9 | 5-shot | self-reported |
| GSM8K | math | scored | 94.4 | CoT + self-consistency | self-reported |
| MATH | math | scored | 53.2 | 4-shot | self-reported |
| MATH | math | scored | 32.0 | - | self-reported |
| MGSM | multilingual | scored | 79.0 | 8-shot | self-reported |
| MMMU | multimodal | scored | 62.4 | - | self-reported |
| Key-Value Retrieval (synthetic) | other | scored | 98.0 | full 32K context | self-reported |
| Natural2Code | other | scored | 74.9 | - | self-reported |
| WMT23 Out-of-English | other | scored | 74.8 | 1-shot | self-reported |
| WMT23 Mid Resource | other | scored | 74.7 | 1-shot | self-reported |
| WMT23 All Languages | other | scored | 74.4 | 1-shot | self-reported |
| WMT23 High Resource | other | scored | 74.2 | 1-shot | self-reported |
| WMT23 Into-English | other | scored | 73.9 | 1-shot | self-reported |
| BoolQ | other | scored | 71.6 | - | self-reported |
| TydiQA (GoldP) | other | scored | 68.9 | - | self-reported |
| Wikilingua | other | scored | 50.4 | 3-shot | self-reported |
| Wikilingua | other | scored | 48.9 | 5-shot | self-reported |
| Wikilingua | other | scored | 47.8 | - | self-reported |
| NaturalQuestions (Retrieved) | other | scored | 38.6 | - | self-reported |
| Low-resource Translation (Flores/NTREX) | other | scored | 27.0 | 1-shot | self-reported |
| NaturalQuestions (Closed-book) | other | scored | 18.8 | - | self-reported |
| XLsum | other | scored | 17.6 | 3-shot | self-reported |
| HellaSwag | reasoning | scored | 92.3 | 1-shot | self-reported |
| HellaSwag | reasoning | scored | 89.6 | 1-shot (fine-tuned) | self-reported |
| BIG-Bench Hard | reasoning | scored | 34.8 | 3-shot | self-reported |