Gemini Technical Report
What this is
Gemini 1.0 is a family of multimodal models from Google DeepMind in three sizes: Ultra, Pro, and Nano. The sizes cover tasks ranging from complex reasoning to on-device deployment. The family supersedes PaLM 2 and is trained jointly on text, image, audio, and video. Two post-trained variants are produced: Gemini Apps models for conversational services (Gemini and Gemini Advanced) and Gemini API models for developer access via Google AI Studio and Cloud Vertex AI.
Capabilities
Gemini Ultra achieves 90.04% on MMLU with CoT@32, making it the first model to exceed the human-expert threshold of 89.8% and surpassing the prior state of the art of 86.4%. It scores 94.4% on GSM8K, 53.2% on MATH (4-shot), 74.4% on HumanEval (0-shot), and 62.4% (Maj1@32) on the multimodal benchmark MMMU, more than 5 percentage points above the prior best. The model supports a 32,768-token context window and natively processes interleaved text, image, audio, and video inputs. The Nano variants, at 1.8B (Nano-1) and 3.25B (Nano-2) parameters, target on-device deployment and are 4-bit quantized.
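As a rough back-of-the-envelope check on what 4-bit quantization implies for on-device use, the sketch below estimates weight storage alone for the two Nano sizes. It assumes every parameter is stored at exactly 4 bits and ignores quantization scales, activations, and any inference-time cache, none of which are quantified here. The function name and the 16-bit comparison point are illustrative assumptions, not figures from the report.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int = 4) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Parameter counts as stated for the Nano variants.
NANO_SIZES = {"Nano-1": 1.8e9, "Nano-2": 3.25e9}

for name, params in NANO_SIZES.items():
    four_bit = weight_footprint_gb(params)
    sixteen_bit = weight_footprint_gb(params, bits_per_param=16)  # assumed baseline
    print(f"{name}: ~{four_bit:.3f} GB at 4 bits vs ~{sixteen_bit:.3f} GB at 16 bits")
```

Under these assumptions the weights come to roughly 0.9 GB for Nano-1 and about 1.6 GB for Nano-2, a quarter of what a 16-bit representation would need.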
Evaluation methodology
Benchmarks span text, code, image, audio, and video, with comparisons to GPT-4, GPT-3.5, PaLM 2-L, Claude 2, and others. For MMLU, the model draws k = 8 or 32 chain-of-thought samples; if agreement on the majority answer exceeds a threshold tuned on the validation split, that consensus answer is selected, and otherwise the greedy-decoded answer is used. The team conducted an "extensive leaked data analysis after training" and excluded benchmarks with detected contamination (LAMBADA results were dropped entirely), while held-out sets such as Natural2Code and WMT23 reduce leakage risk. Human side-by-side blind evaluations and automated reward-model-based evaluators supplement academic benchmark scores.
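The MMLU answer-selection rule can be sketched as a small routing function. This is a minimal illustration, not the report's implementation: the function name, the use of vote share as the confidence measure, and the example threshold of 0.6 are all assumptions. The final answers extracted from each sampled chain of thought are taken as given.

```python
from collections import Counter
from typing import Sequence


def uncertainty_routed_answer(
    sampled_answers: Sequence[str],
    greedy_answer: str,
    threshold: float,
) -> str:
    """Return the majority answer if its vote share is above the threshold,
    otherwise fall back to the greedy-decoded answer."""
    if not sampled_answers:
        return greedy_answer
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    agreement = votes / len(sampled_answers)  # share of samples backing the top answer
    return answer if agreement > threshold else greedy_answer


# k = 8 samples with a clear majority: the consensus answer is used.
confident = ["B", "B", "B", "B", "B", "C", "C", "A"]
print(uncertainty_routed_answer(confident, greedy_answer="C", threshold=0.6))

# k = 8 samples with no consensus: the greedy answer is used instead.
split = ["A", "B", "C", "D", "A", "B", "C", "D"]
print(uncertainty_routed_answer(split, greedy_answer="C", threshold=0.6))
```

In practice the threshold would be chosen per model on the validation split, as the text describes; the fixed value here only makes the two branches visible.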
Safety testing
Dangerous-capability evaluations cover offensive cybersecurity (CTF challenges via Bash shell access), persuasion and deception (1-on-1 human participant studies), self-proliferation (resource acquisition and self-improvement tasks), situational awareness, and CBRN risks. On CTF challenges, models solved "various entry-level, tactical challenges" but "all models struggled with challenges involving longer-range exploration and planning." CBRN evaluation using domain-expert human raters across 50 adversarial questions per category found that "the models are unlikely to provide CBRN information that would lead to catastrophic harm." Red teaming on a December 2023 Gemini API Ultra checkpoint found early model versions "vulnerable to simple jailbreak and prompt injection attacks that produce affirmative responses to requests that include promoting violence, self-harm, and dangerous substances."
Mitigations
Safety post-training applies SFT and RLHF targeting approximately 20 harm categories, using a data generation recipe "loosely inspired from Constitutional AI" that injects policy language as constitutions to guide response revision. Pre-training data is filtered for harmful content, and safety-specific preference data feeds the RL reward model separately from general quality data. Multimodal safety SFT datasets were specifically created after finding that text-only safety data was "ineffective for harm-inducing queries containing text and images." Safety filters are deployed as default behavior in Cloud Vertex AI and Gemini Advanced, with developer documentation supporting responsible use.
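One way to picture the constitution-guided revision step is as prompt construction followed by a model call whose output becomes SFT data. The sketch below is purely illustrative: the policy text, prompt wording, record layout, and every function name are hypothetical, and the model call is replaced by a stub, since the report does not publish these details.

```python
from typing import Callable, Dict

# Hypothetical policy text; the report does not publish its actual policy language.
EXAMPLE_POLICY = "Do not give instructions that facilitate self-harm."


def build_revision_prompt(policy: str, query: str, draft: str) -> str:
    """Embed the policy text as a 'constitution' ahead of the draft to be revised."""
    return (
        "Constitution:\n"
        f"{policy}\n\n"
        f"User query:\n{query}\n\n"
        f"Draft response:\n{draft}\n\n"
        "Rewrite the draft so that it complies with the constitution."
    )


def make_sft_example(
    policy: str, query: str, draft: str, revise: Callable[[str], str]
) -> Dict[str, str]:
    """Pair the original query with the revised response as one SFT record."""
    revised = revise(build_revision_prompt(policy, query, draft))
    return {"prompt": query, "response": revised}


def stub_reviser(prompt: str) -> str:
    # Stand-in for a model call: always returns a fixed safe reply.
    return "I can't help with that, but support resources are available."


example = make_sft_example(
    EXAMPLE_POLICY, "example harmful query", "example unsafe draft", stub_reviser
)
print(example)
```

Passing the reviser in as a callable keeps the sketch independent of any particular model API; a real pipeline would substitute a model call and would likely filter or rate the revisions before training on them.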
Deployment and access
Gemini 1.0 is available through the Gemini consumer service (Pro 1.0) and Gemini Advanced (Ultra 1.0), as well as through Google AI Studio (free, API-key access) and Cloud Vertex AI (enterprise, with built-in security and privacy settings). Access is governed by the Generative AI Prohibited Use Policy, Google Terms of Service, and Generative AI Terms of Service. Internal model cards are created for each approved version; external model and system cards are released in technical report updates and enterprise documentation.
Limitations
The team acknowledges that benchmark results "are susceptible to the pretraining dataset composition," and contamination concerns led to dropping LAMBADA results entirely. Image and video models "can make ungrounded inferences" when prompted, though no consistent group-based patterns were observed. Performance is worse on images from lower socioeconomic regions and outside North America and Europe, which the team flags as an area requiring further research. Gemini Ultra was not evaluated on audio tasks at time of publication, with better performance from increased scale expected but untested.
What's new
This report describes Gemini 1.0, the first version of the Gemini family, and does not enumerate deltas from a prior Gemini release. The arXiv submission carries version identifier v5 dated May 2025, but no changelog entries appear in the source text. AlphaCode 2, a Gemini Pro-powered competitive programming agent, is introduced alongside this report and reaches an estimated 85th-percentile ranking on Codeforces, compared to the 50th percentile of its predecessor AlphaCode.