
GPT-4o System Card

Model card · 11,659 words · 51 min read · Mar 31, 2026
Summary

A 682-word brief of an 11,659-word document. Published by OpenAI. Version dated Mar 31, 2026.
01

What this is

GPT-4o is an autoregressive omni model from OpenAI; its system card was published on August 8, 2024. The model accepts any combination of text, audio, image, and video as input and generates text, audio, and image outputs. It is trained end-to-end across all modalities through a single neural network, replacing the pipeline approach of prior models. It matches GPT-4 Turbo performance on English text and code, improves significantly on non-English languages, and is 50% cheaper in the API than GPT-4 Turbo.

02

Capabilities

GPT-4o responds to audio inputs in as little as 232 milliseconds (320 ms on average), comparable to human conversational response time. On MedQA USMLE 4-option 0-shot, it scores 89.4% versus 78.2% for GPT-4 Turbo, exceeding specialized medical models such as Med-Gemini-L 1.0 at 84.0%. On ARC-Easy-Hausa, accuracy rises from 6.1% with GPT-3.5 Turbo to 71.4% with GPT-4o. The card does not specify a context window size.

03

Evaluation methodology

Existing text-based evaluation datasets were converted to audio using OpenAI's Voice Engine TTS system; outputs were scored on their text transcriptions, except for voice-specific evaluations such as voice cloning detection. External red teaming ran across four phases from early March to late June 2024, involving more than 100 red teamers speaking 45 languages across 29 countries, progressing from early checkpoints to final candidates tested via the iOS Advanced Voice Mode. The card acknowledges that TTS conversion may not capture the diversity of real user audio inputs (accents, intonation, background noise) and that transcript-based scoring may miss audio-specific artifacts such as background sounds or out-of-distribution voices.
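The conversion-and-scoring loop described above can be sketched roughly as follows. This is a minimal illustration, not OpenAI's harness: `EvalItem`, `run_audio_eval`, and the four callables (`tts`, `model`, `transcribe`, `grade`) are hypothetical names standing in for the TTS system, the model under test, a transcription step, and a grader.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalItem:
    """One text-based evaluation example (hypothetical structure)."""
    prompt_text: str
    expected: str


def run_audio_eval(
    items: Iterable[EvalItem],
    tts: Callable[[str], bytes],         # text prompt -> synthesized audio (assumed interface)
    model: Callable[[bytes], bytes],     # audio in -> audio out (assumed interface)
    transcribe: Callable[[bytes], str],  # audio out -> text transcript (assumed interface)
    grade: Callable[[str, str], bool],   # (transcript, expected) -> correct?
) -> float:
    """Return the fraction of items whose transcribed audio answer is graded correct."""
    results = []
    for item in items:
        audio_in = tts(item.prompt_text)      # convert the text prompt to audio
        audio_out = model(audio_in)           # query the model with audio only
        transcript = transcribe(audio_out)    # score the text transcription, not the audio
        results.append(grade(transcript, item.expected))
    return sum(results) / len(results) if results else 0.0
```

Because grading only ever sees `transcript`, anything present solely in the audio (background sounds, an unexpected voice) is invisible to the score, which is the limitation the card acknowledges.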

04

Safety testing

OpenAI evaluated GPT-4o under its Preparedness Framework across four risk categories: cybersecurity, CBRN, persuasion, and model autonomy. Cybersecurity scored Low: the model completed 19% of high-school, 0% of collegiate, and 1% of professional CTF challenges. Biological threats scored Low based on uplift studies with experts and novices, with GPT-4o scoring 69% consensus@10 on a tacit knowledge and troubleshooting evaluation. Persuasion scored Medium: the text modality "marginally crossed into medium risk threshold from low risk," exceeding human-written content in 3 of 12 instances, though voice was classified as low risk with AI conversations at 65% of human effect size on opinion shift. Model autonomy scored Low: GPT-4o scored 0% on autonomous replication and adaptation tasks across 100 trials, and Apollo Research concluded it is "unlikely" capable of catastrophic scheming.
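For readers unfamiliar with the consensus@10 metric mentioned above, here is a minimal sketch. It assumes consensus@k means: draw k samples per question, take the most frequent answer, and count the question as correct if that answer matches the reference. The summary does not spell out the exact definition, so treat the function names and the tie-breaking rule as illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Most frequent answer among the samples; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]


def consensus_at_k(
    samples_per_question: Sequence[Sequence[str]],
    correct_answers: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of questions whose majority answer over the first k samples is correct."""
    if len(samples_per_question) != len(correct_answers):
        raise ValueError("need one list of samples per reference answer")
    hits = 0
    for samples, correct in zip(samples_per_question, correct_answers):
        if len(samples) < k:
            raise ValueError(f"expected at least {k} samples per question")
        if consensus_answer(samples[:k]) == correct:
            hits += 1
    return hits / len(correct_answers) if correct_answers else 0.0
```

Under this reading, a 69% consensus@10 score would mean the majority answer was correct on 69% of questions.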

05

Mitigations

GPT-4o is post-trained to refuse unauthorized voice generation, speaker identification requests, ungrounded inference about speaker traits, and disallowed content across audio modalities. A streaming output classifier catches 100% of meaningful deviations from approved preset voices, with English precision 0.96 and recall 1.0. The existing moderation classifier runs over text transcriptions of both audio inputs and outputs to block erotic, violent, and other high-severity categories. Pre-training data was filtered using the Moderation API for CSAM, hateful content, violence, and CBRN material.
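As a reminder of what the precision and recall figures for the voice output classifier mean, the sketch below computes both from confusion counts. The counts in the usage example are hypothetical, chosen only to reproduce 0.96 and 1.0; the summary does not report the underlying counts.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute (precision, recall) from true positives, false positives, false negatives.

    precision = tp / (tp + fp): of the outputs flagged as voice deviations, how many were real.
    recall    = tp / (tp + fn): of the real voice deviations, how many were flagged.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 96 true detections, 4 false alarms, 0 missed deviations.
print(precision_recall(tp=96, fp=4, fn=0))  # (0.96, 1.0)
```

A recall of 1.0 corresponds to the "catches 100%" claim, while a precision of 0.96 implies that a small share of flagged outputs were false alarms.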

06

Deployment and access

GPT-4o is available through the ChatGPT interface and the OpenAI API at 50% lower cost than GPT-4 Turbo; Advanced Voice Mode was in limited alpha on iOS at the time of publication. The card does not specify a formal license. OpenAI enforces its Usage Policies — which prohibit intentional deception and circumvention of safeguards — through monitoring in both ChatGPT and the API. External red teaming of the GPT-4o API was described as ongoing at time of writing.

07

Limitations

The card flags degraded safety robustness under low-quality audio, background noise, echoes, and intentional or unintentional interruptions as a known weakness with mitigations described as "nascent or still in development." Red teamers elicited misinformation and conspiracy theories via audio, with concern that audio delivery "may be more persuasive or harmful" than the same content in text. Non-English audio outputs sometimes use a non-native accent, raising concerns about bias. Medical evaluations measure clinical knowledge only and "do not measure utility in real-world workflows," and many benchmarks are described as "increasingly saturated."

08

What's new

GPT-4o extends GPT-4 Turbo with native end-to-end audio and video modalities processed by a single unified neural network rather than a pipeline, and adds low-latency speech-to-speech via Advanced Voice Mode. It substantially improves non-English language performance, for example narrowing the English-to-Hausa ARC-Easy gap from roughly 54 percentage points (GPT-3.5 Turbo) to under 20. The card does not include a formal version changelog beyond these capability deltas.

Generated by Claude sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original: original document · source SHA 98147ebd552e.

Extracted Evaluations (20 results)

| Benchmark | Category | State | Score | Variant | Source |
|---|---|---|---|---|---|
| SWE-bench | coding | scored | 19.0 | pass@1, n=477 | self-reported |
| MMLU | general_knowledge | scored | 95.0 | 5-shot | self-reported |
| MMLU | general_knowledge | scored | 93.0 | 0-shot | self-reported |
| OpenAI Research Coding Interview | other | scored | 95.0 | pass@100 | self-reported |
| OpenAI Interview Multiple Choice | other | scored | 61.0 | cons@32 | self-reported |
| Uhura-Eval Hausa | other | scored | 32.3 | - | self-reported |
| ARC-Easy-Hausa | other | scored | 6.1 | - | self-reported |
| Voice Output Classifier | other | scored | 1.0 | English | self-reported |
| Safety Evaluation - Not Unsafe | other | scored | 0.9 | text | self-reported |
| Voice Output Classifier | other | scored | 0.9 | Non-English | self-reported |
| Safety Evaluation - Not Unsafe | other | scored | 0.9 | audio | self-reported |
| Speaker identification | other | scored | 0.8 | - | self-reported |
| Safety Evaluation - Not Over-refuse | other | scored | 0.8 | audio | self-reported |
| Safety Evaluation - Not Over-refuse | other | scored | 0.8 | text | self-reported |
| MedMCQA Dev | other | scored | 0.7 | 5-shot | self-reported |
| MedMCQA Dev | other | scored | 0.7 | 0-shot | self-reported |
| Ungrounded Inference / Sensitive Trait Attribution | other | scored | 0.6 | - | self-reported |
| METR | other | scored | 0.0 | 0/10 trials | self-reported |
| Autonomous Replication and Adaptation (ARA) | other | scored | 0.0 | 100 trials | self-reported |
| TruthfulQA | safety | scored | 28.3 | - | self-reported |