GPT-4o System Card

Model card · 11,659 words · 51 min read · Mar 31, 2026

Summary

A 682-word brief of an 11,659-word document. Published by OpenAI. Version dated Mar 31, 2026.

What this is

GPT-4o is an autoregressive omni model from OpenAI; its system card was published on August 8, 2024. The model accepts any combination of text, audio, image, and video as input and generates text, audio, and image outputs. It is trained end-to-end across all modalities through a single neural network, replacing the pipeline approach of prior models. It matches GPT-4 Turbo performance on English text and code, improves significantly on non-English languages, and is 50% cheaper in the API.

Capabilities

GPT-4o responds to audio inputs in as little as 232 milliseconds (average 320 ms), comparable to human conversational response time. On MedQA USMLE 4-option 0-shot, it scores 89.4% versus 78.2% for GPT-4 Turbo, exceeding specialized medical models such as Med-Gemini-L 1.0 at 84.0%. On ARC-Easy-Hausa, accuracy rises from 6.1% with GPT-3.5 Turbo to 71.4%. The card does not specify a context window size.

Evaluation methodology

Existing text-based evaluation datasets were converted to audio using OpenAI's Voice Engine TTS system; outputs were scored on their text transcriptions, except for voice-specific evaluations such as voice cloning detection. External red teaming ran across four phases from early March to late June 2024, involving more than 100 red teamers speaking 45 languages across 29 countries, progressing from early checkpoints to final candidates tested via the iOS Advanced Voice Mode. The card acknowledges that TTS conversion may not capture the diversity of real user audio inputs (accents, intonation, background noise) and that transcript-based scoring may miss audio-specific artifacts such as background sounds or out-of-distribution voices.

Safety testing

OpenAI evaluated GPT-4o under its Preparedness Framework across four risk categories: cybersecurity, CBRN, persuasion, and model autonomy. Cybersecurity scored Low: the model completed 19% of high-school, 0% of collegiate, and 1% of professional CTF challenges. Biological threats scored Low based on uplift studies with experts and novices, with GPT-4o scoring 69% consensus@10 on a tacit knowledge and troubleshooting evaluation. Persuasion scored Medium: the text modality "marginally crossed into medium risk threshold from low risk," exceeding human-written content in 3 of 12 instances, whereas voice was classified as low risk, with AI conversations reaching 65% of the human effect size on opinion shift. Model autonomy scored Low: GPT-4o scored 0% on autonomous replication and adaptation tasks across 100 trials, and Apollo Research concluded that it is "unlikely" to be capable of catastrophic scheming.
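The consensus@10 figure can be read as a majority-vote metric: sample several answers per question, take the most common one, and count the question as correct if that answer matches the reference. The sketch below illustrates that reading only; the function names and the toy data are hypothetical, and the card does not describe OpenAI's actual scoring code.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the most common answer among the sampled completions.

    Ties resolve to the answer that appeared first in `samples`.
    """
    return Counter(samples).most_common(1)[0][0]


def consensus_at_k(questions):
    """Fraction of questions whose majority-vote answer equals the reference.

    `questions` is a list of (samples, reference) pairs, where `samples`
    holds the k sampled answers for one question.
    """
    correct = sum(consensus_answer(samples) == ref for samples, ref in questions)
    return correct / len(questions)


# Hypothetical data: three questions with ten sampled answers each.
toy_questions = [
    (["A"] * 7 + ["B"] * 3, "A"),  # majority "A", correct
    (["C"] * 4 + ["D"] * 6, "C"),  # majority "D", incorrect
    (["B"] * 10, "B"),             # unanimous "B", correct
]
score = consensus_at_k(toy_questions)  # two of three questions correct
```

Under this reading, a score of 69% would mean the majority answer was right on 69% of the evaluation questions.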

Mitigations

GPT-4o is post-trained to refuse unauthorized voice generation, speaker identification requests, ungrounded inference about speaker traits, and disallowed content across audio modalities. A streaming output classifier catches 100% of meaningful deviations from approved preset voices, with English precision 0.96 and recall 1.0. The existing moderation classifier runs over text transcriptions of both audio inputs and outputs to block erotic, violent, and other high-severity categories. Pre-training data was filtered using the Moderation API for CSAM, hateful content, violence, and CBRN material.
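To make the classifier figures concrete, the following sketch computes precision and recall from confusion counts. The counts are hypothetical, chosen only so that they reproduce the reported English values of 0.96 and 1.0; the card does not publish the underlying counts.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute precision and recall from confusion-matrix counts."""
    flagged = true_pos + false_pos       # everything the classifier flagged
    actual = true_pos + false_neg        # every real voice deviation
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 96 real deviations flagged, 4 false alarms, none missed.
precision, recall = precision_recall(true_pos=96, false_pos=4, false_neg=0)
```

A recall of 1.0 corresponds to the "catches 100%" claim (no deviation goes unflagged), while a precision of 0.96 means roughly 4 in 100 flagged outputs were not actual deviations.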

Deployment and access

GPT-4o is available through the ChatGPT interface and the OpenAI API at 50% lower cost than GPT-4 Turbo; Advanced Voice Mode was in limited alpha on iOS at the time of publication. The card does not specify a formal license. OpenAI enforces its Usage Policies, which prohibit intentional deception and circumvention of safeguards, through monitoring in both ChatGPT and the API. External red teaming of the GPT-4o API was described as ongoing at the time of writing.

Limitations

The card flags degraded safety robustness under low-quality audio, background noise, echoes, and intentional or unintentional interruptions as a known weakness with mitigations described as "nascent or still in development." Red teamers elicited misinformation and conspiracy theories via audio, with concern that audio delivery "may be more persuasive or harmful" than the same content in text. Non-English audio outputs sometimes use a non-native accent, raising concerns about bias. Medical evaluations measure clinical knowledge only and "do not measure utility in real-world workflows," and many benchmarks are described as "increasingly saturated."

What's new

GPT-4o extends GPT-4 Turbo with native end-to-end audio and video modalities processed by a single unified neural network rather than a pipeline, and adds low-latency speech-to-speech via Advanced Voice Mode. It substantially improves non-English language performance, for example narrowing the English-to-Hausa ARC-Easy gap from roughly 54 percentage points (GPT-3.5 Turbo) to under 20. The card does not include a formal version changelog beyond these capability deltas.
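The "under 20" figure can be checked against the GPT-4o ARC-Easy scores listed in the extracted evaluations table. The sketch below does that arithmetic; the variable and function names are illustrative, and the GPT-3.5 Turbo side of the comparison is not recomputed because its English and Hausa scores are not given in this brief.

```python
# GPT-4o ARC-Easy 0-shot accuracy (%), copied from the extracted evaluations table.
ARC_EASY_GPT4O = {
    "English": 94.8,
    "Swahili": 86.5,
    "Hausa": 75.4,
    "Amharic": 71.4,
    "Northern Sotho": 70.0,
    "Yoruba": 65.8,
}


def gap_to_english(scores, language):
    """Percentage-point gap between English and another language."""
    return round(scores["English"] - scores[language], 1)


gaps = {
    lang: gap_to_english(ARC_EASY_GPT4O, lang)
    for lang in ARC_EASY_GPT4O
    if lang != "English"
}
hausa_gap_under_20 = gaps["Hausa"] < 20  # 94.8 - 75.4 = 19.4 points
```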

Extracted Evaluations (51 results)

0/51 rows fully reproducible (0%)

Benchmark	Category	State	Score	Setup	Source
/ verified	coding	scored	19.0% pass at 1	with-tools; missing: shot count, language, training state	self-reported
/ medical_genetics	knowledge	scored	1.0 accuracy	0-shot; missing: method, language, training state	self-reported
/ college_biology	knowledge	scored	0.9 accuracy	0-shot; missing: method, language, training state	self-reported
/ college_biology	knowledge	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
/ medical_genetics	knowledge	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
/ professional_medicine	knowledge	scored	0.9 accuracy	0-shot; missing: method, language, training state	self-reported
/ professional_medicine	knowledge	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
/ clinical_knowledge	knowledge	scored	0.9 accuracy	0-shot; missing: method, language, training state	self-reported
/ clinical_knowledge	knowledge	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
/ college_medicine	knowledge	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
/ anatomy	knowledge	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
/ anatomy	knowledge	scored	0.9 accuracy	0-shot; missing: method, language, training state	self-reported
/ college_medicine	knowledge	scored	0.8 accuracy	0-shot; missing: method, language, training state	self-reported
ARC/ easy	other	scored	94.8 accuracy	0-shot; EN; missing: method, training state	self-reported
ARC/ easy	other	scored	86.5 accuracy	0-shot; Swahili; missing: method, training state	self-reported
ARC/ easy	other	scored	75.4 accuracy	0-shot; Hausa; missing: method, training state	self-reported
ARC/ easy	other	scored	71.4 accuracy	0-shot; Amharic; missing: method, training state	self-reported
ARC/ easy	other	scored	70.0 accuracy	0-shot; Northern Sotho; missing: method, training state	self-reported
OpenAI Biorisk Evaluation/ tacit_knowledge_troubleshooting	other	scored	69.0 consensus at 10	missing: shot count, method, language, training state	self-reported
ARC/ easy	other	scored	65.8 accuracy	0-shot; Yoruba; missing: method, training state	self-reported
OpenAI Interview/ multiple_choice	other	scored	61.0 consensus at 32	missing: shot count, method, language, training state	self-reported
Uhura-Eval	other	scored	60.5 accuracy	0-shot; Yoruba; missing: method, training state	self-reported
Uhura-Eval	other	scored	59.4 accuracy	0-shot; Hausa; missing: method, training state	self-reported
Uhura-Eval	other	scored	44.2 accuracy	0-shot; Amharic; missing: method, training state	self-reported
CTF/ high_school	other	scored	19.0 pass at 10	missing: shot count, method, language, training state	self-reported
CTF/ professional	other	scored	1.0 pass at 10	missing: shot count, method, language, training state	self-reported
MedQA/ taiwan	other	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
MedQA/ taiwan	other	scored	0.9 accuracy	0-shot; missing: method, language, training state	self-reported
MedQA/ usmle_4_options	other	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
MedQA/ usmle_4_options	other	scored	0.9 accuracy	0-shot; missing: method, language, training state	self-reported
MedQA/ usmle_5_options	other	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
MedQA/ usmle_5_options	other	scored	0.9 accuracy	0-shot; missing: method, language, training state	self-reported
MedQA/ mainland_china	other	scored	0.9 accuracy	5-shot; missing: method, language, training state	self-reported
MedQA/ mainland_china	other	scored	0.8 accuracy	0-shot; missing: method, language, training state	self-reported
MedMCQA/ dev	other	scored	0.8 accuracy	5-shot; missing: method, language, training state	self-reported
MedMCQA/ dev	other	scored	0.8 accuracy	0-shot; missing: method, language, training state	self-reported
CTF/ collegiate	other	scored	0.0 pass at 10	missing: shot count, method, language, training state	self-reported
/ ml_engineering	other	scored	0.0 accuracy	missing: shot count, method, language, training state	self-reported
ARA	other	scored	0.0 accuracy	missing: shot count, method, language, training state	self-reported
	other	mentioned	—	missing: shot count, method, language, training state	self-reported
Instrumental Self-Modification	other	mentioned	—	missing: shot count, method, language, training state	self-reported
Theory of Mind	other	mentioned	—	missing: shot count, method, language, training state	self-reported
SAD	other	mentioned	—	missing: shot count, method, language, training state	self-reported
Instrumental Alignment Faking	other	mentioned	—	missing: shot count, method, language, training state	self-reported
Theory of Mind Tasks	other	mentioned	—	missing: shot count, method, language, training state	self-reported
	safety	scored	81.4% accuracy	0-shot; EN; missing: method, training state	self-reported
	safety	scored	64.4% accuracy	0-shot; Swahili; missing: method, training state	self-reported
	safety	scored	59.2% accuracy	0-shot; Hausa; missing: method, training state	self-reported
	safety	scored	59.1% accuracy	0-shot; Northern Sotho; missing: method, training state	self-reported
	safety	scored	55.4% accuracy	0-shot; Amharic; missing: method, training state	self-reported
	safety	scored	51.1% accuracy	0-shot; Yoruba; missing: method, training state	self-reported
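One plausible reading of the "fully reproducible" tally is that a row qualifies only when none of its setup fields is marked missing. The sketch below applies that assumed rule to three rows taken from the table; the `EvalRow` class and its fields are hypothetical and do not reflect how the page actually computes the figure.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRow:
    """One table row, reduced to the fields needed for the tally."""
    benchmark: str
    score: str
    missing: list = field(default_factory=list)  # setup fields marked "missing"

    @property
    def fully_reproducible(self):
        # Assumed rule: reproducible only if no setup field is missing.
        return not self.missing


rows = [
    EvalRow("ARC/ easy", "94.8 accuracy", ["method", "training state"]),
    EvalRow("CTF/ high_school", "19.0 pass at 10",
            ["shot count", "method", "language", "training state"]),
    EvalRow("Uhura-Eval", "60.5 accuracy", ["method", "training state"]),
]

reproducible = sum(row.fully_reproducible for row in rows)
summary = (f"{reproducible}/{len(rows)} rows fully reproducible "
           f"({reproducible / len(rows):.0%})")
```

Because every row in the table lists at least one missing field, the same rule applied to all 51 rows would give the 0% figure shown above the table.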