GPT-4o System Card
What this is
GPT-4o is an autoregressive omni model from OpenAI, announced in May 2024 and documented in a system card published on August 8, 2024. It accepts any combination of text, audio, image, and video as input and generates text, audio, and image outputs. It is trained end-to-end across text, vision, and audio, so a single neural network processes all inputs and outputs, replacing the pipeline approach of prior models. It matches GPT-4 Turbo performance on English text and code, improves significantly on non-English languages, and is 50% cheaper in the API.
Capabilities
GPT-4o responds to audio inputs in as little as 232 milliseconds (average 320 ms), comparable to human conversational response time. On MedQA USMLE 4-option 0-shot, it scores 89.4% versus 78.2% for GPT-4 Turbo, exceeding specialized medical models such as Med-Gemini-L 1.0 at 84.0%. On ARC-Easy-Hausa, accuracy rises from 6.1% with GPT-3.5 Turbo to 71.4%. The card does not specify a context window size.
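The size of these benchmark changes is easier to compare as percentage-point differences. The sketch below only does arithmetic on the scores quoted in the paragraph above; the helper name `point_gain` and the dictionary keys are illustrative assumptions, not anything defined by the card.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


# Scores (in percent) as quoted in this summary: (baseline, GPT-4o).
scores = {
    "MedQA vs GPT-4 Turbo": (78.2, 89.4),
    "MedQA vs Med-Gemini-L 1.0": (84.0, 89.4),
    "ARC-Easy-Hausa vs GPT-3.5 Turbo": (6.1, 71.4),
}

gains = {name: point_gain(base, new) for name, (base, new) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

On these figures, the MedQA gain over GPT-4 Turbo is 11.2 points, the margin over Med-Gemini-L 1.0 is 5.4 points, and the ARC-Easy-Hausa gain is 65.3 points.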
Evaluation methodology
Existing text-based evaluation datasets were converted to audio using OpenAI's Voice Engine TTS system; outputs were scored on their text transcriptions, except for voice-specific evaluations such as voice cloning detection. External red teaming ran across four phases from early March to late June 2024, involving more than 100 red teamers speaking 45 languages across 29 countries, progressing from early checkpoints to final candidates tested via the iOS Advanced Voice Mode. The card acknowledges that TTS conversion may not capture the diversity of real user audio inputs (accents, intonation, background noise) and that transcript-based scoring may miss audio-specific artifacts such as background sounds or out-of-distribution voices.
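The text-to-audio evaluation loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OpenAI's harness: the four callables (`tts`, `model`, `transcribe`, `grade`) are hypothetical stand-ins supplied by the caller, and no real Voice Engine or GPT-4o API is invoked.

```python
from typing import Callable, Sequence


def evaluate_via_audio(
    prompts: Sequence[str],
    tts: Callable[[str], bytes],         # hypothetical: text -> synthetic audio
    model: Callable[[bytes], bytes],     # hypothetical: audio in -> audio out
    transcribe: Callable[[bytes], str],  # hypothetical: audio -> transcript
    grade: Callable[[str, str], bool],   # hypothetical: (prompt, transcript) -> pass?
) -> float:
    """Return the fraction of prompts whose transcribed response passes grading."""
    if not prompts:
        return 0.0
    passed = 0
    for prompt in prompts:
        audio_in = tts(prompt)             # convert the text task to audio
        audio_out = model(audio_in)        # run the model on the audio input
        text_out = transcribe(audio_out)   # only the transcript is scored
        passed += grade(prompt, text_out)
    return passed / len(prompts)


# Toy stand-ins so the sketch runs end to end; they carry no audio at all.
toy_score = evaluate_via_audio(
    prompts=["hello", "world"],
    tts=lambda text: text.encode("utf-8"),
    model=lambda audio: audio.upper(),
    transcribe=lambda audio: audio.decode("utf-8"),
    grade=lambda prompt, out: out == prompt.upper(),
)
```

Because grading sees only `text_out`, anything present in the audio but absent from the transcript is invisible to the score, which is the limitation the card acknowledges.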
Safety testing
OpenAI evaluated GPT-4o under its Preparedness Framework across four risk categories: cybersecurity, CBRN, persuasion, and model autonomy. Cybersecurity scored Low: the model completed 19% of high-school, 0% of collegiate, and 1% of professional CTF challenges. Biological threats scored Low based on uplift studies with experts and novices, with GPT-4o scoring 69% consensus@10 on a tacit knowledge and troubleshooting evaluation. Persuasion scored Medium: the text modality "marginally crossed into medium risk threshold from low risk," with AI-written content exceeding human-written content in 3 of 12 instances, while voice was classified as low risk, with AI conversations reaching 65% of the human effect size on opinion shift. Model autonomy scored Low: GPT-4o scored 0% on autonomous replication and adaptation tasks across 100 trials, and Apollo Research concluded it is "unlikely" that GPT-4o is capable of catastrophic scheming.
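The consensus@10 figure can be read as a majority-vote metric. The sketch below assumes that consensus@k means sampling k answers per question and scoring the most common one against the reference; this summary does not state the card's exact procedure or tie-breaking rule, so treat both as assumptions. The function names and toy data are illustrative.

```python
from collections import Counter
from typing import Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Most frequent answer among the samples (ties go to the first seen)."""
    return Counter(samples).most_common(1)[0][0]


def consensus_at_k(all_samples: Sequence[Sequence[str]], references: Sequence[str]) -> float:
    """Fraction of questions whose majority answer matches the reference."""
    correct = sum(
        consensus_answer(samples) == ref
        for samples, ref in zip(all_samples, references)
    )
    return correct / len(references)


# Toy data: two questions, ten sampled answers each (k = 10).
toy_samples = [
    ["A"] * 6 + ["B"] * 4,  # majority "A", reference "A": counted correct
    ["C"] * 3 + ["D"] * 7,  # majority "D", reference "C": counted wrong
]
toy_refs = ["A", "C"]
toy_consensus = consensus_at_k(toy_samples, toy_refs)
```

With the toy data, one of the two questions is answered correctly by majority vote, so `toy_consensus` is 0.5.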
Mitigations
GPT-4o is post-trained to refuse unauthorized voice generation, speaker identification requests, ungrounded inference about speaker traits, and disallowed content in audio. A streaming output classifier detects deviations from the approved preset voices; in OpenAI's internal evaluations it catches 100% of meaningful deviations, with English precision of 0.96 and recall of 1.0. The existing moderation classifier runs over text transcriptions of both audio inputs and outputs to block erotic, violent, and other high-severity categories. Pre-training data was filtered using the Moderation API for CSAM, hateful content, violence, and CBRN material.
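To make the reported precision and recall concrete, the following sketch computes both from confusion counts. The counts are invented for illustration and chosen only so that the ratios reproduce the quoted 0.96 and 1.0; the evaluation set sizes are not given in this summary.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (NOT from the card): 96 true voice deviations flagged,
# 4 approved-voice outputs flagged by mistake, 0 deviations missed.
precision, recall = precision_recall(tp=96, fp=4, fn=0)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Read this way, a recall of 1.0 means no deviation in the evaluation set went unflagged, while a precision of 0.96 means roughly 4 in 100 flags landed on output that was in fact an approved voice.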
Deployment and access
GPT-4o is available through the ChatGPT interface and the OpenAI API at 50% lower cost than GPT-4 Turbo; Advanced Voice Mode was in limited alpha on iOS at the time of publication. The card does not specify a formal license. OpenAI enforces its Usage Policies, which prohibit intentional deception and circumvention of safeguards, through monitoring in both ChatGPT and the API. External red teaming of the GPT-4o API was described as ongoing at the time of writing.
Limitations
The card flags degraded safety robustness under low-quality audio, background noise, echoes, and intentional or unintentional interruptions as a known weakness with mitigations described as "nascent or still in development." Red teamers elicited misinformation and conspiracy theories via audio, with concern that audio delivery "may be more persuasive or harmful" than the same content in text. Non-English audio outputs sometimes use a non-native accent, raising concerns about bias. Medical evaluations measure clinical knowledge only and "do not measure utility in real-world workflows," and many benchmarks are described as "increasingly saturated."
What's new
GPT-4o extends GPT-4 Turbo with native end-to-end audio and video modalities processed by a single unified neural network rather than a pipeline, and adds low-latency speech-to-speech via Advanced Voice Mode. It substantially improves non-English language performance, for example narrowing the English-to-Hausa ARC-Easy gap from roughly 54 percentage points (GPT-3.5 Turbo) to under 20. The card does not include a formal version changelog beyond these capability deltas.