GPT-4 System Card
What this is
GPT-4 is the latest large language model in OpenAI's GPT family, trained on internet text to predict the next word and then fine-tuned using reinforcement learning from human feedback (RLHF). This document is a system card that analyzes two specific variants: GPT-4-early, an instruction-following model with minimal safety mitigations, and GPT-4-launch, the version prepared for public deployment with additional safety interventions. The card focuses on safety challenges arising from the model's limitations and capabilities, and the processes OpenAI used to mitigate potential harms. Image capabilities and custom fine-tuning are explicitly out of scope.
Capabilities
GPT-4 demonstrates increased performance in reasoning, knowledge retention, and coding compared to GPT-2 and GPT-3. Its increased coherence allows it to generate content that is "more believable and more persuasive" than prior models, including realistic news articles, tweets, dialogue, and targeted messaging. Red teaming found that GPT-4 "can rival human propagandists in many domains, especially if teamed with a human editor." The model also reduces research time for dual-use information retrieval, with some tasks shortened by several hours compared to traditional search engines.
Evaluation methodology
OpenAI conducted both qualitative and quantitative evaluations, beginning in August 2022. More than 50 external experts with backgrounds in fairness, alignment, chemistry, biorisk, cybersecurity, nuclear risks, law, healthcare, and other domains were recruited for iterative adversarial red teaming. Quantitative evaluations used automated classifiers and human analysis to measure how likely the model is to generate content that violates content policy categories such as hate speech, self-harm advice, and illicit advice. The Alignment Research Center (ARC) conducted a preliminary evaluation of GPT-4's capacity for autonomous replication and resource acquisition. OpenAI acknowledges that its selection of experts skewed toward Western, English-speaking, highly educated participants, which likely influenced which risks were surfaced.
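The quantitative measurement described above reduces to a per-category violation rate: the fraction of sampled completions that a classifier or human reviewer flags. The sketch below illustrates that arithmetic only. The function name, the category labels, and the sample data are hypothetical, and the card does not describe OpenAI's actual evaluation pipeline.

```python
from collections import defaultdict


def violation_rates(records):
    """Return {category: fraction of completions flagged as violating}.

    `records` is an iterable of (category, flagged) pairs, where `flagged`
    is the verdict of some classifier or human reviewer (assumed given).
    """
    totals = defaultdict(int)
    flagged_counts = defaultdict(int)
    for category, flagged in records:
        totals[category] += 1
        if flagged:
            flagged_counts[category] += 1
    return {c: flagged_counts[c] / totals[c] for c in totals}


# Hypothetical verdicts, invented purely for illustration.
sample = [
    ("hate_speech", True),
    ("hate_speech", False),
    ("hate_speech", False),
    ("hate_speech", False),
    ("self_harm_advice", False),
    ("self_harm_advice", False),
    ("illicit_advice", True),
    ("illicit_advice", True),
    ("illicit_advice", False),
    ("illicit_advice", False),
]

rates = violation_rates(sample)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%} flagged")
```

Reporting one rate per category, rather than a single pooled number, keeps a rare but severe category from being hidden by a large benign one.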
Safety testing
Red teaming covered hallucinations, harmful content, representation harms, disinformation, weapons proliferation, privacy, cybersecurity, risky emergent behaviors, and economic impacts. Internal adversarial testing of GPT-4-launch was conducted on March 10, 2023. ARC tested whether GPT-4 could autonomously replicate or gather resources, concluding "the current model is probably not yet capable of autonomously doing so." Dual-use stress testing in nuclear, radiological, biological, and chemical weapons domains found that GPT-4 could shorten adversarial research timelines without sacrificing accuracy, though access to the model alone is described as "an insufficient condition for proliferation."
Mitigations
OpenAI reduced harmful content in the pre-training dataset and fine-tuned GPT-4-launch to refuse instructions such as direct requests for illicit advice. The model was also trained to hallucinate less, using data from prior models including ChatGPT. As a result, GPT-4-launch scores 19 percentage points higher than the latest GPT-3.5 model at avoiding open-domain hallucinations and 29 percentage points higher at avoiding closed-domain hallucinations. Classifiers trained on new risk vectors were incorporated into API monitoring workflows to enforce usage policies. Susceptibility to adversarial prompt exploits, including jailbreaks, was reduced using data from prior model deployments.
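A gain stated in percentage points is an absolute difference between two rates, not a relative improvement. The sketch below makes the distinction concrete. The 50% baseline is a hypothetical value chosen only for illustration, since this summary does not give the underlying GPT-3.5 or GPT-4 scores.

```python
def percentage_point_gain(baseline_pct, new_pct):
    """Absolute difference between two rates, both expressed in percent."""
    return new_pct - baseline_pct


def relative_gain_percent(baseline_pct, new_pct):
    """Improvement expressed as a percentage of the baseline rate."""
    return (new_pct - baseline_pct) * 100 / baseline_pct


# Hypothetical baseline: assume the older model avoids hallucination 50% of
# the time. A 19 percentage point gain then puts the newer model at 69%.
HYPOTHETICAL_BASELINE_PCT = 50.0
new_pct = HYPOTHETICAL_BASELINE_PCT + 19.0

points = percentage_point_gain(HYPOTHETICAL_BASELINE_PCT, new_pct)
relative = relative_gain_percent(HYPOTHETICAL_BASELINE_PCT, new_pct)

print(f"gain in percentage points: {points}")       # 19.0
print(f"relative gain over baseline: {relative}%")  # 38.0%
```

Under this assumed baseline, the same 19-point gain corresponds to a 38% relative improvement. With a different baseline the relative figure changes while the percentage-point figure does not, so the two should not be used interchangeably.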
Deployment and access
GPT-4 is deployed through OpenAI's API under usage policies that prohibit its use in high-risk government decision-making (for example, law enforcement, criminal justice, and migration) and for offering legal or health advice. OpenAI describes its approach as balancing "minimizing risk from deployment, enabling positive use cases, and learning from deployment" through an iterative strategy informed by earlier model releases. Monitoring and policy enforcement are applied at the system level as complements to model-level mitigations. The system card notes that lessons from this deployment are expected to inform future model releases.
Limitations
OpenAI states that mitigations "are limited and remain brittle in some cases," and the system card is explicitly described as not comprehensive. GPT-4-launch continues to exhibit societal biases, reinforce stereotypes, and produce content with potential for harm in adversarial contexts. Refusal-based mitigations can themselves introduce disparate quality-of-service harms across demographic groups and may provide a false sense of assurance. Hallucinations persist and become more dangerous as user trust in the model increases. The card acknowledges that red teaming likely underrepresents risks relevant to non-Western, non-English-speaking populations.
What is new
Compared to earlier GPT models, GPT-4 presents new risk surfaces due to its improved coherence, persuasiveness, and reasoning performance. The system card introduces a structured dual-variant analysis (GPT-4-early versus GPT-4-launch) to isolate the effect of safety mitigations, a methodology not used in prior GPT system cards. The ARC autonomous replication evaluation represents a novel preliminary assessment of an emergent risk category described as speculative but warranting anticipatory study. OpenAI also introduces quantitative, automated safety evaluations run across training checkpoints, enabling faster iteration on safety-relevant model comparisons.
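One way automated evaluations can speed up iteration is by flagging any training checkpoint whose violation rate rises relative to the previous one. The sketch below shows that comparison under stated assumptions. The checkpoint names, the rates, and the tolerance are all hypothetical, and the card does not specify how OpenAI compares checkpoints.

```python
def find_regressions(checkpoint_rates, tolerance=0.0):
    """Return names of checkpoints whose rate rose by more than `tolerance`.

    `checkpoint_rates` is a list of (name, violation_rate) pairs in training
    order. Each checkpoint is compared only with its immediate predecessor.
    """
    regressions = []
    for (_, prev_rate), (name, rate) in zip(checkpoint_rates, checkpoint_rates[1:]):
        if rate - prev_rate > tolerance:
            regressions.append(name)
    return regressions


# Hypothetical evaluation history, invented purely for illustration.
history = [
    ("ckpt-1", 0.12),
    ("ckpt-2", 0.08),
    ("ckpt-3", 0.10),  # rate goes up relative to ckpt-2
    ("ckpt-4", 0.05),
]

regressed = find_regressions(history, tolerance=0.01)
print(regressed)
```

The tolerance absorbs small fluctuations from sampling noise. A real comparison would more likely use a statistical test over many prompts than a fixed threshold.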