Model Card Explorer

Summary

Operator System Card

A 620-word brief of a 5,313-word document. Published by OpenAI. Version dated Mar 31, 2026.

What this is

Operator is OpenAI's Computer-Using Agent (CUA) model, released as a research preview on January 23, 2025. It combines GPT-4o's vision capabilities with reinforcement learning to interpret screenshots and interact with graphical user interfaces via cursor and keyboard, just as a person would. It is designed to perform everyday browser-based tasks, such as ordering groceries, booking reservations, and purchasing tickets, on behalf of users.

Capabilities

Operator works in browser and desktop GUI environments, taking visual input and producing keyboard and cursor actions. On OSWorld, a benchmark for multimodal agents on open-ended OS tasks, the model scores 38.1%, which OpenAI characterizes as not yet highly reliable for OS automation. The card does not disclose a context window size or general language benchmark scores beyond comparisons to GPT-4o on refusal evaluations.

Evaluation methodology

OpenAI conducted internal red teaming with Safety, Security, and Product teams against an unmitigated model, then added initial safeguards before granting access to external red teamers spanning 20 countries and fluent in two dozen languages. External red teamers tested jailbreaks and prompt injections, using mock websites, databases, and emails to avoid real-world harm; OpenAI notes this constraint means findings "may not fully capture the worst-case real-world risks." Frontier risk categories were assessed under OpenAI's Preparedness Framework, with evals adapted for the computer-use setting.

Safety testing

Under the Preparedness Framework, Operator scores "Low" risk for CBRN biorisk tooling (1% task success rate across a biorisk eval set) and "Low" for model autonomy (no main replication task exceeds 10% pass rate); it inherits GPT-4o's "Medium" persuasion and "Low" cybersecurity ratings. On an internal set of 100 projected user tasks, the unmitigated model produced 13 errors, 5 of which were "to some degree, irreversible or possibly severe," including an email sent to the wrong recipient and an incorrectly dated medication reminder. Prompt injection susceptibility measured 62% with no mitigations, 47% with prompting alone, and 23% after full mitigations; manual review found one truly concerning remaining example, which was caught by the prompt injection monitor.
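The prompt injection figures above can be read either as percentage-point drops or as relative reductions against the unmitigated baseline. The following is a minimal sketch of that arithmetic using only the three rates quoted in this summary; the stage labels and the helper name `relative_reduction` are illustrative and do not come from the system card.

```python
# Prompt injection susceptibility rates quoted in the summary (percent).
# The stage labels are illustrative names, not terms from the system card.
susceptibility = {
    "no mitigations": 62.0,
    "prompting only": 47.0,
    "full mitigations": 23.0,
}


def relative_reduction(baseline: float, value: float) -> float:
    """Fraction of the baseline rate removed (0.0 = unchanged, 1.0 = eliminated)."""
    return (baseline - value) / baseline


baseline = susceptibility["no mitigations"]
for stage, rate in susceptibility.items():
    drop_points = baseline - rate
    rel = relative_reduction(baseline, rate)
    print(f"{stage}: {rate:.0f}% ({drop_points:.0f} points lower, {rel:.1%} relative reduction)")
```

On these numbers, full mitigations remove 39 percentage points, or roughly 63% of the unmitigated susceptibility, while prompting alone removes about 24%.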

Mitigations

OpenAI employs a four-layer approach spanning model training, system-level checks, product design, and policy enforcement. Operator refuses 97% of illicit-activity and prohibited-financial-activity tasks on internal eval sets, and achieves 92% recall on confirmation prompts across 607 risky-action scenarios; proactive refusals on high-risk task categories (banking, high-stakes decisions) reach 94% recall. A prompt injection monitor trained on red-team sessions achieves 99% recall and 90% precision, flagging 46 false positives out of 13,704 benign screens. Watch mode automatically pauses execution on high-sensitivity sites (e.g., email) when the user becomes inactive, and a website blocklist prevents navigation to sites that could enable prohibited activities.
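The false-positive count for the prompt injection monitor is easier to interpret as a rate. The sketch below uses only the two counts quoted above (46 flagged out of 13,704 benign screens); the variable names are illustrative. It does not try to reconstruct the recall or precision figures, since this summary does not state the counts behind them.

```python
# Counts quoted in the summary for the prompt injection monitor.
false_positives = 46      # benign screens incorrectly flagged
benign_screens = 13_704   # total benign screens evaluated

# False positive rate: share of benign screens that triggered the monitor.
fpr = false_positives / benign_screens
screens_per_alarm = benign_screens / false_positives

print(f"False positive rate on benign screens: {fpr:.2%}")
print(f"About one false alarm per {screens_per_alarm:.0f} benign screens")
```

This works out to a false positive rate of roughly 0.34%, or about one spurious flag per 298 benign screens.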

Deployment and access

Operator launched as a research preview to a limited user group on January 23, 2025, with plans for gradual broader rollout as monitoring informs improvements. As of March 11, 2025, the underlying CUA model was made available via API under the name computer-use-preview to developers on API Tiers 3–5. All users are subject to OpenAI's Usage Policies; prohibited uses include illicit activity, fraud, impersonation, regulated financial automation (e.g., stock trading), and child sexual abuse material.

Limitations

OpenAI acknowledges that post-deployment novel use cases may produce error patterns not captured in pre-release testing, and that adversaries will craft new prompt injections and jailbreaks that current ML-based defenses cannot fully anticipate. The model performs best on short, repeatable browser tasks and faces challenges with complex environments such as slideshows and calendars. OCR reliability degrades on random-looking strings like DNA sequences, API keys, and cryptocurrency addresses, which contributed to low performance on autonomy and biorisk evals.

What's new

This is the initial system card for Operator; no prior version exists to diff against. One documented post-release update: Section 4.7 was added on March 11, 2025 to cover CUA model API availability (computer-use-preview for Tiers 3–5) and associated incremental risks and mitigations.

Benchmark	Category	State	Score	Setup	Source
	other	scored	100.0	missing: shot count, method, language, training state	self-reported
	other	scored	99.0	missing: shot count, method, language, training state	self-reported
	other	scored	97.0	missing: shot count, method, language, training state	self-reported
	other	scored	94.0	missing: shot count, method, language, training state	self-reported
	other	scored	92.0	missing: shot count, method, language, training state	self-reported
	other	scored	92.0	missing: shot count, method, language, training state	self-reported
	other	scored	90.0	missing: shot count, method, language, training state	self-reported
	other	scored	80.0	missing: shot count, method, language, training state	self-reported
	other	scored	80.0	missing: shot count, method, language, training state	self-reported
	other	scored	79.0	missing: shot count, method, language, training state	self-reported
	other	scored	62.0	missing: shot count, method, language, training state	self-reported
	other	scored	60.0	missing: shot count, method, language, training state	self-reported
	other	scored	55.0	missing: shot count, method, language, training state	self-reported
	other	scored	47.0	missing: shot count, method, language, training state	self-reported
	other	scored	30.0	missing: shot count, method, language, training state	self-reported
	other	scored	23.0	missing: shot count, method, language, training state	self-reported
	other	scored	20.0	missing: shot count, method, language, training state	self-reported
	other	scored	20.0	missing: shot count, method, language, training state	self-reported
	other	scored	13.0	missing: shot count, method, language, training state	self-reported
	other	scored	10.0	missing: shot count, method, language, training state	self-reported
/ overall	other	scored	1.0	missing: shot count, method, language, training state	self-reported
	other	scored	0.0	missing: shot count, method, language, training state	self-reported
	other	scored	0.0	missing: shot count, method, language, training state	self-reported
	other	scored	0.0	missing: shot count, method, language, training state	self-reported
	other	scored	0.0	missing: shot count, method, language, training state	self-reported

Extracted Evaluations (25 results)