Grok 4 Model Card

A 570-word brief of a 3,501-word document. Published by xAI. Version dated Mar 31, 2026.

What this is

Grok 4 is xAI's latest reasoning model, last updated August 20, 2025. It offers advanced reasoning and tool-use capabilities and is positioned as a state-of-the-art system across challenging academic and industry benchmarks. xAI deploys it across two surfaces: Grok 4 Web for consumers and Grok 4 API for enterprise use. The card is organized around xAI's Risk Management Framework (RMF), covering abuse potential, concerning propensities, and dual-use capabilities.

Capabilities

Grok 4 exceeds human expert baselines on two biology benchmarks: BioLP-Bench accuracy is 47% (API) and 44% (Web) against a human expert baseline of 38.4%, and VCT accuracy is 60% (API) and 71% (Web) against a human expert baseline of 22.1%. WMDP accuracy reaches 87–88% on biology, 83–85% on chemistry, and 79% on cybersecurity (API). The CyBench unguided agentic success rate is 43%. Parameter count, context window, and supported modalities are not disclosed.
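The margins over the human expert baselines follow directly from the figures quoted above. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name `margin_over_baseline` are illustrative assumptions, not anything defined by the card.

```python
# Scores and human expert baselines as quoted in this summary (percent accuracy).
# The dictionary layout and all names below are illustrative, not from the card.
SCORES = {
    "BioLP-Bench": {"api": 47.0, "web": 44.0, "human": 38.4},
    "VCT": {"api": 60.0, "web": 71.0, "human": 22.1},
}


def margin_over_baseline(score: float, baseline: float) -> tuple[float, float]:
    """Return (percentage-point gap, ratio to baseline), rounded for display."""
    return round(score - baseline, 1), round(score / baseline, 2)


for bench, row in SCORES.items():
    for surface in ("api", "web"):
        gap, ratio = margin_over_baseline(row[surface], row["human"])
        print(f"{bench} ({surface}): +{gap} points, {ratio}x the human baseline")
```

On these numbers the VCT gap is the larger of the two: the Web surface scores roughly three times the human expert baseline.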

Evaluation methodology

Harmful-query refusal is measured on thousands of queries spanning six languages (English, Spanish, Chinese, Japanese, Arabic, Russian) under three settings: standard, user jailbreak, and system jailbreak; a separate model grades responses. Agentic abuse uses the AgentHarm benchmark; prompt-injection hijacking uses AgentDojo with attack success rate as the primary metric. Deception is assessed via the 1,000-question MASK dataset; sycophancy via Anthropic's answer-sycophancy evaluation; political soft bias via an internal paired-comparison framework. Dual-use knowledge is evaluated on WMDP, BioLP-Bench, and VCT (text-only portions); agentic cyber uses CyBench via the UK AISI Inspect framework; third-party testing assessed end-to-end offensive cyber capability.
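As a rough illustration of how the per-setting refusal metric could be tallied, the sketch below computes an answer rate (the fraction of harmful queries answered rather than refused) for each setting. The record format, the `answered` flag standing in for the grader model's verdict, and the sample data are all hypothetical; the card does not describe its pipeline at this level of detail.

```python
from collections import defaultdict

# Hypothetical graded records: each carries a setting and the grader's verdict.
# Field names and sample values are illustrative assumptions only.
graded = [
    {"setting": "standard", "answered": False},
    {"setting": "standard", "answered": False},
    {"setting": "user_jailbreak", "answered": True},
    {"setting": "user_jailbreak", "answered": False},
    {"setting": "system_jailbreak", "answered": False},
]


def answer_rate_by_setting(records):
    """Fraction of queries answered (not refused) within each setting."""
    totals = defaultdict(int)
    answered = defaultdict(int)
    for rec in records:
        totals[rec["setting"]] += 1
        answered[rec["setting"]] += int(rec["answered"])
    return {s: answered[s] / totals[s] for s in totals}


print(answer_rate_by_setting(graded))
```

An attack success rate can be tallied the same way, as successful attacks divided by attempted attacks.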

Safety testing

The card identifies biology as "the area of highest concern," with Grok 4 achieving performance that "significantly exceed[s] human expert baselines" on both BioLP-Bench and VCT. Strong chemistry capabilities are also noted. Radiological and nuclear capabilities are not evaluated; the card argues that restricted information access and material controls make AI uplift unlikely. On cyber, the card states: "Third-party testing shows that Grok 4's end-to-end offensive cyber capabilities remain below the level of a human professional." The MASK dishonesty rate is 0.43, and the card states xAI is "exploring further mitigations to reduce propensity for deception."

Mitigations

A system prompt encodes xAI's basic refusal policy, instructing the model to decline queries with clear harmful intent, including those involving CBRN weapons, CSAM, violent crimes, and hacking. Model-based input filters on both the API and Web surfaces reject queries involving biological and chemical weapons, self-harm, and CSAM. Narrow, topically focused filters are additionally deployed against bioweapons- and chemical-weapons-related abuse, given Grok 4's high capability scores. For cyber risks, the refusal policy alone is deemed sufficient; radiological and nuclear risks are addressed only by the system prompt. System prompts for consumer products are published at github.com/xai-org/grok-prompts.

Deployment and access

Grok 4 is available via Grok 4 Web (consumer) and Grok 4 API (enterprise), including in the EU. Grok 4 Web does not accept custom system prompts from users. License terms and specific API access tiers or pricing are not disclosed in this card.

Limitations

Radiological and nuclear dual-use capabilities are not evaluated, with the card relying on external controls rather than direct measurement. The MASK dishonesty rate remains 0.43 under current mitigations, and the card says xAI is still exploring further mitigations. The AgentHarm answer rate of 0.14 in the no-attack agentic setting indicates residual willingness to fulfill harmful agentic tasks. Political soft bias averages 0.36, which the card treats as an area of continuing improvement.

What's new

The card describes Grok 4's dual-use capabilities as "a large step up from prior generation models," particularly in biology and cyber. No named predecessor version, explicit changelog, or quantitative comparison table against a prior Grok release is provided in this document.

Extracted Evaluations (15 results)

Benchmark	Category	State	Score	Setup	Source
CyBench	agent	scored	0.4 unguided success rate	without-safeguards; instruction-tuned; missing: shot count; missing: language	self-reported
WMDP/ bio	other	scored	0.9 accuracy	without-safeguards; instruction-tuned; missing: shot count; missing: language	self-reported
WMDP/ chem	other	scored	0.8 accuracy	without-safeguards; instruction-tuned; missing: shot count; missing: language	self-reported
WMDP/ cyber	other	scored	0.8 accuracy	without-safeguards; instruction-tuned; missing: shot count; missing: language	self-reported
VCT/ text_only	other	scored	0.6 accuracy	without-safeguards; instruction-tuned; missing: shot count; missing: language	self-reported
BioLP-Bench	other	scored	0.5 accuracy	without-safeguards; instruction-tuned; missing: shot count; missing: language	self-reported
MASK	other	scored	0.4 dishonesty rate	post-mitigation; instruction-tuned; missing: shot count; missing: language	self-reported
Soft Bias	other	scored	0.4 average bias	post-mitigation; instruction-tuned; missing: shot count; missing: language	self-reported
AgentHarm/ no_attack	other	scored	0.1 answer rate	post-mitigation; instruction-tuned; missing: shot count; missing: language	self-reported
	other	scored	0.1 win rate	without-safeguards; instruction-tuned; missing: shot count; missing: language	self-reported
/ answer_sycophancy	other	scored	0.1 sycophancy rate	post-mitigation; instruction-tuned; missing: shot count; missing: language	self-reported
AgentDojo	other	scored	0.0 attack success rate	post-mitigation; instruction-tuned; missing: shot count; missing: language	self-reported
xAI Refusal Dataset/ system_jailbreak	other	scored	0.0 answer rate	post-mitigation; Average; instruction-tuned; missing: shot count	self-reported
xAI Refusal Dataset/ standard	other	scored	0.0 answer rate	post-mitigation; Average; instruction-tuned; missing: shot count	self-reported
xAI Refusal Dataset/ user_jailbreak	other	scored	0.0 answer rate	post-mitigation; Average; instruction-tuned; missing: shot count	self-reported
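
Rows in this tab-separated layout can be loaded into records for comparison against the prose figures. The sketch below is one way to do that, assuming six tab-separated columns and a Score cell of the form "number metric"; the sample rows abbreviate the Setup column, and the function name `parse_rows` is illustrative.

```python
import csv
import io

# Two sample rows in the table's tab-separated layout.
# The Setup column is abbreviated here for readability (an assumption of this sketch).
SAMPLE = (
    "Benchmark\tCategory\tState\tScore\tSetup\tSource\n"
    "WMDP/ bio\tother\tscored\t0.9 accuracy\twithout-safeguards\tself-reported\n"
    "Soft Bias\tother\tscored\t0.4 average bias\tpost-mitigation\tself-reported\n"
)


def parse_rows(text):
    """Parse tab-separated rows and split Score into a number and a metric name."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        value, _, metric = rec["Score"].partition(" ")
        rows.append({**rec, "value": float(value), "metric": metric})
    return rows


for row in parse_rows(SAMPLE):
    print(row["Benchmark"], row["value"], row["metric"])
```

The table scores appear to be rounded to one decimal place, so the figures in the prose sections (for example, 87–88% on WMDP biology) are the more precise ones.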
