Grok 4 Model Card

Model card · 3,501 words · 15 min read · Mar 31, 2026
Summary

A 570-word brief of a 3,501-word document. Published by xAI. Version dated Mar 31, 2026.
01

What this is

Grok 4 is xAI's latest reasoning model, last updated August 20, 2025. It offers advanced reasoning and tool-use capabilities and is positioned as a state-of-the-art system across challenging academic and industry benchmarks. xAI deploys it across two surfaces: Grok 4 Web for consumers and Grok 4 API for enterprise use. The card is organized around xAI's Risk Management Framework (RMF), covering abuse potential, concerning propensities, and dual-use capabilities.

02

Capabilities

Grok 4 exceeds human expert baselines on two biology benchmarks: BioLP-Bench accuracy is 47% (API) and 44% (Web) against a human expert baseline of 38.4%; VCT accuracy is 60% (API) and 71% (Web) against a human expert baseline of 22.1%. WMDP scores reach 87–88% on biology, 83–85% on chemistry, and 79% on cybersecurity (API). The CyBench unguided agentic success rate is 43%. Parameter count, context window, and supported modalities are not disclosed.
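To make the gap to the human expert baselines concrete, the short Python sketch below subtracts each baseline from the quoted accuracy. The figures are copied from the paragraph above; the dictionary layout and the function name `margin_over_baseline` are illustrative choices, not anything defined by the card.

```python
# Accuracy figures (percent) as quoted in the Capabilities paragraph.
SCORES = {
    "BioLP-Bench": {"api": 47.0, "web": 44.0, "human_expert": 38.4},
    "VCT": {"api": 60.0, "web": 71.0, "human_expert": 22.1},
}


def margin_over_baseline(benchmark: str, surface: str) -> float:
    """Return the model's lead over the human expert baseline, in percentage points."""
    row = SCORES[benchmark]
    # Round to one decimal to hide floating-point noise from the subtraction.
    return round(row[surface] - row["human_expert"], 1)


if __name__ == "__main__":
    for name in SCORES:
        for surface in ("api", "web"):
            lead = margin_over_baseline(name, surface)
            print(f"{name} ({surface}): +{lead} percentage points")
```

On these numbers the lead on VCT is several times larger than the lead on BioLP-Bench, for both surfaces.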

03

Evaluation methodology

Harmful-query refusal is measured on thousands of queries spanning six languages (English, Spanish, Chinese, Japanese, Arabic, Russian) under three settings: standard, user jailbreak, and system jailbreak; a separate model grades responses. Agentic abuse uses the AgentHarm benchmark; prompt-injection hijacking uses AgentDojo with attack success rate as the primary metric. Deception is assessed via the 1,000-question MASK dataset; sycophancy via Anthropic's answer-sycophancy evaluation; political soft bias via an internal paired-comparison framework. Dual-use knowledge is evaluated on WMDP, BioLP-Bench, and VCT (text-only portions); agentic cyber uses CyBench via the UK AISI Inspect framework; third-party testing assessed end-to-end offensive cyber capability.
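The rate-style metrics described above (answer rate on harmful queries, attack success rate) can be read as simple proportions over graded responses. The sketch below assumes exactly that; the summary does not give the card's precise formula, and the `GradedResponse` type, the setting labels, and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class GradedResponse(NamedTuple):
    setting: str    # e.g. "standard", "user_jailbreak", "system_jailbreak"
    complied: bool  # grader verdict: True if the model fulfilled the harmful request


def answer_rate_by_setting(results: Iterable[GradedResponse]) -> dict[str, float]:
    """Fraction of harmful queries answered (not refused), grouped by setting."""
    totals: dict[str, int] = defaultdict(int)
    complied: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.setting] += 1
        complied[r.setting] += int(r.complied)
    return {setting: complied[setting] / totals[setting] for setting in totals}


if __name__ == "__main__":
    # Toy verdicts, invented purely to exercise the function.
    toy = [
        GradedResponse("standard", False),
        GradedResponse("standard", False),
        GradedResponse("standard", False),
        GradedResponse("standard", True),
        GradedResponse("user_jailbreak", True),
        GradedResponse("user_jailbreak", False),
    ]
    print(answer_rate_by_setting(toy))
```

A lower value is better for every metric computed this way, since each counts cases where the model did what the attacker or harmful query wanted.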

04

Safety testing

The card identifies biology as "the area of highest concern," with Grok 4 achieving performance that "significantly exceed[s] human expert baselines" on both BioLP-Bench and VCT. Strong chemistry capabilities are also noted. Radiological and nuclear capabilities are not evaluated; the card argues restricted information access and material controls make AI uplift unlikely. "Third-party testing shows that Grok 4's end-to-end offensive cyber capabilities remain below the level of a human professional." The MASK dishonesty rate is 0.43 and the card states xAI is "exploring further mitigations to reduce propensity for deception."

05

Mitigations

A system prompt encodes xAI's basic refusal policy, instructing the model to decline queries with clear harmful intent, including CBRN weapons, CSAM, violent crimes, and hacking. Model-based input filters for both API and Web surfaces reject queries about biological and chemical weapons, self-harm, and CSAM. Narrow, topically focused filters are additionally deployed against bioweapons- and chemical-weapons-related abuse given Grok 4's high capability scores. For cyber risks, the refusal policy alone is deemed sufficient; radiological and nuclear risks are addressed only by the system prompt. System prompts for consumer products are published at github.com/xai-org/grok-prompts.
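The layering described above, input filters first and a policy-bearing system prompt second, can be pictured with a minimal Python sketch. Everything here is a hypothetical stand-in: the card describes model-based filters, whereas this example uses a trivial substring check, and `handle_query`, `make_keyword_filter`, and the `REJECTED` marker are invented names, not xAI's implementation.

```python
from typing import Callable, Sequence

# Hypothetical type: a filter returns True when the query should be rejected.
InputFilter = Callable[[str], bool]

REJECTED = "REJECTED_BY_INPUT_FILTER"  # placeholder marker, not a real API value


def make_keyword_filter(blocked_terms: Sequence[str]) -> InputFilter:
    """Build a toy filter; a real deployment would call a trained classifier."""
    lowered = [term.lower() for term in blocked_terms]

    def flag(query: str) -> bool:
        text = query.lower()
        return any(term in text for term in lowered)

    return flag


def handle_query(
    query: str,
    filters: Sequence[InputFilter],
    generate: Callable[[str, str], str],
    system_prompt: str,
) -> str:
    """Reject if any input filter fires; otherwise call the model with the system prompt."""
    for is_blocked in filters:
        if is_blocked(query):
            return REJECTED
    return generate(system_prompt, query)


if __name__ == "__main__":
    toy_filter = make_keyword_filter(["blocked-topic"])
    echo_model = lambda system, user: f"[{system}] {user}"  # stand-in for the model
    print(handle_query("hello", [toy_filter], echo_model, "refusal policy"))
    print(handle_query("about BLOCKED-TOPIC", [toy_filter], echo_model, "refusal policy"))
```

In this arrangement a query in a filtered category never reaches the model, while categories covered only by the refusal policy depend on the model following its system prompt.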

06

Deployment and access

Grok 4 is available via Grok 4 Web (consumer) and Grok 4 API (enterprise), including in the EU. Grok 4 Web does not accept custom system prompts from users. License terms and specific API access tiers or pricing are not disclosed in this card.

07

Limitations

Radiological and nuclear dual-use capabilities are not evaluated, with the card relying on external controls rather than direct measurement. The MASK deception rate remains 0.43 under current mitigations, and the card acknowledges ongoing work. The AgentHarm answer rate of 0.14 in the no-attack agentic setting indicates residual willingness to fulfill harmful agentic tasks. Political soft bias averages 0.36, which the card treats as an area of continuing improvement.

08

What's new

The card describes Grok 4's dual-use capabilities as "a large step up from prior generation models," particularly in biology and cyber. No named predecessor version, explicit changelog, or quantitative comparison table against a prior Grok release is provided in this document.

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 535ef6d35fa6).

Extracted Evaluations (15 results)

Benchmark             Category  State   Score  Variant           Source
Cybench               agent     scored  0.4    unguided          self-reported
WMDP Bio              other     scored  0.9    -                 self-reported
WMDP Chem             other     scored  0.8    -                 self-reported
WMDP Cyber            other     scored  0.8    -                 self-reported
VCT                   other     scored  0.6    -                 self-reported
BioLP-Bench           other     scored  0.5    -                 self-reported
MASK                  other     scored  0.4    -                 self-reported
Soft Bias (Internal)  other     scored  0.4    -                 self-reported
AgentHarm             other     scored  0.1    no attack         self-reported
MakeMeSay             other     scored  0.1    -                 self-reported
Sycophancy            other     scored  0.1    -                 self-reported
AgentDojo             other     scored  0.0    -                 self-reported
Unjustified-refusals  other     scored  0.0    System Jailbreak  self-reported
Unjustified-refusals  other     scored  0.0    User Jailbreak    self-reported
Unjustified-refusals  other     scored  0.0    standard          self-reported
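The scores in the table above look like the figures quoted in the brief, rounded to one decimal place. That is an inference, not something the page states. The sketch below checks it for the benchmarks where the brief quotes a single number, or a number labelled as the API result; using the API figure for BioLP-Bench and VCT is an assumption made for this check.

```python
# (figure quoted in the brief as a fraction, score shown in the table)
PROSE_VS_TABLE = {
    "Cybench": (0.43, 0.4),
    "WMDP Cyber": (0.79, 0.8),
    "VCT": (0.60, 0.6),           # API figure; the Web figure is 0.71
    "BioLP-Bench": (0.47, 0.5),   # API figure; the Web figure is 0.44
    "MASK": (0.43, 0.4),
    "Soft Bias (Internal)": (0.36, 0.4),
    "AgentHarm": (0.14, 0.1),
}


def mismatches(pairs: dict[str, tuple[float, float]]) -> list[str]:
    """Return benchmarks whose table score is not the one-decimal rounding of the prose figure."""
    return [name for name, (prose, table) in pairs.items() if round(prose, 1) != table]


if __name__ == "__main__":
    print(mismatches(PROSE_VS_TABLE))
```

An empty result means every checked row agrees with one-decimal rounding, so the brief's figures remain the more precise source.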