Llama 4 Model Card

Summary

A 611-word brief of a 3,165-word document. Published by Meta AI. Version dated Mar 31, 2026.

What this is

Llama 4 is a collection of natively multimodal AI models developed by Meta, released April 5, 2025, succeeding the Llama 3.x series. The release comprises two models: Llama 4 Scout (17B activated / 109B total parameters, 16 experts) and Llama 4 Maverick (17B activated / 400B total parameters, 128 experts). Both use a mixture-of-experts (MoE) auto-regressive architecture with early fusion for native multimodality. The models are designed for commercial and research use across text, image, and code tasks in multiple languages.
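The gap between activated and total parameters is easier to compare as a ratio. The sketch below is illustrative arithmetic on the counts quoted above; the dictionary layout and the helper name `active_fraction` are assumptions made for this example, not anything defined by Meta.

```python
# Parameter counts in billions, as quoted in the summary above.
MODELS = {
    "Llama 4 Scout": {"active_b": 17, "total_b": 109, "experts": 16},
    "Llama 4 Maverick": {"active_b": 17, "total_b": 400, "experts": 128},
}


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the total parameters that are activated (hypothetical helper)."""
    return active_b / total_b


for name, spec in MODELS.items():
    frac = active_fraction(spec["active_b"], spec["total_b"])
    print(f"{name}: {spec['experts']} experts, {frac:.2%} of parameters activated")
```

Under these figures Scout activates roughly 16% of its weights and Maverick a little over 4%, which is why both models share the same 17B activated size despite very different totals.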

Capabilities

Both models accept multilingual text and image input and produce multilingual text and code output across 12 supported languages. Scout offers a 10M-token context window; Maverick offers 1M. On instruction-tuned benchmarks, Maverick scores 69.8 on GPQA Diamond, 80.5 on MMLU Pro, 73.4 on MMMU, and 43.4 pass@1 on LiveCodeBench; Scout scores 57.2 on GPQA Diamond, 74.3 on MMLU Pro, and 69.4 on MMMU. Both models score 94.4 ANLS on DocVQA (test). Knowledge cutoff is August 2024.

Evaluation methodology

All reported evaluations were conducted on bf16 models, not quantized checkpoints. Meta built dedicated adversarial evaluation datasets for common use cases (chatbot, visual QA) and evaluated systems composed of Llama models paired with Llama Guard 3 to filter inputs and outputs. Capability-specific benchmarks cover long context, multilingual, coding, and memorization. Recurring red-teaming exercises involve experts in cybersecurity, adversarial machine learning, integrity, and multilingual content from specific geographic markets.
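The idea of a system that pairs a model with a classifier filtering both inputs and outputs can be sketched as follows. This is a minimal illustration only: `guarded_generate`, `toy_is_safe`, `toy_generate`, and the refusal string are hypothetical stand-ins, and nothing here reflects the actual Llama Guard interface or Meta's evaluation harness.

```python
from typing import Callable

# Hypothetical type aliases: a safety check returns True when text is
# acceptable, and a generator maps a prompt to a completion.
SafetyCheck = Callable[[str], bool]
Generate = Callable[[str], str]

REFUSAL = "Sorry, I can't help with that."  # placeholder refusal message


def guarded_generate(prompt: str, generate: Generate, is_safe: SafetyCheck) -> str:
    """Filter the input, generate, then filter the output."""
    if not is_safe(prompt):
        return REFUSAL
    completion = generate(prompt)
    if not is_safe(completion):
        return REFUSAL
    return completion


# Toy stand-ins so the sketch runs without any model.
def toy_is_safe(text: str) -> bool:
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


print(guarded_generate("hello", toy_generate, toy_is_safe))
print(guarded_generate("a forbidden request", toy_generate, toy_is_safe))
```

In a real deployment the two stand-in functions would be replaced by calls to the model and to a safety classifier; the control flow is the part this sketch is meant to show.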

Safety testing

Meta evaluated three critical risk areas. For CBRNE, expert-designed evaluations assessed whether Llama 4 could "meaningfully increase the capabilities of malicious actors to plan or carry out attacks" using chemical, biological, radiological, nuclear, or explosive materials. For child safety, a dedicated expert team assessed model outputs, with benchmarks expanded to cover multi-image and multilingual capabilities. For cyber, threat modeling and capability challenges assessed whether Llama 4 could automate attacks or exploit vulnerabilities; Meta reports "that Llama 4 models do not introduce risk plausibly enabling catastrophic cyber outcomes."

Mitigations

Model-level safeguards include safety fine-tuning using human-generated and synthetic data, with LLM-based classifiers for data quality control. Meta reduced refusals to benign prompts and adjusted the models' tone to remove preachy or moralizing language. At the system level, Meta provides Llama Guard, Prompt Guard, and Code Shield as open-source tools that developers are directed to deploy alongside the model. The reference implementation includes these protections by default. No ASL or FSF tier is referenced in this document.

Deployment and access

Llama 4 is released under the Llama 4 Community License Agreement, a custom commercial license. Scout is released as BF16 weights and supports on-the-fly int4 quantization to fit on a single H100 GPU; Maverick is released in both BF16 and FP8, with the FP8 weights fitting on a single H100 DGX host. Prohibited uses include violating applicable law or the Acceptable Use Policy, and deploying the models in languages or for capabilities beyond those explicitly supported. Developers who extend use to additional languages or to more than 5 input images bear responsibility for safety testing.
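A back-of-the-envelope check shows why these precisions matter for the stated hardware targets. The sketch below counts weight bytes only and ignores the KV cache, activations, and runtime overhead; the 80 GB per-GPU figure and the 8-GPU DGX layout are assumptions about the hardware variant, and `weight_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

H100_GB = 80               # assumption: 80 GB H100 variant
DGX_H100_GB = 8 * H100_GB  # assumption: DGX host with 8 such GPUs


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


CASES = [
    ("Scout", 109, "bf16", H100_GB),
    ("Scout", 109, "int4", H100_GB),
    ("Maverick", 400, "bf16", DGX_H100_GB),
    ("Maverick", 400, "fp8", DGX_H100_GB),
]

for model, params_b, dtype, budget_gb in CASES:
    size = weight_gb(params_b, dtype)
    verdict = "fits" if size <= budget_gb else "does not fit"
    print(f"{model} {dtype}: ~{size:.1f} GB of weights, {verdict} in {budget_gb} GB")
```

Under these assumptions, Scout's weights fit in a single GPU's memory only at int4, and Maverick's fit in one host only at FP8, which is consistent with the deployment notes above.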

Limitations

Meta states that testing "has not covered, nor could it cover, all scenarios" and that Llama 4's "potential outputs cannot be predicted in advance." The model may "produce inaccurate or other objectionable responses to user prompts." Image understanding has been tested for up to 5 input images; use beyond that is at the developer's risk. Pre-training covered approximately 200 languages, but only 12 are officially supported; Meta places responsibility for safe use in additional languages on developers.

What's new

Llama 4 introduces the MoE architecture and native multimodality to the Llama family, both absent from Llama 3.x. Scout's 10M-token context window and Maverick's 1M-token window represent a substantial increase over the 128K context available in Llama 3.1 405B. No version changelog or incremental delta entries are provided in this document beyond the transition from the Llama 3 generation.
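To put a number on that increase, the ratios can be computed directly. This is illustrative arithmetic only; it reads "K" and "M" as decimal thousands and millions, which is an assumption, and `context_ratio` is a hypothetical helper.

```python
# Token counts, reading "K" and "M" as decimal units (an assumption;
# the source does not say whether binary units are meant).
LLAMA_3_1_CONTEXT = 128_000
LLAMA_4_CONTEXT = {"Scout": 10_000_000, "Maverick": 1_000_000}


def context_ratio(new_tokens: int, old_tokens: int) -> float:
    """How many times larger the new context window is."""
    return new_tokens / old_tokens


for model, tokens in LLAMA_4_CONTEXT.items():
    print(f"{model}: {context_ratio(tokens, LLAMA_3_1_CONTEXT):.1f}x the 128K window")
```

With decimal units this works out to roughly 78x for Scout and roughly 8x for Maverick.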

Extracted Evaluations (59 results)

0/59 rows fully reproducible (0%)

Benchmark	Category	State	Score	Setup	Source
	coding	scored	77.6% pass at 1	3-shot; pretrained; missing: method; missing: language	self-reported
	coding	scored	74.4% pass at 1	3-shot; pretrained; missing: method; missing: language	self-reported
	coding	scored	67.8% pass at 1	3-shot; pretrained; missing: method; missing: language	self-reported
	coding	scored	66.4% pass at 1	3-shot; pretrained; missing: method; missing: language	self-reported
/ 2024_10_2025_02	coding	scored	43.4 pass at 1	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ 2024_10_2025_02	coding	scored	33.3 pass at 1	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ 2024_10_2025_02	coding	scored	32.8 pass at 1	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ 2024_10_2025_02	coding	scored	27.7 pass at 1	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	knowledge	scored	85.5% accuracy	5-shot; pretrained; missing: method; missing: language	self-reported
	knowledge	scored	85.2% accuracy	5-shot; pretrained; missing: method; missing: language	self-reported
/ pro	knowledge	scored	80.5% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	knowledge	scored	79.6% accuracy	5-shot; pretrained; missing: method; missing: language	self-reported
	knowledge	scored	79.3% accuracy	5-shot; pretrained; missing: method; missing: language	self-reported
/ pro	knowledge	scored	74.3% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ pro	knowledge	scored	73.4% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ pro	knowledge	scored	68.9% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ pro	knowledge	scored	62.9% exact match	5-shot; pretrained; missing: method; missing: language	self-reported
/ pro	knowledge	scored	61.6% exact match	5-shot; pretrained; missing: method; missing: language	self-reported
/ pro	knowledge	scored	58.2% exact match	5-shot; pretrained; missing: method; missing: language	self-reported
/ pro	knowledge	scored	53.8% exact match	5-shot; pretrained; missing: method; missing: language	self-reported
	math	scored	61.2% exact match	4-shot; pretrained; missing: method; missing: language	self-reported
	math	scored	53.5% exact match	4-shot; pretrained; missing: method; missing: language	self-reported
	math	scored	50.3% exact match	4-shot; pretrained; missing: method; missing: language	self-reported
	math	scored	41.6% exact match	4-shot; pretrained; missing: method; missing: language	self-reported
	multilingual	scored	92.3% exact match	0-shot; Average; instruction-tuned; missing: method	self-reported
	multilingual	scored	91.6% exact match	0-shot; Average; instruction-tuned; missing: method	self-reported
	multilingual	scored	91.1% exact match	0-shot; Average; instruction-tuned; missing: method	self-reported
	multilingual	scored	90.6% exact match	0-shot; Average; instruction-tuned; missing: method	self-reported
/ test	multimodal	scored	94.4 anls	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ test	multimodal	scored	94.4 anls	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	multimodal	scored	91.6 anls	0-shot; pretrained; missing: method; missing: language	self-reported
	multimodal	scored	90.0 relaxed accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	multimodal	scored	89.4 anls	0-shot; pretrained; missing: method; missing: language	self-reported
	multimodal	scored	88.8 relaxed accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	multimodal	scored	85.3 relaxed accuracy	0-shot; pretrained; missing: method; missing: language	self-reported
	multimodal	scored	83.4 relaxed accuracy	0-shot; pretrained; missing: method; missing: language	self-reported
	multimodal	scored	73.7 accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	multimodal	scored	70.7 accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
MTOB/ half_book	other	scored	54.0 chrf	eng->kgv; instruction-tuned; missing: shot count; missing: method	self-reported
MTOB/ full_book	other	scored	50.8 chrf	eng->kgv; instruction-tuned; missing: shot count; missing: method	self-reported
MTOB/ full_book	other	scored	46.7 chrf	kgv->eng; instruction-tuned; missing: shot count; missing: method	self-reported
MTOB/ half_book	other	scored	46.4 chrf	kgv->eng; instruction-tuned; missing: shot count; missing: method	self-reported
MTOB/ half_book	other	scored	42.2 chrf	eng->kgv; instruction-tuned; missing: shot count; missing: method	self-reported
MTOB/ full_book	other	scored	39.7 chrf	eng->kgv; instruction-tuned; missing: shot count; missing: method	self-reported
MTOB/ half_book	other	scored	36.6 chrf	kgv->eng; instruction-tuned; missing: shot count; missing: method	self-reported
MTOB/ full_book	other	scored	36.3 chrf	kgv->eng; instruction-tuned; missing: shot count; missing: method	self-reported
TydiQA	other	scored	34.3 f1	1-shot; Average; pretrained; missing: method	self-reported
TydiQA	other	scored	31.7 f1	1-shot; Average; pretrained; missing: method	self-reported
TydiQA	other	scored	31.5 f1	1-shot; Average; pretrained; missing: method	self-reported
TydiQA	other	scored	29.9 f1	1-shot; Average; pretrained; missing: method	self-reported
MLCommons Proof of Concept	other	cited	—	missing: shot count; missing: method; missing: language; missing: training state	self-reported
/ diamond	reasoning	scored	69.8% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ diamond	reasoning	scored	57.2% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ diamond	reasoning	scored	50.5% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ diamond	reasoning	scored	49.0% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	vision	scored	73.4% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
	vision	scored	69.4% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ pro	vision	scored	59.6% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
/ pro	vision	scored	52.2% accuracy	0-shot; instruction-tuned; missing: method; missing: language	self-reported
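The reproducibility count reported for this table can be approximated by scanning each Setup cell for "missing:" tags. The sketch below is an assumption about how such a tally might work, not the page's actual method; `missing_fields` and `is_fully_reproducible` are hypothetical helpers, and only three sample rows are included.

```python
import re

# Field names that appear after "missing:" in the table's Setup column.
MISSING_TAG = re.compile(r"missing: (shot count|training state|method|language)")

# Setup cells from three table rows. The tags may run together or be
# separated by punctuation; the pattern handles both.
SAMPLE_ROWS = [
    ("MTOB/ half_book", "eng->kgvinstruction-tunedmissing: shot countmissing: method"),
    ("TydiQA", "1-shotAveragepretrainedmissing: method"),
    ("MLCommons Proof of Concept",
     "missing: shot countmissing: methodmissing: languagemissing: training state"),
]


def missing_fields(setup: str) -> list[str]:
    """Return the setup fields flagged as missing, in order of appearance."""
    return MISSING_TAG.findall(setup)


def is_fully_reproducible(setup: str) -> bool:
    """Treat a row as reproducible only when no field is flagged missing."""
    return not missing_fields(setup)


reproducible = sum(is_fully_reproducible(setup) for _, setup in SAMPLE_ROWS)
print(f"{reproducible}/{len(SAMPLE_ROWS)} sample rows fully reproducible")
for name, setup in SAMPLE_ROWS:
    print(f"{name}: missing {', '.join(missing_fields(setup))}")
```

Every row in the table carries at least one such tag (most commonly the method), which matches the reported 0/59.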