Model Card Explorer

Summary

Llama 3.3 Model Card

A 517-word brief of a 2,572-word document. Published by Meta AI. Version dated Mar 31, 2026.

What this is

Llama 3.3 is a 70B-parameter multilingual large language model released by Meta on December 6, 2024. It is an auto-regressive transformer available in pretrained and instruction-tuned variants, trained on approximately 15 trillion tokens with a knowledge cutoff of December 2023. The instruction-tuned version is optimized for multilingual dialogue and is positioned as a successor to Llama 3.1 70B Instruct.

Capabilities

Llama 3.3 70B Instruct scores 86.0 on MMLU, 50.5 on GPQA Diamond, 88.4 pass@1 on HumanEval, 77.0 on MATH, and 91.1 on MGSM multilingual math. It accepts and outputs multilingual text and code across 8 supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The context window is 128k tokens, and the model uses Grouped-Query Attention (GQA) to improve inference scalability.
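To give a sense of why Grouped-Query Attention matters at a 128k-token context, the sketch below estimates the key/value cache size for a single sequence. The layer count, head counts, head dimension, and 16-bit cache precision are illustrative assumptions, not figures taken from this summary, and the function name is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Assumed architecture values (not stated in the summary):
# 80 layers, 64 query heads, 8 key/value heads under GQA, head dimension 128.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 80, 64, 8, 128
CONTEXT = 128 * 1024  # 128k tokens

GIB = 1024 ** 3
# Without grouping, every query head keeps its own key/value head.
mha_gib = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT) / GIB
# With grouping, several query heads share one key/value head.
gqa_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT) / GIB

print(f"one KV head per query head: {mha_gib:.0f} GiB")
print(f"grouped KV heads:           {gqa_gib:.0f} GiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads (8x here), which is the kind of inference saving the card attributes to GQA.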

Evaluation methodology

Meta evaluated the model on common use-case benchmarks covering chatbot, coding assistant, and tool-call scenarios using adversarial datasets, with Llama Guard 3 filtering both inputs and outputs. Capability-specific benchmarks addressed long context, multilingual tasks, tool calls, coding, and memorization. Red teaming was conducted on a recurring basis, with findings fed back into benchmark refinement and safety tuning datasets.
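The input-and-output filtering setup described above can be sketched as a thin wrapper around a generation call. This is a minimal illustration, not Meta's evaluation harness: `guarded_generate`, `toy_generate`, and `toy_is_unsafe` are hypothetical names, and the toy classifier merely stands in for a safety model such as Llama Guard 3.

```python
from typing import Callable

REFUSAL = "I can't help with that request."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    refusal: str = REFUSAL,
) -> str:
    """Run a model call with a safety check before and after generation."""
    # Input filter: refuse before the model ever sees a flagged prompt.
    if is_unsafe(prompt):
        return refusal
    response = generate(prompt)
    # Output filter: refuse if the model's own response is flagged.
    if is_unsafe(response):
        return refusal
    return response


# Toy stand-ins (assumptions for the demo, not real model or classifier calls).
def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


def toy_is_unsafe(text: str) -> bool:
    return "forbidden" in text.lower()


print(guarded_generate("Hello", toy_generate, toy_is_unsafe))
print(guarded_generate("Something FORBIDDEN", toy_generate, toy_is_unsafe))
```

Passing the model and the classifier in as callables keeps the wrapper independent of any particular serving stack, so the same structure applies whether the filter is a local model or a remote service.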

Safety testing

Red teaming was performed by experts in cybersecurity, adversarial machine learning, responsible AI, and integrity, alongside multilingual content specialists with market-specific backgrounds. CBRNE uplift testing assessed whether the model could "meaningfully increase the capabilities of malicious actors to plan or carry out attacks" using chemical, biological, radiological, nuclear, or explosive materials. Child safety assessments used objective-based methodologies across multiple attack vectors and supported languages. A cyber attack uplift study evaluated the model as an autonomous agent in offensive operations, specifically ransomware attack simulations without human intervention.

Mitigations

The instruction-tuned model uses supervised fine-tuning and reinforcement learning with human feedback (RLHF). Safety fine-tuning combines human-generated and synthetic data, with LLM-based classifiers selecting high-quality prompts and responses; refusal tone guidelines were explicitly incorporated. Meta provides Llama Guard 3, Prompt Guard, and Code Shield as system-level safeguards, included by default in all reference implementations. The card states the model "is not designed to be deployed in isolation" and that developers are responsible for adding guardrails appropriate to their use case.

Deployment and access

Llama 3.3 is released under a custom Llama 3.3 Community License Agreement permitting commercial and research use. The license covers use of model outputs for synthetic data generation and distillation to improve other models. Use that violates applicable laws or the Acceptable Use Policy, and use in unsupported languages without appropriate safeguards, are explicitly out of scope.

Limitations

Meta states that "testing conducted to date has not covered, nor could it cover, all scenarios" and that the model "may in some instances produce inaccurate, biased or other objectionable responses to user prompts." Output behavior cannot be predicted in advance across all deployment contexts. Safety and helpfulness thresholds are claimed only for the 8 officially supported languages; use in other languages is explicitly discouraged without fine-tuning and system controls.
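One way a deployment might act on the supported-language limitation is a simple gate in front of the model. The sketch below is an assumption about how a developer could do this, not something the card prescribes: `route_request` and its return labels are hypothetical, and the language code is expected to come from the caller (for example, from a separate language-detection step).

```python
# ISO 639-1 codes for the 8 supported languages listed in the summary.
SUPPORTED_LANGUAGES = {
    "en": "English",
    "de": "German",
    "fr": "French",
    "it": "Italian",
    "pt": "Portuguese",
    "hi": "Hindi",
    "es": "Spanish",
    "th": "Thai",
}


def route_request(language_code: str, has_extra_safeguards: bool = False) -> str:
    """Decide how to handle a request based on its language code."""
    # Normalize tags such as "pt-BR" or "pt_BR" down to the base language.
    base = language_code.strip().lower().replace("_", "-").split("-")[0]
    if base in SUPPORTED_LANGUAGES:
        return "allow"
    # Unsupported language: proceed only if the deployment has added its own
    # fine-tuning and system controls, as the card asks of developers.
    return "allow_with_safeguards" if has_extra_safeguards else "block"


print(route_request("pt-BR"))
print(route_request("ja"))
print(route_request("ja", has_extra_safeguards=True))
```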

What's new

The card does not include an explicit changelog relative to a prior Llama 3.3 version. Compared to Llama 3.1 70B Instruct, Llama 3.3 70B Instruct shows improvements on MMLU Pro (68.9 vs. 66.4), IFEval (92.1 vs. 87.5), HumanEval (88.4 vs. 80.5), MATH (77.0 vs. 68.0), and MGSM (91.1 vs. 86.9).
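The comparison above can be reduced to per-benchmark differences with a few lines of arithmetic. The scores are the ones quoted in the paragraph; the variable and function names are illustrative.

```python
# (Llama 3.3 70B Instruct, Llama 3.1 70B Instruct) scores quoted above.
SCORES = {
    "MMLU Pro": (68.9, 66.4),
    "IFEval": (92.1, 87.5),
    "HumanEval": (88.4, 80.5),
    "MATH": (77.0, 68.0),
    "MGSM": (91.1, 86.9),
}


def score_deltas(scores):
    """Return the point difference (new minus old) for each benchmark."""
    # Round to one decimal to match the precision of the reported scores.
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


deltas = score_deltas(SCORES)
largest = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name}: +{delta:.1f} points")
print(f"largest gain: {largest}")
```

These are absolute point differences on self-reported scores, so they indicate direction and rough size rather than statistically tested improvements.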

Benchmark	Category	State	Score	Setup	Source
HumanEval	coding	scored	89.0% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
/ base	coding	scored	88.6% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
HumanEval	coding	scored	88.4% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
/ base	coding	scored	87.6% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
/ base	coding	scored	86.0% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
/ v2	coding	scored	81.1 overall ast summary/macro avg/valid	0-shot, EN, instruction-tuned; missing: method	self-reported
HumanEval	coding	scored	80.5% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
/ v2	coding	scored	77.5 overall ast summary/macro avg/valid	0-shot, EN, instruction-tuned; missing: method	self-reported
/ v2	coding	scored	77.3 overall ast summary/macro avg/valid	0-shot, EN, instruction-tuned; missing: method	self-reported
/ base	coding	scored	72.8% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
HumanEval	coding	scored	72.6% pass@1	0-shot, EN, instruction-tuned; missing: method	self-reported
/ v2	coding	scored	65.4 overall ast summary/macro avg/valid	0-shot, EN, instruction-tuned; missing: method	self-reported
IFEval	instruction_following	scored	92.1%	EN, instruction-tuned; missing: shot count, method	self-reported
IFEval	instruction_following	scored	88.6%	EN, instruction-tuned; missing: shot count, method	self-reported
IFEval	instruction_following	scored	87.5%	EN, instruction-tuned; missing: shot count, method	self-reported
IFEval	instruction_following	scored	80.4%	EN, instruction-tuned; missing: shot count, method	self-reported
MMLU	knowledge	scored	88.6% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported
MMLU (⚠ 7 others disagree)	knowledge	scored	86.0% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported
MMLU	knowledge	scored	86.0% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported
MMLU / pro	knowledge	scored	73.3% accuracy	5-shot, CoT, EN, instruction-tuned	self-reported
MMLU	knowledge	scored	73.0% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported
MMLU / pro	knowledge	scored	68.9% accuracy	5-shot, CoT, EN, instruction-tuned	self-reported
MMLU / pro (⚠ 7 others disagree)	knowledge	scored	66.4% accuracy	5-shot, CoT, EN, instruction-tuned	self-reported
MMLU / pro	knowledge	scored	48.3% accuracy	5-shot, CoT, EN, instruction-tuned	self-reported
MATH	math	scored	77.0% sympy intersection score	0-shot, CoT, EN, instruction-tuned	self-reported
MATH	math	scored	73.8% sympy intersection score	0-shot, CoT, EN, instruction-tuned	self-reported
MATH	math	scored	68.0% sympy intersection score	0-shot, CoT, EN, instruction-tuned	self-reported
MATH	math	scored	51.9% sympy intersection score	0-shot, CoT, EN, instruction-tuned	self-reported
MGSM	multilingual	scored	91.6% exact match	0-shot, Average, instruction-tuned; missing: method	self-reported
MGSM	multilingual	scored	91.1% exact match	0-shot, Average, instruction-tuned; missing: method	self-reported
MGSM	multilingual	scored	86.9% exact match	0-shot, Average, instruction-tuned; missing: method	self-reported
MGSM	multilingual	scored	68.9% exact match	0-shot, Average, instruction-tuned; missing: method	self-reported
MLCommons Proof of Concept	other	cited	—	missing: shot count, method, language, training state	self-reported
GPQA / diamond	reasoning	scored	50.5% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported
GPQA / diamond	reasoning	scored	49.0% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported
GPQA / diamond	reasoning	scored	48.0% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported
GPQA / diamond	reasoning	scored	31.8% accuracy	0-shot, CoT, EN, instruction-tuned	self-reported

Extracted Evaluations (37 results)