Llama 3.3 Model Card

Model card · 2,572 words · 11 min read · Mar 31, 2026
Summary

A 517-word brief of a 2,572-word document. Published by Meta AI. Version dated Mar 31, 2026.
01

What this is

Llama 3.3 is a 70B-parameter multilingual large language model released by Meta on December 6, 2024. It is an auto-regressive transformer available in pretrained and instruction-tuned variants, trained on approximately 15 trillion tokens with a knowledge cutoff of December 2023. The instruction-tuned version is optimized for multilingual dialogue and is positioned as a successor to Llama 3.1 70B Instruct.

02

Capabilities

Llama 3.3 70B Instruct scores 86.0 on MMLU, 50.5 on GPQA Diamond, 88.4 pass@1 on HumanEval, 77.0 on MATH, and 91.1 on MGSM multilingual math. It accepts and outputs multilingual text and code across 8 supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The context window is 128k tokens, with Grouped-Query Attention for inference scalability.
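To see why Grouped-Query Attention matters at a 128k-token context, a rough back-of-the-envelope calculation of key/value cache size may help. The sketch below is illustrative only: the layer count, head counts, and head dimension are assumptions based on the commonly reported Llama 3 70B configuration and are not stated in this summary, and 16-bit cache values are likewise assumed.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# ASSUMED architecture values (not given in the summary above).
N_LAYERS = 80
N_QUERY_HEADS = 64
N_KV_HEADS = 8            # with GQA, several query heads share one K/V head
HEAD_DIM = 128
CONTEXT_TOKENS = 128 * 1024  # "128k" read as 131,072 tokens

GIB = 1024 ** 3
gqa_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS) / GIB
# Hypothetical comparison: every query head keeps its own K/V head.
mha_gib = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, CONTEXT_TOKENS) / GIB

print(f"GQA cache: {gqa_gib:.0f} GiB, full multi-head cache: {mha_gib:.0f} GiB")
# prints "GQA cache: 40 GiB, full multi-head cache: 320 GiB"
```

Under these assumptions, sharing key/value heads cuts the per-sequence cache by the ratio of query heads to key/value heads (8x here), which is the sense in which GQA helps inference scale.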

03

Evaluation methodology

Meta evaluated the model on common use-case benchmarks covering chatbot, coding assistant, and tool-call scenarios using adversarial datasets, with Llama Guard 3 filtering both inputs and outputs. Capability-specific benchmarks addressed long context, multilingual tasks, tool calls, coding, and memorization. Red teaming was conducted on a recurring basis, with findings fed back into benchmark refinement and safety tuning datasets.
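The input-and-output filtering arrangement described above can be sketched as a thin wrapper around a model call. This is a minimal illustration, not Meta's implementation: `guarded_generate`, `toy_classifier`, `toy_model`, and the refusal string are hypothetical names invented here, and the keyword check merely stands in for a real safety classifier such as Llama Guard 3.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal message


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> str:
    """Filter the prompt, call the model, then filter the response."""
    if is_unsafe(prompt):       # input-side check: the model is never called
        return REFUSAL
    response = generate(prompt)
    if is_unsafe(response):     # output-side check: the response is withheld
        return REFUSAL
    return response


# Toy stand-ins so the sketch runs without any model or classifier.
def toy_classifier(text: str) -> bool:
    return "attack" in text.lower()


def toy_model(prompt: str) -> str:
    return f"echo: {prompt}"


print(guarded_generate("hello", toy_model, toy_classifier))
# prints "echo: hello"
print(guarded_generate("plan an attack", toy_model, toy_classifier))
# prints "Sorry, I can't help with that."
```

Passing the model and classifier in as callables keeps the wrapper independent of any particular serving stack, so the same harness could in principle run over an adversarial dataset with either filter swapped out.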

04

Safety testing

Red teaming was performed by experts in cybersecurity, adversarial machine learning, responsible AI, and integrity, alongside multilingual content specialists with market-specific backgrounds. CBRNE uplift testing assessed whether the model could "meaningfully increase the capabilities of malicious actors to plan or carry out attacks" using chemical, biological, radiological, nuclear, or explosive materials. Child safety assessments used objective-based methodologies across multiple attack vectors and supported languages. A cyber attack uplift study evaluated the model as an autonomous agent in offensive operations, specifically ransomware attack simulations without human intervention.

05

Mitigations

The instruction-tuned model uses supervised fine-tuning and reinforcement learning with human feedback (RLHF). Safety fine-tuning combines human-generated and synthetic data, with LLM-based classifiers selecting high-quality prompts and responses; refusal tone guidelines were explicitly incorporated. Meta provides Llama Guard 3, Prompt Guard, and Code Shield as system-level safeguards, included by default in all reference implementations. The card states the model "is not designed to be deployed in isolation" and that developers are responsible for adding guardrails appropriate to their use case.

06

Deployment and access

Llama 3.3 is released under a custom Llama 3.3 Community License Agreement permitting commercial and research use. The license covers use of model outputs for synthetic data generation and distillation to improve other models. Explicitly out of scope are uses that violate applicable laws or the Acceptable Use Policy, and use in unsupported languages without appropriate safeguards.

07

Limitations

Meta states that "testing conducted to date has not covered, nor could it cover, all scenarios" and that the model "may in some instances produce inaccurate, biased or other objectionable responses to user prompts." Output behavior cannot be predicted in advance across all deployment contexts. The model is tuned to meet safety and helpfulness thresholds only in the 8 officially supported languages; use in other languages is explicitly discouraged without fine-tuning and system controls.

08

What's new

The card does not include an explicit changelog relative to a prior Llama 3.3 version. Compared to Llama 3.1 70B Instruct, Llama 3.3 70B Instruct shows improvements on MMLU Pro (68.9 vs. 66.4), IFEval (92.1 vs. 87.5), HumanEval (88.4 vs. 80.5), MATH (77.0 vs. 68.0), and MGSM (91.1 vs. 86.9).
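The size of each gain is easier to compare as a difference in points. The short script below uses only the score pairs quoted in the paragraph above; the variable names are illustrative.

```python
# (Llama 3.3 70B Instruct, Llama 3.1 70B Instruct) scores from the paragraph above.
scores = {
    "MMLU Pro": (68.9, 66.4),
    "IFEval": (92.1, 87.5),
    "HumanEval": (88.4, 80.5),
    "MATH": (77.0, 68.0),
    "MGSM": (91.1, 86.9),
}

# Point difference per benchmark, rounded to one decimal to hide float noise.
deltas = {name: round(new - old, 1) for name, (new, old) in scores.items()}
largest = max(deltas, key=deltas.get)

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} +{delta:.1f}")
print(f"Largest gain: {largest}")
```

On these figures the largest gains are on MATH (+9.0) and HumanEval (+7.9), and the smallest is on MMLU Pro (+2.5).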

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 7aadf60b5bfb).

Extracted Evaluations (9 results)

Benchmark     Category           State   Score  Variant       Source
MBPP          coding             scored  72.8   0-shot, base  self-reported
HumanEval     coding             scored  72.6   0-shot        self-reported
BFCL          coding             scored  65.4   0-shot        self-reported
MMLU          general_knowledge  scored  73.0   0-shot, CoT   self-reported
MMLU-Pro      general_knowledge  scored  48.3   5-shot, CoT   self-reported
MATH          math               scored  51.9   0-shot, CoT   self-reported
MGSM          multilingual       scored  68.9   0-shot        self-reported
IFEval        reasoning          scored  80.4   -             self-reported
GPQA-Diamond  reasoning          scored  31.8   0-shot, CoT   self-reported