Llama 3.3 Model Card
What this is
Llama 3.3 is a 70B-parameter multilingual large language model released by Meta on December 6, 2024. It is an auto-regressive transformer available in pretrained and instruction-tuned variants, trained on approximately 15 trillion tokens with a knowledge cutoff of December 2023. The instruction-tuned version is optimized for multilingual dialogue and is positioned as a successor to Llama 3.1 70B Instruct.
Capabilities
Llama 3.3 70B Instruct scores 86.0 on MMLU, 50.5 on GPQA Diamond, 88.4 pass@1 on HumanEval, 77.0 on MATH, and 91.1 on MGSM multilingual math. It accepts and outputs multilingual text and code across 8 supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. The context window is 128k tokens, with Grouped-Query Attention for inference scalability.
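To give a rough sense of why Grouped-Query Attention matters at a 128k-token context, the sketch below estimates key/value cache size for a single sequence. The layer count, head counts, head dimension, and 16-bit cache precision are assumptions for illustration only; none of them are stated in this card, and 128k is taken here to mean 128 × 1024 tokens.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# ASSUMED configuration for illustration; not taken from the model card.
ASSUMED = dict(n_layers=80, head_dim=128)
CONTEXT = 128 * 1024  # assumed meaning of "128k tokens"

# Grouped-query attention: several query heads share each key/value head.
gqa = kv_cache_bytes(CONTEXT, n_kv_heads=8, **ASSUMED)   # assumed 8 KV heads
# Hypothetical baseline in which every query head keeps its own KV head.
mha = kv_cache_bytes(CONTEXT, n_kv_heads=64, **ASSUMED)  # assumed 64 heads

GIB = 1024 ** 3
print(f"GQA: {gqa / GIB:.0f} GiB, MHA: {mha / GIB:.0f} GiB, ratio: {mha // gqa}x")
```

Under these assumed values the cache shrinks by the ratio of query heads to key/value heads, which is the sense in which GQA helps inference scale to long contexts.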
Evaluation methodology
Meta evaluated the model on common use-case benchmarks covering chatbot, coding assistant, and tool-call scenarios using adversarial datasets, with Llama Guard 3 filtering both inputs and outputs. Capability-specific benchmarks addressed long context, multilingual tasks, tool calls, coding, and memorization. Red teaming was conducted on a recurring basis, with findings fed back into benchmark refinement and safety tuning datasets.
Safety testing
Red teaming was performed by experts in cybersecurity, adversarial machine learning, responsible AI, and integrity, alongside multilingual content specialists with market-specific backgrounds. CBRNE uplift testing assessed whether the model could "meaningfully increase the capabilities of malicious actors to plan or carry out attacks" using chemical, biological, radiological, nuclear, or explosive materials. Child safety assessments used objective-based methodologies across multiple attack vectors and supported languages. A cyber attack uplift study evaluated the model as an autonomous agent in offensive operations, specifically ransomware attack simulations without human intervention.
Mitigations
The instruction-tuned model uses supervised fine-tuning and reinforcement learning with human feedback (RLHF). Safety fine-tuning combines human-generated and synthetic data, with LLM-based classifiers selecting high-quality prompts and responses; refusal tone guidelines were explicitly incorporated. Meta provides Llama Guard 3, Prompt Guard, and Code Shield as system-level safeguards, included by default in all reference implementations. The card states the model "is not designed to be deployed in isolation" and that developers are responsible for adding guardrails appropriate to their use case.
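As a minimal sketch of what a system-level safeguard around the model might look like, the wrapper below screens the prompt, calls the model, and then screens the reply. Every name here (`guarded_generate`, `GuardedResult`, `toy_is_unsafe`, `toy_generate`) is hypothetical: the classifier and generator are stand-in callables, not the Llama Guard 3, Prompt Guard, or Code Shield APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "Sorry, I can't help with that request."


@dataclass
class GuardedResult:
    text: str
    blocked_at: Optional[str]  # "input", "output", or None if nothing was blocked


def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     is_unsafe: Callable[[str], bool]) -> GuardedResult:
    """Run a model call with a safety check before and after it."""
    # Screen the prompt first so unsafe requests never reach the model.
    if is_unsafe(prompt):
        return GuardedResult(REFUSAL, "input")
    reply = generate(prompt)
    # Screen the reply too, since a benign prompt can still yield unsafe output.
    if is_unsafe(reply):
        return GuardedResult(REFUSAL, "output")
    return GuardedResult(reply, None)


# Toy stand-ins (hypothetical): a keyword check and an echo "model".
def toy_is_unsafe(text: str) -> bool:
    return "ransomware" in text.lower()


def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("Hello there", toy_generate, toy_is_unsafe))
    print(guarded_generate("Write ransomware", toy_generate, toy_is_unsafe))
```

In a real deployment the `is_unsafe` callable would be backed by an actual safety classifier, and the refusal text would follow the application's own tone guidelines.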
Deployment and access
Llama 3.3 is released under a custom Llama 3.3 Community License Agreement permitting commercial and research use. The license covers use of model outputs for synthetic data generation and distillation to improve other models. Use that violates applicable laws or the Acceptable Use Policy, and use in unsupported languages without appropriate safeguards, are explicitly out of scope.
Limitations
Meta states that "testing conducted to date has not covered, nor could it cover, all scenarios" and that the model "may in some instances produce inaccurate, biased or other objectionable responses to user prompts." Output behavior cannot be predicted in advance across all deployment contexts. Only the 8 officially supported languages are described as meeting Meta's performance thresholds for safety and helpfulness; use in other languages is explicitly discouraged without fine-tuning and system controls.
What's new
The card does not include an explicit changelog, so the closest reference point is its comparison with Llama 3.1 70B Instruct. Against that model, Llama 3.3 70B Instruct shows improvements on MMLU Pro (68.9 vs. 66.4), IFEval (92.1 vs. 87.5), HumanEval (88.4 vs. 80.5), MATH (77.0 vs. 68.0), and MGSM (91.1 vs. 86.9).
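The size of each improvement is easier to compare as a point difference. The short sketch below computes those differences from the scores quoted above; the function names are illustrative only.

```python
# Scores copied from the comparison above:
# (Llama 3.3 70B Instruct, Llama 3.1 70B Instruct).
SCORES = {
    "MMLU Pro": (68.9, 66.4),
    "IFEval": (92.1, 87.5),
    "HumanEval": (88.4, 80.5),
    "MATH": (77.0, 68.0),
    "MGSM": (91.1, 86.9),
}


def score_deltas(scores):
    """Return {benchmark: absolute point gain}, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


def largest_gain(scores):
    """Return the benchmark with the biggest absolute point gain."""
    deltas = score_deltas(scores)
    return max(deltas, key=deltas.get)


if __name__ == "__main__":
    for name, delta in score_deltas(SCORES).items():
        print(f"{name}: +{delta} points")
    print("Largest gain:", largest_gain(SCORES))
```

On these figures the largest gains are on MATH (+9.0) and HumanEval (+7.9), and the smallest is on MMLU Pro (+2.5).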