Llama 3.1 Model Card

Model card · 3,304 words · 14 min read · Mar 31, 2026
Summary

A 570-word brief of a 3,304-word document. Published by Meta AI. Version dated Mar 31, 2026.
01

What this is

Meta AI released Llama 3.1 on July 23, 2024, as a collection of multilingual large language models in 8B, 70B, and 405B parameter sizes, superseding Llama 3. All variants accept multilingual text input and produce multilingual text and code output. Instruction-tuned versions are optimized for assistant-like dialogue; pretrained versions support broader natural language generation and downstream adaptation including synthetic data generation and distillation.

02

Capabilities

All three model sizes support a 128k-token context window and 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Pretraining used more than 15T tokens, with a knowledge cutoff of December 2023. The 405B instruct model scores 87.3 on MMLU (5-shot), 89.0 on HumanEval (pass@1), 96.8 on GSM-8K (CoT), and 92.0 on API-Bank (tool use). The 70B instruct model reaches 86.0 on MMLU (CoT) and 95.1 on GSM-8K (CoT).
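
HumanEval's pass@1 is the share of problems for which a sampled completion passes the unit tests. The summary does not say how Meta's internal library estimates it, so the sketch below assumes the unbiased pass@k estimator commonly used with HumanEval. The function names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are illustrative, not taken from the card.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass.

    Assumption: this is the commonly used unbiased estimator,
    1 - C(n - c, k) / C(n, k); the card does not state Meta's method.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, so with one sample per problem pass@1 is simply the fraction of problems solved.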

03

Evaluation methodology

Meta used an internal evaluations library across standard benchmarks including MMLU, HumanEval, GPQA, and multilingual MGSM; raw evaluation data is released publicly on Hugging Face. Adversarial evaluation datasets were constructed for common use cases (chatbot, coding assistant, tool calls) and for specific capabilities such as long context, multilingual handling, and memorization. Safety evaluations tested systems composed of Llama models paired with Llama Guard 3 filtering both input prompts and output responses.
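
The system-level setup described above, a model wrapped by a classifier that screens both prompts and responses, can be sketched as follows. This is a minimal illustration, not Llama Guard 3's actual interface: `GuardedSystem`, `is_unsafe`, `REFUSAL`, and the toy functions are hypothetical stand-ins so the example runs without model weights.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures: a generator maps a prompt to a response;
# a safety classifier returns True when it judges the text unsafe.
Generator = Callable[[str], str]
SafetyClassifier = Callable[[str], bool]

REFUSAL = "Sorry, I can't help with that."  # illustrative refusal text


@dataclass
class GuardedSystem:
    generate: Generator
    is_unsafe: SafetyClassifier

    def respond(self, prompt: str) -> str:
        # Screen the input prompt before the model sees it.
        if self.is_unsafe(prompt):
            return REFUSAL
        response = self.generate(prompt)
        # Screen the model output before it reaches the user.
        if self.is_unsafe(response):
            return REFUSAL
        return response


# Toy stand-ins for the language model and the safety classifier.
def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


def toy_is_unsafe(text: str) -> bool:
    return "forbidden" in text.lower()


system = GuardedSystem(generate=toy_generate, is_unsafe=toy_is_unsafe)
```

Evaluating the wrapped system rather than the bare model measures what a user of a deployed application would see, which is the framing the card uses for its safety evaluations.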

04

Safety testing

Recurring red-team exercises were conducted by experts in cybersecurity, adversarial machine learning, responsible AI, and multilingual content integrity. CBRN uplift testing assessed "whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks" using chemical or biological weapons. A separate cyber automation study evaluated Llama 3.1 405B as an autonomous agent in ransomware scenarios, and a social engineering study assessed its effectiveness for spear phishing; results are detailed in a companion cybersecurity whitepaper. Child safety assessments used objective-based methodologies across multiple attack vectors and all supported languages.

05

Mitigations

Instruction-tuned models are aligned using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). Safety fine-tuning draws on human-generated vendor data plus over 25M synthetically generated examples, filtered by LLM-based classifiers for quality and safety. Meta provides Llama Guard 3, Prompt Guard, and Code Shield as system-level safeguards, included by default in reference implementations. Refusal training covers both borderline and adversarial prompts and follows defined tone guidelines.
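
As a rough illustration of classifier-based filtering of synthetic fine-tuning data, the sketch below keeps only examples that clear both a quality and a safety threshold. The card does not describe the classifiers or thresholds Meta used; `Example`, `filter_synthetic`, the toy scorers, and the 0.5 cutoffs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Example:
    prompt: str
    response: str


# Hypothetical scorer signature: returns a score in [0, 1].
Scorer = Callable[[Example], float]


def filter_synthetic(
    examples: Iterable[Example],
    quality: Scorer,
    safety: Scorer,
    min_quality: float = 0.5,  # illustrative threshold, not from the card
    min_safety: float = 0.5,   # illustrative threshold, not from the card
) -> list[Example]:
    """Keep only examples that clear both classifier thresholds."""
    return [
        ex for ex in examples
        if quality(ex) >= min_quality and safety(ex) >= min_safety
    ]


# Toy scorers standing in for LLM-based classifiers.
def toy_quality(ex: Example) -> float:
    return 1.0 if len(ex.response.split()) >= 3 else 0.0


def toy_safety(ex: Example) -> float:
    return 0.0 if "unsafe" in ex.response.lower() else 1.0


candidates = [
    Example("Greet the user", "Hello, how can I help?"),
    Example("Greet the user", "Hi"),
    Example("Greet the user", "This is unsafe content"),
]
kept = filter_synthetic(candidates, toy_quality, toy_safety)
```

Here only the first candidate survives: the second fails the toy quality check and the third fails the toy safety check.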

06

Deployment and access

Models are released under the Llama 3.1 Community License, a custom commercial license permitting commercial and research use, including synthetic data generation and distillation. Use is prohibited where it violates applicable laws or the Acceptable Use Policy, or where it involves unsupported languages without appropriate fine-tuning and system controls. Meta states that models "are not designed to be deployed in isolation" and expects developers to implement additional safeguards appropriate to their use case.

07

Limitations

Meta states "testing conducted to date has not covered, nor could it cover, all scenarios," and that the model "may in some instances produce inaccurate, biased or other objectionable responses." Multilingual output in languages beyond the 8 supported may fall below safety and helpfulness performance thresholds. The card does not disclose quantitative safety benchmark pass/fail rates or specific refusal rate measurements.

08

What's new

Relative to Llama 3, Llama 3.1 introduces a 128k token context window, explicit multilingual support across 8 languages, substantially improved tool-use capabilities, and the addition of a 405B parameter model. Fine-tuning data now includes over 25M synthetically generated examples. The 8B instruct MATH (CoT) score rose from 29.1 to 51.9, and API-Bank tool-use accuracy for the 8B instruct model rose from 48.3 to 82.6 compared to Llama 3 8B Instruct.
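
The size of the 8B instruct improvements quoted above can be worked out directly from the two pairs of scores. The short sketch below does only that arithmetic; the `improvement` helper is an illustrative name, and the scores are the ones stated in this summary.

```python
# Scores quoted in the summary for the 8B instruct models.
scores = {
    "MATH (CoT)": (29.1, 51.9),  # (Llama 3 8B Instruct, Llama 3.1 8B Instruct)
    "API-Bank": (48.3, 82.6),
}


def improvement(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    gain = new - old
    return round(gain, 1), round(100.0 * gain / old, 1)


gains = {name: improvement(old, new) for name, (old, new) in scores.items()}
for name, (points, percent) in gains.items():
    print(f"{name}: +{points} points ({percent}% relative)")
```

The absolute gains are 22.8 points on MATH (CoT) and 34.3 points on API-Bank, or roughly 78% and 71% relative to the Llama 3 8B Instruct scores.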

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 1df4999f789b).

Extracted Evaluations (4 results)

| Benchmark | Category | State | Score | Variant | Source |
| --- | --- | --- | --- | --- | --- |
| MMLU | general_knowledge | scored | 66.7 | 5-shot | self-reported |
| MGSM | multilingual | scored | 68.9 | 0-shot CoT | self-reported |
| Multilingual MMLU | multilingual | scored | 62.1 | 5-shot Portuguese | self-reported |
| DROP | reasoning | scored | 58.4% | 3-shot | self-reported |