Model Card Explorer

Summary

Llama 3.1 Model Card

A 570-word brief of a 3,304-word document. Published by Meta AI. Version dated Mar 31, 2026.

What this is

Meta AI released Llama 3.1 on July 23, 2024, as a collection of multilingual large language models in 8B, 70B, and 405B parameter sizes, superseding Llama 3. All variants accept multilingual text input and produce multilingual text and code output. Instruction-tuned versions are optimized for assistant-like dialogue; pretrained versions support broader natural language generation and downstream adaptation including synthetic data generation and distillation.

Capabilities

All three model sizes support a 128k token context window and 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Pretraining used 15T+ tokens, with a knowledge cutoff of December 2023. The 405B instruct model scores 87.3 on MMLU (5-shot), 89.0 on HumanEval (pass@1), 96.8 on GSM-8K (CoT), and 92.0 on API-Bank tool-use. The 70B instruct model reaches 86.0 on MMLU (CoT) and 95.1 on GSM-8K (CoT).

Evaluation methodology

Meta used an internal evaluations library across standard benchmarks including MMLU, HumanEval, GPQA, and multilingual MGSM; raw evaluation data is released publicly on Hugging Face. Adversarial evaluation datasets were constructed for common use cases (chatbot, coding assistant, tool calls) and for specific capabilities such as long context, multilingual handling, and memorization. Safety evaluations tested systems composed of Llama models paired with Llama Guard 3 filtering both input prompts and output responses.
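The system-level setup described above, in which a safety classifier screens both the input prompt and the output response, can be sketched roughly as follows. This is a minimal illustration and not Meta's implementation: `guarded_generate`, `toy_is_unsafe`, `toy_generate`, and the refusal string are hypothetical names, and the keyword check only stands in for a real classifier such as Llama Guard 3.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal text


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> str:
    """Screen the prompt, call the model, then screen the response."""
    if is_unsafe(prompt):        # input filter
        return REFUSAL
    response = generate(prompt)
    if is_unsafe(response):      # output filter
        return REFUSAL
    return response


# Toy stand-ins (assumptions): a keyword check instead of a safety
# classifier, and an echo function instead of a language model.
def toy_is_unsafe(text: str) -> bool:
    return "forbidden" in text.lower()


def toy_generate(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("hello", toy_generate, toy_is_unsafe))
    print(guarded_generate("a forbidden topic", toy_generate, toy_is_unsafe))
```

Passing the model and the classifier in as callables keeps the two filtering steps visible and lets either component be swapped without touching the control flow.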

Safety testing

Recurring red-team exercises were conducted by experts in cybersecurity, adversarial machine learning, responsible AI, and multilingual content integrity. CBRN uplift testing assessed "whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks" using chemical or biological weapons. A separate cyber automation study evaluated Llama 3.1 405B as an autonomous agent in ransomware scenarios, and a social engineering study assessed its effectiveness for spear phishing; results are detailed in a companion cybersecurity whitepaper. Child safety assessments used objective-based methodologies across multiple attack vectors and all supported languages.

Mitigations

Instruction-tuned models are aligned using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). Safety fine-tuning draws on human-generated vendor data plus over 25M synthetically generated examples, filtered by LLM-based classifiers for quality and safety. Meta provides Llama Guard 3, Prompt Guard, and Code Shield as system-level safeguards, included by default in reference implementations. Refusal training covers both borderline and adversarial prompts and follows defined tone guidelines.
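Filtering synthetic examples with quality and safety classifiers, as described above, might look something like the sketch below. The card does not describe the classifiers or any thresholds, so `Example`, `filter_examples`, the scoring callables, and the 0.5 cutoffs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    prompt: str
    response: str


def filter_examples(
    examples: Iterable[Example],
    quality_score: Callable[[Example], float],
    safety_score: Callable[[Example], float],
    min_quality: float = 0.5,  # hypothetical threshold
    min_safety: float = 0.5,   # hypothetical threshold
) -> List[Example]:
    """Keep only the examples that pass both classifier thresholds."""
    return [
        ex
        for ex in examples
        if quality_score(ex) >= min_quality and safety_score(ex) >= min_safety
    ]


# Toy scorers (assumptions) standing in for LLM-based classifiers.
def toy_quality(ex: Example) -> float:
    return 1.0 if len(ex.response.split()) >= 3 else 0.0


def toy_safety(ex: Example) -> float:
    return 0.0 if "unsafe" in ex.response.lower() else 1.0
```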

Deployment and access

Models are released under the Llama 3.1 Community License, a custom commercial license that permits commercial and research use, including synthetic data generation and distillation. Use is prohibited where it violates applicable laws or the Acceptable Use Policy, or where it involves unsupported languages without appropriate fine-tuning and system controls. Meta states that models "are not designed to be deployed in isolation" and expects developers to implement additional safeguards appropriate to their use case.

Limitations

Meta states "testing conducted to date has not covered, nor could it cover, all scenarios," and that the model "may in some instances produce inaccurate, biased or other objectionable responses." Multilingual output in languages beyond the 8 supported may fall below safety and helpfulness performance thresholds. The card does not disclose quantitative safety benchmark pass/fail rates or specific refusal rate measurements.

What's new

Relative to Llama 3, Llama 3.1 introduces a 128k token context window, explicit multilingual support across 8 languages, substantially improved tool-use capabilities, and the addition of a 405B parameter model. Fine-tuning data now includes over 25M synthetically generated examples. Compared with Llama 3 8B Instruct, the 8B instruct model's MATH (CoT) score rose from 29.1 to 51.9, and its API-Bank tool-use accuracy rose from 48.3 to 82.6.
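The two 8B instruct comparisons quoted above can be expressed as absolute and relative gains. The short sketch below uses only the four scores given in the summary; the `gains` helper and the dictionary layout are illustrative choices, not part of the model card.

```python
from typing import Dict, Tuple

# Scores quoted in the summary: (Llama 3 8B Instruct, Llama 3.1 8B Instruct).
scores: Dict[str, Tuple[float, float]] = {
    "MATH (CoT)": (29.1, 51.9),
    "API-Bank": (48.3, 82.6),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    for name, (before, after) in scores.items():
        absolute, relative = gains(before, after)
        print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains work out to 22.8 points on MATH (CoT) and 34.3 points on API-Bank.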

Extracted Evaluations (17 results)

Benchmark	Category	State	Score	Setup	Source
	agent	scored	56.7 accuracy	0-shot, instruction-tuned, missing: method, missing: language	self-reported
/ plus_plus	coding	scored	86.0% pass@1	0-shot, instruction-tuned, missing: method, missing: language	self-reported
	coding	scored	84.8 accuracy	0-shot, instruction-tuned, missing: method, missing: language	self-reported
⚠ 7 others disagree	knowledge	scored	80.4% accuracy	5-shot, Italian, instruction-tuned, missing: method	self-reported
⚠ 7 others disagree	knowledge	scored	80.1% accuracy	5-shot, Portuguese, instruction-tuned, missing: method	self-reported
⚠ 7 others disagree	knowledge	scored	80.0% accuracy	5-shot, Spanish, instruction-tuned, missing: method	self-reported
⚠ 7 others disagree	knowledge	scored	79.8% accuracy	5-shot, French, instruction-tuned, missing: method	self-reported
⚠ 7 others disagree	knowledge	scored	79.3% accuracy	5-shot, German, instruction-tuned, missing: method	self-reported
⚠ 7 others disagree	knowledge	scored	74.5% accuracy	5-shot, Hindi, instruction-tuned, missing: method	self-reported
⚠ 7 others disagree	knowledge	scored	73.0% accuracy	5-shot, Thai, instruction-tuned, missing: method	self-reported
	math	scored	95.1% exact match	8-shot, CoT, instruction-tuned, missing: language	self-reported
	math	scored	68.0% exact match	0-shot, CoT, instruction-tuned, missing: language	self-reported
	multilingual	scored	86.9% exact match	0-shot, CoT, Average, instruction-tuned	self-reported
API-Bank	other	scored	90.0 accuracy	0-shot, instruction-tuned, missing: method, missing: language	self-reported
/ humaneval	other	scored	65.5 pass@1	0-shot, instruction-tuned, missing: method, missing: language	self-reported
/ mbpp	other	scored	62.0 pass@1	0-shot, instruction-tuned, missing: method, missing: language	self-reported
Gorilla Benchmark/ api_bench	other	scored	29.7 accuracy	0-shot, instruction-tuned, missing: method, missing: language	self-reported
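
As a quick aggregate over the seven per-language knowledge rows in the table (5-shot, instruction-tuned), the unweighted mean and the spread between the best and worst language can be computed as below. The values are transcribed from the table; the variable names are illustrative, and the unweighted mean is a convenience figure rather than a number reported in the model card.

```python
from statistics import mean

# Per-language 5-shot accuracy (%) from the knowledge rows of the table.
per_language = {
    "Italian": 80.4,
    "Portuguese": 80.1,
    "Spanish": 80.0,
    "French": 79.8,
    "German": 79.3,
    "Hindi": 74.5,
    "Thai": 73.0,
}

values = list(per_language.values())
average = round(mean(values), 1)            # unweighted mean across languages
spread = round(max(values) - min(values), 1)  # best minus worst language
best = max(per_language, key=per_language.get)
worst = min(per_language, key=per_language.get)

if __name__ == "__main__":
    print(f"mean={average}, spread={spread}, best={best}, worst={worst}")
```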
