Model Card Explorer

Summary

Llama 3.1 Technical Paper

A 627-word brief of a 53,127-word document. Published by Meta AI. Version dated Mar 31, 2026.

What this is

Llama 3 — all results in the paper are for the Llama 3.1 release — is a herd of dense Transformer language models from Meta AI, publicly released July 23, 2024, superseding Llama 2. The herd ships in three sizes (8B, 70B, and 405B parameters), with the flagship 405B model supporting a context window of up to 128K tokens. It is designed for multilinguality, coding, reasoning, and tool use, and is released in both pre-trained and post-trained variants under the Llama 3 Community License.

Capabilities

The post-trained 405B model scores 87.3 on MMLU (5-shot), 89.0 on HumanEval (0-shot), 96.8 on GSM8K (8-shot, CoT), 73.8 on MATH (0-shot, CoT), 96.9 on ARC Challenge (0-shot), and 51.1 on GPQA (0-shot, CoT). It answers questions in at least eight languages, generates and executes code across ten priority programming languages, and supports zero-shot and multi-step tool use including a search engine, Python interpreter, and Wolfram Alpha API. The 8B and 70B models are described as best-in-class at their parameter scales, outperforming comparable open models on virtually every benchmark category evaluated.

Evaluation methodology

Pre-trained models are assessed across eight benchmark categories (commonsense reasoning, knowledge, reading comprehension, math and reasoning, long context, code, adversarial, and aggregate), with scores reported alongside 95% confidence intervals computed as 1.96 × √(S(1−S)/N), where S is the observed benchmark score expressed as a fraction and N is the number of samples in the benchmark. A contamination analysis following Singh et al. (2024) estimates the extent to which pre-training data overlaps with evaluation sets, though the paper acknowledges that all contamination methods can suffer from false positives and negatives. Post-trained models are additionally evaluated via human comparisons against competing models. Adversarial benchmarks (Adversarial SQuAD, Dynabench SQuAD, GSM-Plus, PAWS) probe robustness and potential benchmark overfitting.
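As a worked illustration of the confidence-interval formula in the paragraph above, the following minimal sketch computes the 95% half-width for a single benchmark score. The function name `ci95_half_width` and the example inputs (a score of 0.873 on 1,000 items) are hypothetical and chosen only to show the arithmetic; the benchmark sizes used in the paper are not given in this summary.

```python
import math


def ci95_half_width(score: float, n: int) -> float:
    """Half-width of the 95% interval, 1.96 * sqrt(S * (1 - S) / N).

    `score` is the benchmark score as a fraction in [0, 1];
    `n` is the number of benchmark samples.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a fraction in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.96 * math.sqrt(score * (1.0 - score) / n)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    half = ci95_half_width(0.873, 1000)
    print(f"87.3% +/- {100 * half:.1f} points")
```

Because the half-width shrinks with √N, the same score on a benchmark with four times as many items would carry an interval half as wide.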

Safety testing

The paper states that "a detailed analysis of the safety of Llama 3" appears in Section 5.4, which is not included in the provided source text. No red-team scope, CBRN evaluations, autonomy thresholds, or specific catastrophic-risk findings are described in the available text. The paper notes the model delivers "a much better balance between helpfulness and harmlessness than its predecessor" and that safety mitigations are incorporated at the post-training stage.

Mitigations

Meta co-releases Llama Guard 3 for input and output safety classification alongside the language models. Pre-training data pipelines filter out domains likely to contain PII, content ranked as harmful under Meta safety standards, and known adult-content domains. A knowledge probing technique generates refusal training data by identifying questions that the model consistently answers incorrectly, aligning the model to "know what it knows" rather than hallucinate. Post-training safety mitigations are incorporated across multiple rounds of supervised finetuning and DPO.
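The knowledge probing step described above can be sketched roughly as follows. This is an illustrative outline, not the paper's pipeline: the `ProbeResult` structure, the all-samples-wrong threshold, the placeholder refusal text, and the assumption that each sampled answer has already been judged for correctness are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    question: str
    # Hypothetical: one correctness judgment per sampled model answer.
    correct_flags: list[bool]


def is_consistently_wrong(result: ProbeResult, max_correct_rate: float = 0.0) -> bool:
    """True when the share of correct samples is at or below the threshold."""
    if not result.correct_flags:
        return False  # no samples, so no evidence either way
    correct_rate = sum(result.correct_flags) / len(result.correct_flags)
    return correct_rate <= max_correct_rate


def build_refusal_examples(
    results: list[ProbeResult],
    refusal_text: str = "I'm not sure about that.",  # placeholder wording
) -> list[dict[str, str]]:
    """Pair each consistently-wrong question with a refusal response."""
    return [
        {"prompt": r.question, "response": refusal_text}
        for r in results
        if is_consistently_wrong(r)
    ]
```

Questions the model sometimes answers correctly are left out of the refusal set in this sketch, so that the model is not trained to decline questions it can in fact answer.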

Deployment and access

All three model sizes (8B, 70B, 405B) are publicly released in both pre-trained and post-trained variants at llama.meta.com under the Llama 3 Community License. Multimodal extensions integrating image, video, and speech capabilities are described in the paper but are explicitly noted as "still under development and not yet ready for release." Core tool integrations (search, Python interpreter, Wolfram Alpha) can be individually enabled or disabled via the system prompt.

Limitations

Adversarial performance on mathematical reasoning and question answering is substantially lower than non-adversarial performance across all three model sizes; post-training does not close this gap. Contamination analysis methodology is acknowledged as an open research problem susceptible to false positives and negatives. Annealing improvements on GSM8K and MATH are negligible for the 405B model, in contrast to meaningful gains for the 8B model. Multimodal models integrating vision and speech are explicitly flagged as not production-ready.

What's new

Llama 3.1 adds native multilingual support (eight languages), a 128K-token context window, and tool use capabilities not present in the April 2024 Llama 3 8B/70B releases. Pre-training data scales from 1.8T tokens (Llama 2) to 15.6T tokens, and flagship training compute reaches 3.8×10²⁵ FLOPs — approximately 50× the largest Llama 2 model. A new 128K-token vocabulary tokenizer improves English compression from 3.17 to 3.94 characters per token.
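To put the figures above in proportion, the short sketch below derives two ratios from them: the growth in pre-training data and the share of tokens saved by the new tokenizer on the same English text. The constant and function names are illustrative; only the four input numbers come from the summary.

```python
# Figures quoted in the summary above.
LLAMA2_TOKENS_T = 1.8        # pre-training tokens, in trillions
LLAMA3_TOKENS_T = 15.6
OLD_CHARS_PER_TOKEN = 3.17   # English compression, previous tokenizer
NEW_CHARS_PER_TOKEN = 3.94   # English compression, new tokenizer


def scale_factor(old: float, new: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


def token_reduction(old_cpt: float, new_cpt: float) -> float:
    """Fraction of tokens saved when encoding the same text.

    Token count is text length divided by characters per token, so the
    ratio of new to old token counts is old_cpt / new_cpt.
    """
    return 1.0 - old_cpt / new_cpt


if __name__ == "__main__":
    data_growth = scale_factor(LLAMA2_TOKENS_T, LLAMA3_TOKENS_T)
    saved = token_reduction(OLD_CHARS_PER_TOKEN, NEW_CHARS_PER_TOKEN)
    print(f"Pre-training data: {data_growth:.1f}x more tokens")
    print(f"Tokenizer: {100 * saved:.1f}% fewer tokens for the same English text")
```

Under these figures the data grows by a factor of roughly 8.7, and the same English text needs roughly a fifth fewer tokens. The 50× compute comparison cannot be rederived here because the summary does not state the Llama 2 FLOP count.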

Category	State	Score	Setup	Source
agent	scored	38.5	missing: shot count; missing: method; missing: language; missing: training state	self-reported
coding	scored	76.1	missing: shot count; missing: method; missing: language; missing: training state	self-reported
coding	scored	72.8%	0-shot; missing: method; missing: language; missing: training state	self-reported
coding	scored	72.6%	0-shot; missing: method; missing: language; missing: training state	self-reported
coding	scored	67.1% pass at 1	missing: shot count; missing: method; missing: language; missing: training state	self-reported
coding	scored	62.8%	TypeScript; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	60.8% pass at 1	missing: shot count; missing: method; missing: language; missing: training state	self-reported
coding	scored	58.2%	Java; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	56.6%	TypeScript; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	55.7%	PHP; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	54.7%	PHP; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	54.4%	Java; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	53.7%	C++; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	52.8%	C++; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	43.3%	C#; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	39.2%	Shell; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	38.0%	C#; missing: shot count; missing: method; missing: training state	self-reported
coding	scored	33.0%	Shell; missing: shot count; missing: method; missing: training state	self-reported
instruction_following	scored	80.4%	missing: shot count; missing: method; missing: language; missing: training state	self-reported
knowledge	scored	73.0%	0-shot; CoT; missing: language; missing: training state	self-reported
knowledge	scored	69.4%	5-shot; missing: method; missing: language; missing: training state	self-reported
knowledge	scored	48.3%	5-shot; CoT; missing: language; missing: training state	self-reported
long_context	scored	98.8	missing: shot count; missing: method; missing: language; missing: training state	self-reported
long_context	scored	81.0	missing: shot count; missing: method; missing: language; missing: training state	self-reported
long_context	scored	65.1	missing: shot count; missing: method; missing: language; missing: training state	self-reported
long_context	scored	56.0	5-shot; pretrained; missing: method; missing: language	self-reported
math	scored	84.5%	8-shot; CoT; missing: language; missing: training state	self-reported
math	scored	60.0%	16-shot; pretrained; missing: method; missing: language	self-reported
math	scored	51.9%	0-shot; CoT; missing: language; missing: training state	self-reported
multilingual	scored	68.9%	0-shot; CoT; missing: language; missing: training state	self-reported
multilingual	scored	58.6	5-shot; missing: method; missing: language; missing: training state	self-reported
reasoning	scored	83.4%	0-shot; missing: method; missing: language; missing: training state	self-reported
reasoning	scored	32.8%	0-shot; CoT; missing: language; missing: training state	self-reported

Extracted Evaluations (33 results)