Llama 3.1 Technical Paper

Model card · 53,127 words · 231 min read · Mar 31, 2026
Summary

A 627-word brief of a 53,127-word document. Published by Meta AI. Version dated Mar 31, 2026.
01

What this is

Llama 3 — all results in the paper are for the Llama 3.1 release — is a herd of dense Transformer language models from Meta AI, publicly released July 23, 2024, superseding Llama 2. The herd ships in three sizes (8B, 70B, and 405B parameters), with the flagship 405B model supporting a context window of up to 128K tokens. It is designed for multilinguality, coding, reasoning, and tool use, and is released in both pre-trained and post-trained variants under the Llama 3 Community License.

02

Capabilities

The post-trained 405B model scores 87.3 on MMLU (5-shot), 89.0 on HumanEval (0-shot), 96.8 on GSM8K (8-shot, CoT), 73.8 on MATH (0-shot, CoT), 96.9 on ARC Challenge (0-shot), and 51.1 on GPQA (0-shot, CoT). It answers questions in at least eight languages, generates and executes code across ten priority programming languages, and supports zero-shot and multi-step tool use including a search engine, Python interpreter, and Wolfram Alpha API. The 8B and 70B models are described as best-in-class at their parameter scales, outperforming comparable open models on virtually every benchmark category evaluated.

03

Evaluation methodology

Pre-trained models are assessed across eight benchmark categories (commonsense reasoning, knowledge, reading comprehension, math and reasoning, long context, code, adversarial, and aggregate), with scores reported alongside 95% confidence intervals computed as 1.96 × √(S(1−S)/N), where S is the observed benchmark score and N is the benchmark's sample size. A contamination analysis following Singh et al. (2024) estimates the extent to which pre-training data overlaps with evaluation sets, though the paper acknowledges that all contamination methods can suffer from false positives and negatives. Post-trained models are additionally evaluated via human comparisons against competing models. Adversarial benchmarks (Adversarial SQuAD, Dynabench SQuAD, GSM-Plus, PAWS) probe robustness and potential benchmark overfitting.
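The confidence-interval formula above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `ci_half_width` and the sample size used in the usage lines are assumptions chosen for the example, and the score is expressed as a fraction in [0, 1].

```python
import math


def ci_half_width(score: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval.

    score: observed benchmark score as a fraction in [0, 1] (S in the text).
    n:     number of benchmark samples (N in the text).
    z:     critical value; 1.96 corresponds to a 95% interval.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a fraction between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive sample count")
    return z * math.sqrt(score * (1.0 - score) / n)


if __name__ == "__main__":
    # Hypothetical sample size; the summary does not state N for any benchmark.
    s, n = 0.873, 1000
    half = ci_half_width(s, n)
    print(f"score = {s:.3f} +/- {half:.3f}")
```

Because the half-width shrinks with √N, small benchmarks produce wide intervals, which is worth keeping in mind when comparing scores that differ by only a point or two.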

04

Safety testing

The paper states that "a detailed analysis of the safety of Llama 3" appears in Section 5.4, which is not included in the provided source text. No red-team scope, CBRN evaluations, autonomy thresholds, or specific catastrophic-risk findings are described in the available text. The paper notes the model delivers "a much better balance between helpfulness and harmlessness than its predecessor" and that safety mitigations are incorporated at the post-training stage.

05

Mitigations

Meta co-releases Llama Guard 3 for input and output safety classification alongside the language models. Pre-training data pipelines filter out domains likely to contain PII, content ranked as harmful under Meta safety standards, and known adult-content domains. A knowledge probing technique generates refusal training data by identifying questions that the model consistently answers incorrectly across sampled generations, aligning the model to "know what it knows" rather than hallucinate. Post-training safety mitigations are incorporated across multiple rounds of supervised finetuning and DPO.
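The selection step in knowledge probing can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: the `ProbeResult` type, the `select_refusal_candidates` function, and both thresholds are hypothetical, and the correctness flags are assumed to come from some external judge that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    question: str
    # One flag per sampled generation: True if a judge marked it correct.
    correct_flags: list[bool]


def select_refusal_candidates(
    results: list[ProbeResult],
    max_correct_rate: float = 0.0,
    min_samples: int = 2,
) -> list[str]:
    """Return questions whose sampled answers are consistently incorrect.

    A question qualifies when it has at least `min_samples` generations and
    the fraction judged correct is at most `max_correct_rate`. Both
    thresholds are illustrative assumptions.
    """
    selected = []
    for result in results:
        if len(result.correct_flags) < min_samples:
            continue  # too few samples to call the failure "consistent"
        correct_rate = sum(result.correct_flags) / len(result.correct_flags)
        if correct_rate <= max_correct_rate:
            selected.append(result.question)
    return selected
```

The selected questions would then be paired with refusal responses to form training data; how those refusals are written is outside the scope of this sketch.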

06

Deployment and access

All three model sizes (8B, 70B, 405B) are publicly released in both pre-trained and post-trained variants at llama.meta.com under the Llama 3 Community License. Multimodal extensions integrating image, video, and speech capabilities are described in the paper but are explicitly noted as "still under development and not yet ready for release." Core tool integrations (search, Python interpreter, Wolfram Alpha) must be individually enabled or disabled via system prompt.

07

Limitations

Adversarial performance on mathematical reasoning and question answering is substantially lower than non-adversarial performance across all three model sizes; post-training does not close this gap. Contamination analysis methodology is acknowledged as an open research problem susceptible to false positives and negatives. Annealing improvements on GSM8K and MATH are negligible for the 405B model, in contrast to meaningful gains for the 8B model. Multimodal models integrating vision and speech are explicitly flagged as not production-ready.

08

What's new

Llama 3.1 adds native multilingual support (eight languages), a 128K-token context window, and tool use capabilities not present in the April 2024 Llama 3 8B/70B releases. Pre-training data scales from 1.8T tokens (Llama 2) to 15.6T tokens, and flagship training compute reaches 3.8×10²⁵ FLOPs — approximately 50× the largest Llama 2 model. A new 128K-token vocabulary tokenizer improves English compression from 3.17 to 3.94 characters per token.
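The scale figures quoted above can be turned into ratios with a little arithmetic. The sketch below uses only the numbers stated in the summary; the helper names are assumptions introduced for illustration.

```python
def relative_gain(old: float, new: float) -> float:
    """Fractional increase from old to new, e.g. 0.25 means +25%."""
    return (new - old) / old


def token_reduction(old_chars_per_token: float, new_chars_per_token: float) -> float:
    """Fraction of tokens saved when encoding the same text at the new rate."""
    return 1.0 - old_chars_per_token / new_chars_per_token


# Figures quoted in the summary.
OLD_CHARS_PER_TOKEN = 3.17
NEW_CHARS_PER_TOKEN = 3.94
LLAMA2_TOKENS_T = 1.8    # trillions of pre-training tokens
LLAMA3_TOKENS_T = 15.6   # trillions of pre-training tokens

compression_gain = relative_gain(OLD_CHARS_PER_TOKEN, NEW_CHARS_PER_TOKEN)
fewer_tokens = token_reduction(OLD_CHARS_PER_TOKEN, NEW_CHARS_PER_TOKEN)
data_scale = LLAMA3_TOKENS_T / LLAMA2_TOKENS_T

if __name__ == "__main__":
    print(f"characters per token: +{compression_gain:.1%}")
    print(f"tokens for the same English text: -{fewer_tokens:.1%}")
    print(f"pre-training data: {data_scale:.2f}x Llama 2")
```

In other words, the new tokenizer packs roughly a quarter more English characters into each token, so the same text needs about a fifth fewer tokens, and the pre-training corpus is a little under nine times the size of Llama 2's.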

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original document (source SHA 29c9791daa55).

Extracted Evaluations (207 results)

Benchmark | Category | State | Score | Variant | Source
Nexus | agent | scored | 58.7 | - | self-reported
Nexus | agent | scored | 56.7 | - | self-reported
Nexus | agent | scored | 56.1 | - | self-reported
Nexus | agent | scored | 50.3 | - | self-reported
Nexus | agent | scored | 48.5 | - | self-reported
Nexus | agent | scored | 45.7 | - | self-reported
Nexus | agent | scored | 38.5 | - | self-reported
Nexus | agent | scored | 37.2 | - | self-reported
Nexus | agent | scored | 30.0 | - | self-reported
Nexus | agent | scored | 24.7 | - | self-reported
HumanEval | coding | scored | 92.0 | 0-shot | self-reported
MBPP | coding | scored | 90.5 | 0-shot | self-reported
BFCL | coding | scored | 90.2 | - | self-reported
HumanEval | coding | scored | 90.2 | 0-shot | self-reported
HumanEval | coding | scored | 89.0 | 0-shot | self-reported
MBPP | coding | scored | 88.6 | 0-shot | self-reported
BFCL | coding | scored | 88.5 | - | self-reported
BFCL | coding | scored | 88.3 | - | self-reported
MBPP | coding | scored | 87.8 | 0-shot | self-reported
HumanEval | coding | scored | 86.6 | 0-shot | self-reported
BFCL | coding | scored | 86.5 | - | self-reported
MBPP | coding | scored | 86.0 | 0-shot | self-reported
HumanEval | coding | scored | 86.0 | pass@1 | self-reported
BFCL | coding | scored | 85.9 | - | self-reported
BFCL | coding | scored | 84.8 | - | self-reported
MBPP | coding | scored | 83.6 | 0-shot | self-reported
HumanEval | coding | scored | 82.3 | pass@1 | self-reported
HumanEval | coding | scored | 82.3 | pass@1 | self-reported
HumanEval | coding | scored | 82.0 | - | self-reported
MBPP | coding | scored | 82.0 | 0-shot | self-reported
MBPP | coding | scored | 81.4 | pass@1 | self-reported
BFCL | coding | scored | 80.5 | - | self-reported
HumanEval | coding | scored | 80.5 | 0-shot | self-reported
MBPP | coding | scored | 80.2 | pass@1 | self-reported
MBPP | coding | scored | 78.8 | pass@1 | self-reported
MBPP | coding | scored | 78.6 | 0-shot | self-reported
HumanEval | coding | scored | 77.4 | pass@1 | self-reported
MBPP | coding | scored | 76.6 | pass@1 | self-reported
BFCL | coding | scored | 76.1 | - | self-reported
HumanEval | coding | scored | 75.6 | 0-shot | self-reported
MBPP | coding | scored | 75.4 | pass@1 | self-reported
MBPP | coding | scored | 75.4 | pass@1 | self-reported
HumanEval | coding | scored | 74.4 | pass@1 | self-reported
HumanEval | coding | scored | 73.2 | 0-shot | self-reported
MBPP | coding | scored | 72.8 | 0-shot | self-reported
MBPP | coding | scored | 72.8 | 0-shot | self-reported
HumanEval | coding | scored | 72.6 | 0-shot | self-reported
MBPP | coding | scored | 71.7 | 0-shot | self-reported
HumanEval | coding | scored | 71.4 | - | self-reported
MBPP | coding | scored | 71.2 | pass@1 | self-reported
HumanEval | coding | scored | 68.3 | pass@1 | self-reported
HumanEval | coding | scored | 68.0 | 0-shot | self-reported
MBPP | coding | scored | 67.5 | - | self-reported
HumanEval | coding | scored | 67.1 | pass@1 | self-reported
MBPP | coding | scored | 66.2 | pass@1 | self-reported
MBPP | coding | scored | 65.2 | - | self-reported
HumanEval | coding | scored | 64.0 | pass@1 | self-reported
HumanEval | coding | scored | 62.8 | pass@1 | self-reported
MBPP | coding | scored | 62.8 | TypeScript | self-reported
MBPP | coding | scored | 60.8 | pass@1 | self-reported
BFCL | coding | scored | 60.4 | - | self-reported
MBPP | coding | scored | 59.2 | pass@1 | self-reported
HumanEval | coding | scored | 58.2 | Java | self-reported
HumanEval | coding | scored | 56.6 | TypeScript | self-reported
MBPP | coding | scored | 55.7 | PHP | self-reported
HumanEval | coding | scored | 54.7 | PHP | self-reported
MBPP | coding | scored | 54.4 | Java | self-reported
HumanEval | coding | scored | 54.3 | 0-shot | self-reported
MBPP | coding | scored | 53.7 | C++ | self-reported
HumanEval | coding | scored | 52.8 | C++ | self-reported
MBPP | coding | scored | 49.5 | 0-shot | self-reported
HumanEval | coding | scored | 48.8 | pass@1 | self-reported
MBPP | coding | scored | 43.3 | C# | self-reported
MBPP | coding | scored | 42.6 | pass@1 | self-reported
HumanEval | coding | scored | 40.2 | 0-shot | self-reported
HumanEval | coding | scored | 39.2 | Shell | self-reported
HumanEval | coding | scored | 38.0 | C# | self-reported
MBPP | coding | scored | 33.0 | Shell | self-reported
HumanEval | coding | scored | 32.3 | pass@1 | self-reported
ARC-Challenge | general_knowledge | scored | 96.9 | 0-shot | self-reported
ARC-Challenge | general_knowledge | scored | 96.7 | 0-shot | self-reported
ARC-Challenge | general_knowledge | scored | 96.7 | 0-shot | self-reported
ARC-Challenge | general_knowledge | scored | 96.4 | 0-shot | self-reported
ARC-Challenge | general_knowledge | scored | 94.8 | 0-shot | self-reported
ARC-Challenge | general_knowledge | scored | 94.6 | 0-shot | self-reported
MMLU | general_knowledge | scored | 89.9 | 5-shot | self-reported
MMLU | general_knowledge | scored | 89.1 | 5-shot | self-reported
MMLU | general_knowledge | scored | 88.7 | 0-shot, CoT | self-reported
ARC-Challenge | general_knowledge | scored | 88.7 | 0-shot | self-reported
MMLU | general_knowledge | scored | 88.6 | 0-shot, CoT | self-reported
MMLU | general_knowledge | scored | 88.3 | 0-shot, CoT | self-reported
ARC-Challenge | general_knowledge | scored | 87.6 | 0-shot | self-reported
MMLU | general_knowledge | scored | 87.3 | 5-shot | self-reported
MMLU | general_knowledge | scored | 86.0 | 0-shot, CoT | self-reported
MMLU | general_knowledge | scored | 85.4 | 0-shot, CoT | self-reported
MMLU | general_knowledge | scored | 85.1 | 5-shot | self-reported
ARC-Challenge | general_knowledge | scored | 83.7 | 0-shot | self-reported
MMLU | general_knowledge | scored | 83.6 | 5-shot | self-reported
ARC-Challenge | general_knowledge | scored | 83.4 | 0-shot | self-reported
MMLU | general_knowledge | scored | 82.6 | 5-shot | self-reported
MMLU | general_knowledge | scored | 79.9 | 0-shot, CoT | self-reported
MMLU | general_knowledge | scored | 78.7 | 0-shot (no CoT) | self-reported
MMLU-Pro | general_knowledge | scored | 77.0 | 5-shot, CoT | self-reported
MMLU | general_knowledge | scored | 76.9 | 5-shot | self-reported
ARC-Challenge | general_knowledge | scored | 74.2 | 0-shot | self-reported
MMLU-Pro | general_knowledge | scored | 74.0 | 5-shot, CoT | self-reported
MMLU-Pro | general_knowledge | scored | 73.3 | 5-shot, CoT | self-reported
MMLU | general_knowledge | scored | 73.0 | 0-shot, CoT | self-reported
MMLU | general_knowledge | scored | 72.3 | 5-shot (no CoT) | self-reported
MMLU | general_knowledge | scored | 72.3 | 5-shot | self-reported
MMLU | general_knowledge | scored | 70.7 | 5-shot | self-reported
MMLU | general_knowledge | scored | 69.8 | 0-shot, CoT | self-reported
MMLU | general_knowledge | scored | 69.4 | 5-shot | self-reported
MMLU-Pro | general_knowledge | scored | 66.4 | 5-shot, CoT | self-reported
MMLU-Pro | general_knowledge | scored | 64.8 | 5-shot, CoT | self-reported
MMLU-Pro | general_knowledge | scored | 62.7 | 5-shot, CoT | self-reported
MMLU | general_knowledge | scored | 61.1 | 5-shot | self-reported
MMLU | general_knowledge | scored | 60.5 | 0-shot, CoT | self-reported
MMLU-Pro | general_knowledge | scored | 56.3 | 5-shot, CoT | self-reported
MMLU-Pro | general_knowledge | scored | 49.2 | 5-shot, CoT | self-reported
MMLU-Pro | general_knowledge | scored | 48.3 | 5-shot, CoT | self-reported
MMLU-Pro | general_knowledge | scored | 36.9 | 5-shot, CoT | self-reported
Needle-in-a-Haystack Multi | long_context | scored | 100.0 | - | self-reported
Needle-in-a-Haystack Multi | long_context | scored | 100.0 | - | self-reported
Needle-in-a-Haystack Multi | long_context | scored | 98.8 | - | self-reported
Needle-in-a-Haystack Multi | long_context | scored | 98.1 | - | self-reported
Needle-in-a-Haystack Multi | long_context | scored | 97.5 | - | self-reported
QuALITY | long_context | scored | 95.2 | - | self-reported
QuALITY | long_context | scored | 95.2 | - | self-reported
Needle-in-a-Haystack Multi | long_context | scored | 90.8 | - | self-reported
QuALITY | long_context | scored | 90.5 | - | self-reported
QuALITY | long_context | scored | 90.5 | - | self-reported
QuALITY | long_context | scored | 90.5 | - | self-reported
QuALITY | long_context | scored | 87.6 | 5-shot, pre-trained | self-reported
InfiniteBench En.MC | long_context | scored | 83.4 | - | self-reported
QuALITY | long_context | scored | 82.8 | 5-shot, pre-trained | self-reported
InfiniteBench En.MC | long_context | scored | 82.5 | - | self-reported
QuALITY | long_context | scored | 81.0 | - | self-reported
InfiniteBench En.MC | long_context | scored | 78.2 | - | self-reported
InfiniteBench En.MC | long_context | scored | 72.1 | - | self-reported
InfiniteBench En.MC | long_context | scored | 65.1 | - | self-reported
QuALITY | long_context | scored | 56.0 | 5-shot, pre-trained | self-reported
QuALITY | long_context | scored | 56.0 | 5-shot, pre-trained | self-reported
GSM8K | math | scored | 96.8 | 8-shot, CoT | self-reported
GSM8K | math | scored | 96.4 | 0-shot | self-reported
GSM8K | math | scored | 96.1 | 8-shot, CoT | self-reported
GSM8K | math | scored | 95.1 | 8-shot, CoT | self-reported
GSM8K | math | scored | 94.2 | 8-shot, CoT | self-reported
GSM8K | math | scored | 92.3 | 0-shot | self-reported
GSM8K | math | scored | 90.0 | 16-shot, pre-trained | self-reported
GSM8K | math | scored | 88.2 | 8-shot, CoT | self-reported
GSM8K | math | scored | 84.5 | 8-shot, CoT | self-reported
GSM8K | math | scored | 83.0 | 16-shot, pre-trained | self-reported
GSM8K | math | scored | 81.6 | 8-shot, CoT | self-reported
GSM8K | math | scored | 76.7 | 8-shot, CoT | self-reported
MATH | math | scored | 76.6 | 0-shot, CoT | self-reported
MATH | math | scored | 73.8 | 0-shot, CoT | self-reported
MATH | math | scored | 71.1 | 0-shot, CoT | self-reported
MATH | math | scored | 68.0 | 0-shot, CoT | self-reported
MATH | math | scored | 64.5 | 0-shot, CoT | self-reported
GSM8K | math | scored | 60.0 | 16-shot, pre-trained | self-reported
GSM8K | math | scored | 60.0 | 16-shot, pre-trained | self-reported
MATH | math | scored | 54.1 | 0-shot, CoT | self-reported
GSM8K | math | scored | 53.2 | 8-shot, CoT | self-reported
MATH | math | scored | 51.9 | 0-shot, CoT | self-reported
MATH | math | scored | 44.3 | 0-shot, CoT | self-reported
MATH | math | scored | 43.1 | 0-shot, CoT | self-reported
MATH | math | scored | 41.1 | 0-shot, CoT | self-reported
MATH | math | scored | 13.0 | 0-shot, CoT | self-reported
MGSM | multilingual | scored | 91.6 | 0-shot, CoT | self-reported
MGSM | multilingual | scored | 91.6 | 0-shot, CoT | self-reported
MGSM | multilingual | scored | 90.5 | 0-shot, CoT | self-reported
MGSM | multilingual | scored | 86.9 | 0-shot, CoT | self-reported
MGSM | multilingual | scored | 85.9 | 0-shot, CoT | self-reported
Multilingual MMLU | multilingual | scored | 85.5 | 5-shot | self-reported
Multilingual MMLU | multilingual | scored | 83.2 | 5-shot | self-reported
Multilingual MMLU | multilingual | scored | 80.2 | 5-shot | self-reported
Multilingual MMLU | multilingual | scored | 78.2 | 5-shot | self-reported
MGSM | multilingual | scored | 71.1 | 0-shot, CoT | self-reported
MGSM | multilingual | scored | 68.9 | 0-shot, CoT | self-reported
Multilingual MMLU | multilingual | scored | 64.3 | 5-shot | self-reported
Multilingual MMLU | multilingual | scored | 58.8 | 5-shot | self-reported
Multilingual MMLU | multilingual | scored | 58.6 | 5-shot | self-reported
MGSM | multilingual | scored | 53.2 | 0-shot, CoT | self-reported
MGSM | multilingual | scored | 51.4 | 0-shot, CoT | self-reported
Multilingual MMLU | multilingual | scored | 46.8 | 5-shot | self-reported
MGSM | multilingual | scored | 29.9 | 0-shot, CoT | self-reported
IFEval | reasoning | scored | 88.6 | - | self-reported
IFEval | reasoning | scored | 88.0 | - | self-reported
IFEval | reasoning | scored | 87.5 | - | self-reported
IFEval | reasoning | scored | 85.6 | - | self-reported
IFEval | reasoning | scored | 85.1 | - | self-reported
IFEval | reasoning | scored | 84.3 | - | self-reported
IFEval | reasoning | scored | 80.4 | - | self-reported
IFEval | reasoning | scored | 73.6 | - | self-reported
IFEval | reasoning | scored | 72.7 | - | self-reported
IFEval | reasoning | scored | 69.9 | - | self-reported
GPQA | reasoning | scored | 59.4 | 0-shot, CoT | self-reported
IFEval | reasoning | scored | 57.6 | - | self-reported
GPQA | reasoning | scored | 53.6 | 0-shot, CoT | self-reported
GPQA | reasoning | scored | 51.1 | 0-shot, CoT | self-reported
GPQA | reasoning | scored | 46.7 | 0-shot, CoT | self-reported
GPQA | reasoning | scored | 41.4 | 0-shot, CoT | self-reported
GPQA | reasoning | scored | 33.3 | 0-shot, CoT | self-reported
GPQA | reasoning | scored | 32.8 | 0-shot, CoT | self-reported
GPQA | reasoning | scored | 30.8 | 0-shot, CoT | self-reported
GPQA | reasoning | scored | 28.8 | 0-shot, CoT | self-reported