Llama 3.1 Technical Paper
What this is
Llama 3 is a herd of dense Transformer language models from Meta AI, publicly released July 23, 2024, superseding Llama 2; all results in the paper are for the Llama 3.1 release. The herd ships in three sizes (8B, 70B, and 405B parameters), with the flagship 405B model supporting a context window of up to 128K tokens. It is designed for multilinguality, coding, reasoning, and tool use, and is released in both pre-trained and post-trained variants under the Llama 3 Community License.
Capabilities
The post-trained 405B model scores 87.3 on MMLU (5-shot), 89.0 on HumanEval (0-shot), 96.8 on GSM8K (8-shot, CoT), 73.8 on MATH (0-shot, CoT), 96.9 on ARC Challenge (0-shot), and 51.1 on GPQA (0-shot, CoT). It answers questions in at least eight languages, generates and executes code across ten priority programming languages, and supports zero-shot and multi-step tool use including a search engine, Python interpreter, and Wolfram Alpha API. The 8B and 70B models are described as best-in-class at their parameter scales, outperforming comparable open models on virtually every benchmark category evaluated.
Evaluation methodology
Pre-trained models are assessed across eight benchmark categories (commonsense reasoning, knowledge, reading comprehension, math and reasoning, long context, code, adversarial, and aggregate), with scores reported alongside 95% confidence intervals computed as 1.96 × √(S(1−S)/N), where S is the observed benchmark score expressed as a fraction and N is the number of examples in the benchmark. A contamination analysis following Singh et al. (2024) estimates the extent to which pre-training data overlaps with evaluation sets, though the paper acknowledges that all contamination methods can suffer from false positives and negatives. Post-trained models are additionally evaluated via human comparisons against competing models. Adversarial benchmarks (Adversarial SQuAD, Dynabench SQuAD, GSM-Plus, PAWS) probe robustness and potential benchmark overfitting.
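The confidence-interval formula above can be illustrated with a short Python sketch. The function name and the example score and sample size are hypothetical, chosen only for illustration; S is taken as a fraction between 0 and 1 and N as the number of benchmark examples.

```python
import math


def ci95_half_width(score: float, n: int) -> float:
    """Half-width of the 95% interval, 1.96 * sqrt(S * (1 - S) / N).

    `score` is the benchmark score S as a fraction in [0, 1];
    `n` is the number of benchmark examples N.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a fraction between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.96 * math.sqrt(score * (1.0 - score) / n)


# Hypothetical example: 85% accuracy measured on 1,000 items.
half = ci95_half_width(0.85, 1000)
print(f"85.0 ± {100 * half:.1f}")  # prints "85.0 ± 2.2"
```

Because the width shrinks with the square root of N, a benchmark needs four times as many examples to halve its interval, which is why small evaluation sets produce wide intervals even at high scores.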
Safety testing
The paper states that "a detailed analysis of the safety of Llama 3" appears in Section 5.4, which is not included in the provided source text. No red-team scope, CBRN evaluations, autonomy thresholds, or specific catastrophic-risk findings are described in the available text. The paper notes the model delivers "a much better balance between helpfulness and harmlessness than its predecessor" and that safety mitigations are incorporated at the post-training stage.
Mitigations
Meta co-releases Llama Guard 3 for input and output safety classification alongside the language models. Pre-training data pipelines filter out domains likely to contain PII, content ranked as harmful under Meta safety standards, and known adult-content domains. A knowledge probing technique generates refusal training data by identifying questions that the model consistently answers incorrectly, aligning the model to "know what it knows" rather than hallucinate. Post-training safety mitigations are incorporated across multiple rounds of supervised finetuning and DPO.
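The knowledge probing idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the sampler and judge callables, the refusal text, the sample count, and the consistency threshold are all hypothetical placeholders.

```python
from typing import Callable, Iterable

# Hypothetical refusal text; the source does not specify the wording.
REFUSAL = "I'm not sure about that."


def build_refusal_examples(
    questions: Iterable[str],
    sample_answers: Callable[[str, int], list[str]],  # hypothetical model sampler
    is_correct: Callable[[str, str], bool],           # hypothetical correctness judge
    num_samples: int = 8,                              # assumed sample count
    max_correct_rate: float = 0.0,                     # assumed "consistently wrong" cutoff
) -> list[dict[str, str]]:
    """Return refusal training pairs for questions the model keeps getting wrong."""
    examples = []
    for question in questions:
        answers = sample_answers(question, num_samples)
        if not answers:
            continue
        correct_rate = sum(is_correct(question, a) for a in answers) / len(answers)
        # Keep only questions that are answered incorrectly across the samples.
        if correct_rate <= max_correct_rate:
            examples.append({"prompt": question, "response": REFUSAL})
    return examples


# Toy usage with stand-in functions instead of a real model and judge.
reference = {"Q1": "A", "Q2": "B"}
toy_sampler = lambda question, n: ["A"] * n          # the "model" always answers "A"
toy_judge = lambda question, answer: reference[question] == answer
print(build_refusal_examples(reference, toy_sampler, toy_judge))
```

In the toy run only Q2 yields a refusal pair, because every sampled answer to Q1 is correct. The threshold is the main design choice: a stricter cutoff produces fewer refusals and so risks less over-refusal on questions the model sometimes answers correctly.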
Deployment and access
All three model sizes (8B, 70B, 405B) are publicly released in both pre-trained and post-trained variants at llama.meta.com under the Llama 3 Community License. Multimodal extensions integrating image, video, and speech capabilities are described in the paper but are explicitly noted as "still under development and not yet ready for release." Core tool integrations (search, Python interpreter, Wolfram Alpha) can be individually enabled or disabled via the system prompt.
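As a rough illustration of toggling tools through the system prompt, the sketch below assembles a prompt string from three switches. The `Environment: ipython` and `Tools: ...` lines and the tool identifiers are assumptions modeled on Meta's publicly documented Llama 3.1 prompt format; the source does not specify them, so they should be checked against that documentation. The function name and default instruction text are hypothetical.

```python
def build_system_prompt(
    python: bool,
    search: bool,
    wolfram: bool,
    instructions: str = "You are a helpful assistant.",
) -> str:
    """Assemble a system prompt that switches built-in tools on or off.

    The header lines are an assumed layout, not taken from the source.
    """
    header = []
    if python:
        header.append("Environment: ipython")  # assumed switch for the Python interpreter
    tools = [
        name
        for name, enabled in (("brave_search", search), ("wolfram_alpha", wolfram))
        if enabled
    ]
    if tools:
        header.append("Tools: " + ", ".join(tools))  # assumed built-in tool identifiers
    if not header:
        return instructions
    return "\n".join(header + ["", instructions])


# Enable the interpreter and search, leave Wolfram Alpha off.
print(build_system_prompt(python=True, search=True, wolfram=False))
```

Keeping each tool behind its own switch mirrors the per-tool control described above: a deployment that should not browse the web can leave search off while keeping the interpreter available.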
Limitations
Adversarial performance on mathematical reasoning and question answering is substantially lower than non-adversarial performance across all three model sizes; post-training does not close this gap. Contamination analysis methodology is acknowledged as an open research problem susceptible to false positives and negatives. Annealing improvements on GSM8K and MATH are negligible for the 405B model, in contrast to meaningful gains for the 8B model. Multimodal models integrating vision and speech are explicitly flagged as not production-ready.
What's new
Llama 3.1 adds native multilingual support (eight languages), a 128K-token context window, and tool use capabilities not present in the April 2024 Llama 3 8B/70B releases. Pre-training data scales from 1.8T tokens (Llama 2) to 15.6T tokens, and flagship training compute reaches 3.8×10²⁵ FLOPs, almost 50× that of the largest Llama 2 model. A new tokenizer with a 128K-token vocabulary improves compression on English text from 3.17 to 3.94 characters per token relative to the Llama 2 tokenizer.
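The scale figures above can be sanity-checked with a few lines of arithmetic. The 6 × parameters × tokens rule used below is a common approximation for dense Transformer training compute and is an assumption here; the source does not say how the FLOPs figure was derived. Variable names are illustrative.

```python
# Figures quoted in the text above.
params = 405e9          # flagship model parameters
tokens = 15.6e12        # pre-training tokens
llama2_tokens = 1.8e12  # Llama 2 pre-training tokens

# Assumption: training FLOPs ~= 6 * N * D (standard dense-Transformer estimate).
approx_flops = 6 * params * tokens
print(f"approx. training compute: {approx_flops:.2e} FLOPs")  # 3.79e+25

# Growth in pre-training data relative to Llama 2.
print(f"data scale-up: {tokens / llama2_tokens:.1f}x")  # 8.7x

# Tokenizer compression: more characters per token means fewer tokens per text.
old_cpt, new_cpt = 3.17, 3.94
token_reduction = 1 - old_cpt / new_cpt
print(f"fewer tokens for the same English text: {token_reduction:.1%}")  # 19.5%
```

Under that approximation the estimate lands within about 1% of the reported 3.8×10²⁵ FLOPs, and the tokenizer change means the same English text needs roughly a fifth fewer tokens.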