Llama Guard Paper

Summary

A 574-word brief of a 7,151-word document. Published by Meta AI. Version dated Mar 31, 2026.
01

What this is

Llama Guard is an LLM-based input-output safeguard model for human-AI conversations, introduced by Meta AI (GenAI at Meta) on December 13, 2023. It is built on Llama2-7b via instruction fine-tuning on a purpose-collected, human-annotated dataset. The model is designed to classify both user prompts and AI-generated responses against a defined safety risk taxonomy, producing binary safe/unsafe decisions with optional category labels.
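
A minimal sketch of how a caller might turn such a decision into a structured result. The text layout assumed here (a first line reading `safe` or `unsafe`, optionally followed by a line of comma-separated category codes) is an assumption for illustration, as are the `Verdict` type, the `parse_verdict` helper, and the `O1`/`O3` codes; this summary only states that the output is a binary decision with optional category labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verdict:
    unsafe: bool
    categories: List[str] = field(default_factory=list)


def parse_verdict(text: str) -> Verdict:
    """Parse a classifier completion in the assumed two-line layout."""
    # Keep only non-empty lines so trailing newlines do not matter.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")

    head = lines[0].lower()
    if head == "safe":
        return Verdict(unsafe=False)
    if head != "unsafe":
        raise ValueError(f"unexpected verdict: {lines[0]!r}")

    # Category labels are optional; an unsafe verdict may come without them.
    categories: List[str] = []
    if len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return Verdict(unsafe=True, categories=categories)
```

Raising on anything other than `safe` or `unsafe` is a deliberate choice in this sketch: a safeguard that silently treats malformed output as safe would fail open.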

02

Capabilities

Llama Guard achieves an AUPRC of 0.945 on its own prompt test set and 0.953 on its own response test set, outperforming the OpenAI Moderation API (0.764 / 0.769) and the Perspective API (0.728 / 0.699) on the same data. In zero-shot adaptation, it scores 0.847 AUPRC on the OpenAI Moderation Evaluation Dataset (1,680 prompts) and 0.626 on ToxicChat, surpassing all baselines on the latter. The model supports multi-class and binary classification, zero-shot and few-shot prompting with novel taxonomies, and customizable output formats via instruction fine-tuning. It processes sequences up to 4,096 tokens.
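
For readers unfamiliar with the metric, AUPRC (area under the precision-recall curve) can be estimated as average precision: the mean of the precision values at each rank where a true positive appears. The sketch below is one common estimator, not necessarily the exact procedure used in the paper; the function name and the toy labels and scores are illustrative only.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise average precision for binary labels (1 = unsafe, 0 = safe)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank examples from highest to lowest score. sorted() is stable, so tied
    # scores keep their input order; threshold-grouped implementations may
    # give slightly different values when ties are present.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])

    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Toy data: positives are ranked 1st and 3rd, so AP = (1/1 + 2/3) / 2.
print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 4))  # 0.8333
```

Because the metric depends only on the ranking of scores, it needs no decision threshold, which is why it suits models that expose a probability for the unsafe class.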

03

Evaluation methodology

The authors evaluate on three datasets: an internal annotated test set (a 3:1 train/test split from 13,997 examples), the OpenAI Moderation Evaluation Dataset, and ToxicChat (10,000 samples). The primary metric is AUPRC; precision, recall, and F1 at a 0.5 threshold are reported for baselines lacking probability outputs (Azure API, GPT-4). Three comparison protocols are used: overall binary classification (max-over-categories), per-category 1-vs-all, and per-category 1-vs-benign. The last is applied to fixed-output baseline APIs in the off-policy setting to avoid unfairly penalizing them for category misalignment.
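
The three protocols differ mainly in which examples count as positives and negatives. The sketch below shows one way to build the label and score lists for each, assuming every example carries a set of violated categories (empty for benign) and a per-category score. The record layout, field names, and function names are hypothetical, and how each baseline's scores are obtained per task is not specified in this summary.

```python
from typing import Dict, List, Set, Tuple, TypedDict


class Example(TypedDict):
    violated: Set[str]        # empty set means the example is benign
    scores: Dict[str, float]  # hypothetical per-category unsafe scores


Pairs = Tuple[List[int], List[float]]


def overall_binary(examples: List[Example]) -> Pairs:
    """Unsafe if any category is violated; score is the max over categories."""
    labels = [1 if ex["violated"] else 0 for ex in examples]
    scores = [max(ex["scores"].values()) for ex in examples]
    return labels, scores


def one_vs_all(examples: List[Example], cat: str) -> Pairs:
    """Positives violate `cat`; everything else, benign or not, is negative."""
    labels = [1 if cat in ex["violated"] else 0 for ex in examples]
    scores = [ex["scores"][cat] for ex in examples]
    return labels, scores


def one_vs_benign(examples: List[Example], cat: str) -> Pairs:
    """Positives violate `cat`; only benign examples serve as negatives.

    Examples that violate other categories but not `cat` are dropped.
    """
    kept = [ex for ex in examples if cat in ex["violated"] or not ex["violated"]]
    labels = [1 if cat in ex["violated"] else 0 for ex in kept]
    scores = [ex["scores"][cat] for ex in kept]
    return labels, scores
```

Dropping the other-category examples in `one_vs_benign` is what makes it the more lenient protocol: a classifier with a different taxonomy is not counted as wrong for flagging content that is unsafe under some other category.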

04

Safety testing

External red teaming contractors evaluated Llama Guard after release preparation; the paper states "the outcome of this exercise did not point us to additional risks beyond those of the pretrained Llama2-7b model." In-house red teamers annotated the fine-tuning dataset, but this process is described as separate from red teaming of the production model. No CBRN, cyber, or autonomy-specific evaluations are conducted; the paper does not address catastrophic-risk thresholds, as Llama Guard is a classifier tool rather than a frontier generative model.

05

Mitigations

Llama Guard itself is deployed as the mitigation layer, covering six taxonomy categories: Violence & Hate, Sexual Content, Guns & Illegal Weapons, Regulated or Controlled Substances, Suicide & Self-Harm, and Criminal Planning. Two data augmentation techniques are used during training to ensure the model respects only the categories included in the input prompt: random category dropout and category-masking with label flipping to safe. Category indices are shuffled across training examples to prevent format memorization. No external deployment access controls, classifier confidence thresholds, or ASL/FSF tier designations are described.
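
The two augmentations and the index shuffling can be sketched as simple transformations on a training example's category list and label. This is an illustrative reading of the description above, not the authors' code: the function names, the uniform choice of how many categories to drop, and the 1-based index convention are all assumptions.

```python
import random
from typing import List, Set, Tuple


def drop_unviolated(categories: List[str], violated: Set[str],
                    rng: random.Random) -> Tuple[List[str], List[str]]:
    """Random category dropout: remove some categories the example does not violate.

    The label is unchanged because every violated category stays in the prompt.
    """
    droppable = [c for c in categories if c not in violated]
    k = rng.randint(0, len(droppable))  # assumed: uniform over 0..len(droppable)
    dropped = set(rng.sample(droppable, k))
    kept = [c for c in categories if c not in dropped]
    return kept, sorted(violated)


def mask_violated(categories: List[str],
                  violated: Set[str]) -> Tuple[List[str], List[str]]:
    """Category masking: remove every violated category and flip the label to safe."""
    kept = [c for c in categories if c not in violated]
    return kept, []  # empty list of violated categories means "safe"


def shuffle_indices(categories: List[str], violated: Set[str],
                    rng: random.Random) -> Tuple[List[str], List[int]]:
    """Shuffle category order and re-express the label as 1-based prompt indices."""
    order = list(categories)
    rng.shuffle(order)
    index = {c: i + 1 for i, c in enumerate(order)}
    return order, sorted(index[c] for c in violated if c in index)
```

A brief usage example with the six taxonomy categories named above:

```python
TAXONOMY = [
    "Violence & Hate", "Sexual Content", "Guns & Illegal Weapons",
    "Regulated or Controlled Substances", "Suicide & Self-Harm",
    "Criminal Planning",
]
rng = random.Random(0)
kept, label = drop_unviolated(TAXONOMY, {"Criminal Planning"}, rng)
order, target = shuffle_indices(kept, set(label), rng)
```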

06

Deployment and access

Model weights are publicly released; code is available at the PurpleLlama GitHub repository under Meta's Purple Llama initiative. The paper encourages researchers to fine-tune and adapt the weights freely without dependence on paid APIs. No specific license terms, rate limits, or restricted-use conditions are stated in the document.

07

Limitations

The paper flags four limitations: common-sense knowledge is bounded by pretraining data; training data is primarily English, so performance on other languages is not guaranteed; policy coverage is acknowledged as incomplete ("we don't claim that we have perfect coverage of our policy"); and the model is susceptible to prompt injection attacks that could alter or bypass intended classification behavior. The paper also notes that, when prompted as a chat model rather than a classifier, Llama Guard may generate unsafe language because it has not undergone safety fine-tuning for chat use cases.

08

What's new

This is the original Llama Guard publication; no prior version exists to diff against. The paper introduces the model, taxonomy, dataset, and training methodology for the first time. No changelog or version delta entries are present in the document.

Generated by Claude sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original: original document · source SHA 02d0bca96e27.