Llama Guard Model Card

Summary

A 413-word brief of an 805-word document. Published by Meta AI. Version dated Mar 31, 2026.
01

What this is

Llama Guard is a 7B-parameter model from Meta AI, built on Llama 2 and designed to classify content in both LLM inputs and LLM outputs as safe or unsafe. It is itself an LLM: it generates text that states a safety verdict and, when the verdict is unsafe, lists the violated subcategories from a defined taxonomy. The card carries a version date of 2026-03-31 and names no prior version.
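Because the verdict arrives as generated text, a caller has to parse it before acting on it. The following is a minimal sketch of such a parser, not the card's own code. It assumes a two-line layout (a first line reading "safe" or "unsafe", then an optional comma-separated line of category codes) and uses hypothetical codes such as "O3"; neither the layout nor the codes are specified in this brief.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    safe: bool
    categories: list = field(default_factory=list)


def parse_verdict(generated: str) -> Verdict:
    """Parse generated verdict text (assumed layout, see lead-in)."""
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in generated.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")

    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label == "unsafe":
        # Assumption: violated categories follow on the next line.
        cats = []
        if len(lines) > 1:
            cats = [c.strip() for c in lines[1].split(",") if c.strip()]
        return Verdict(safe=False, categories=cats)

    # Anything else is treated as a malformed generation.
    raise ValueError(f"unexpected verdict line: {lines[0]!r}")
```

Raising on an unrecognized first line, rather than defaulting to "safe", is a deliberate choice in this sketch: a malformed generation should not silently pass content through.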

02

Capabilities

Llama Guard produces a probability score for the "unsafe" class, allowing operators to set their own binary threshold. On Meta's two in-house test sets it scores 0.945 (prompt classification) and 0.953 (response classification), outperforming the OpenAI moderation API (0.764 / 0.769) and Perspective API (0.728 / 0.699) on the same sets. On ToxicChat it scores 0.626, again ahead of both comparators, but it trails the OpenAI API slightly on the public OpenAI Moderation benchmark (0.847 vs. 0.856). The card does not disclose a context window size.
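Leaving the threshold to the operator means trading precision against recall. The sketch below shows that trade-off on a handful of toy scores and labels, which are invented for illustration and are not taken from the card; the function name `precision_recall` is likewise hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the 'unsafe' class at a given threshold.

    scores: per-example "unsafe" probabilities.
    labels: True where the example is actually unsafe.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    # Convention for this sketch: report 1.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Toy data for illustration only (not results from the card).
scores = [0.95, 0.80, 0.40, 0.10]
labels = [True, False, True, False]

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 removes the false positive but also misses one of the two unsafe examples, which is the kind of trade-off an operator would weigh on their own traffic.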

03

Evaluation methodology

The model is evaluated against the OpenAI Content Moderation API, Azure Content Safety, and Google's Perspective API across two public benchmarks (ToxicChat and OpenAI Moderation) and two in-house test sets covering prompt and response classification. The card notes that cross-system comparisons are "not exactly apples-to-apples due to mismatches in each taxonomy." No contamination controls or held-out split details are described.

04

Safety testing

The card does not discuss red-teaming or catastrophic-risk evaluations of Llama Guard itself. The model is positioned as a safety tool rather than a subject of safety evaluation; no CBRN, cyber, or autonomy assessments are mentioned.

05

Mitigations

Llama Guard enforces a six-category harm taxonomy covering Violence & Hate, Sexual Content, Guns & Illegal Weapons, Regulated or Controlled Substances, Suicide & Self Harm, and Criminal Planning. Operators derive binary safe/unsafe decisions by turning the model's first-token output into an "unsafe" probability and applying a threshold of their choosing. No additional classifier layers, rate limits, or ASL/FSF tier designations are described.
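One way to turn a first-token output into a binary decision is sketched below. It assumes the caller already has the first-token logits for the "safe" and "unsafe" tokens and renormalizes over just those two; the card, as summarized here, does not spell out the exact computation, so both function names and the two-way softmax are assumptions.

```python
import math


def unsafe_probability(logit_safe: float, logit_unsafe: float) -> float:
    """Two-way softmax over the first-token logits (assumed approach)."""
    # Subtract the max logit for numerical stability.
    m = max(logit_safe, logit_unsafe)
    e_safe = math.exp(logit_safe - m)
    e_unsafe = math.exp(logit_unsafe - m)
    return e_unsafe / (e_safe + e_unsafe)


def is_unsafe(logit_safe: float, logit_unsafe: float,
              threshold: float = 0.5) -> bool:
    """Binary decision at an operator-chosen threshold (0.5 is arbitrary)."""
    return unsafe_probability(logit_safe, logit_unsafe) >= threshold
```

A stricter deployment might lower the threshold to flag more content, and a more permissive one might raise it; the default of 0.5 here is only a placeholder.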

06

Deployment and access

The model and its taxonomy are released openly as part of Meta's PurpleLlama project. The card does not specify a license, API endpoint, or access restrictions. The released taxonomy is described as intended for community use and explicitly noted as not necessarily reflecting Meta's internal policies.

07

Limitations

Cross-system benchmark comparisons are acknowledged as imprecise due to taxonomy mismatches. The released taxonomy "does not necessarily reflect Meta's own internal policies" and is framed as a demonstration rather than a production policy artifact. The card does not address out-of-distribution performance, multilingual coverage, or failure modes.

08

What's new

The card does not describe version deltas or changelog entries. This appears to be the initial public release of Llama Guard.

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original: original document · source SHA b700f9d6de22.