Purple Llama (Safety Tools)
What this is
Purple Llama is an umbrella project from Meta AI, with a version date of April 8, 2026, designed to aggregate tools and evaluations that help the community build responsibly with open generative AI models. The initial release focuses on cybersecurity evaluations and input/output safeguards, with additional components planned. The project name invokes the cybersecurity concept of "purple teaming" — combining red-team (offensive) and blue-team (defensive) responsibilities.
Capabilities
Purple Llama bundles three safeguard components — Llama Guard 3, Prompt Guard, and Code Shield — alongside a versioned cybersecurity benchmark suite (CyberSec Eval v1 through v3). Llama Guard 3 supports 7 languages, a 128k context window, and image reasoning, built by fine-tuning Meta-Llama 3.1 and 3.2 models against the MLCommons standard hazards taxonomy. CyberSec Eval 3 adds visual prompt injection tests, spear phishing capability tests, and autonomous offensive cyber operations tests.
Evaluation methodology
CyberSec Eval benchmarks are based on industry standards including CWE and MITRE ATT&CK, and were built in collaboration with Meta's security subject matter experts. The suite measures insecure code suggestion frequency, compliance with malicious requests, code interpreter abuse propensity, offensive cybersecurity capabilities, and susceptibility to prompt injection. A public leaderboard is hosted on Hugging Face for CyberSec Eval results.
Safety testing
CyberSec Eval v1, described as "what we believe was the first industry-wide set of cybersecurity safety evaluations for LLMs," tested models for insecure code recommendations and willingness to comply with malicious requests. Meta states that "initial results show that there are meaningful cybersecurity risks for LLMs, both with recommending insecure code and for complying with malicious requests." CyberSec Eval 3 extends testing to visual prompt injection, spear phishing generation, and autonomous offensive cyber operations.
Mitigations
Llama Guard 3 provides inference-time input and output moderation across a range of hazard categories, including detection of helpful cyberattack responses and malicious code output in code-interpreter environments. Prompt Guard targets prompt injection and jailbreak attacks — inputs that exploit untrusted third-party data or attempt to override built-in safety features. Code Shield adds inference-time filtering of insecure code suggestions, code interpreter abuse prevention, and secure command execution.
Deployment and access
Evals and benchmarks are released under the MIT license; safeguard models are released under the corresponding Llama Community licenses (Llama 2, Llama 3, or Llama 3.2 depending on the component). Models are available on Hugging Face under the meta-Llama organization. Integration resources are available in the Llama-recipes GitHub repository as part of the Llama reference system.
Limitations
Meta states "we believe these tools will reduce the frequency of LLMs suggesting insecure AI-generated code," but does not quantify the reduction or provide controlled before/after data. The document does not discuss false-positive or false-negative rates for Llama Guard or Prompt Guard. No limitations on coverage gaps, adversarial robustness, or non-English language performance are disclosed.
What's new
CyberSec Eval 3 is described as "newly released" and adds three test suites not present in prior versions: visual prompt injection, spear phishing capability, and autonomous offensive cyber operations. Llama Guard 3 extended prior Llama Guard versions with multimodal (vision) support, expanded language coverage, and a 128k context window. The document does not provide a dated changelog for earlier Purple Llama releases.