Guardrailing Docs

Usage policy · 1,807 words · 8 min read · Apr 23, 2026

A 512-word brief of a 1,807-word document. Published by Mistral AI. Version dated Apr 23, 2026.
01

What this is

This is Mistral AI's technical documentation for its moderation and guardrailing system, published April 23, 2026. It covers two deployment paths: Custom Guardrails, which embed moderation rules directly in API requests, and a standalone Moderation API that returns raw category scores. The document is a developer reference, not a safety card; it describes configuration and behavior rather than model training or evaluation outcomes.

02

Capabilities

The moderation service, powered by mistral-moderation-2603, classifies text across 11 named policy categories: Sexual, Hate and Discrimination, Violence and Threats, Dangerous, Criminal, Self-Harm, Health, Financial, Law, PII, and Jailbreaking. Two endpoints are available: one for raw text (single strings or small batches) and one for conversational content. Each category returns a continuous score on a 0–1 scale, and per-category thresholds are configurable. The document does not disclose context window size, parameter count, or benchmark scores for the moderation model.
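As an illustration of how configurable per-category thresholds turn continuous scores into flags, here is a minimal sketch in plain Python. It does not call Mistral's API. The function name `flag_categories`, the fallback threshold of 0.5, the `>=` comparison, and the use of the display names from the text as dictionary keys are all assumptions made for this example; the source does not state the default thresholds or the API's field names.

```python
from typing import Dict, List

# The 11 policy categories named in the text. Using the display names as
# keys is an assumption; the real API may use different field names.
CATEGORIES = [
    "Sexual", "Hate and Discrimination", "Violence and Threats", "Dangerous",
    "Criminal", "Self-Harm", "Health", "Financial", "Law", "PII", "Jailbreaking",
]


def flag_categories(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,  # hypothetical fallback, not from the source
) -> List[str]:
    """Return the sorted categories whose score meets or exceeds its threshold."""
    flagged = []
    for category, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {category!r} is outside the 0-1 scale")
        threshold = thresholds.get(category, default_threshold)
        if score >= threshold:  # assumed comparison rule
            flagged.append(category)
    return sorted(flagged)


example_scores = {"Sexual": 0.02, "PII": 0.81, "Jailbreaking": 0.40}
example_thresholds = {"PII": 0.7, "Jailbreaking": 0.3}
flagged = flag_categories(example_scores, example_thresholds)
```

With these example values, `PII` and `Jailbreaking` are flagged because their scores reach the custom thresholds, while `Sexual` stays below the assumed fallback.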

03

Evaluation methodology

Mistral states that policy thresholds are "determined based on the optimal performance of our internal test set." No further details about test set composition, size, contamination controls, or external evaluators are disclosed in this document.

04

Safety testing

The document does not discuss red-teaming, adversarial probing, or catastrophic-risk evaluations (CBRN, cyber, autonomy). The FAQ section raises the distribution of false positives and false negatives as a question but provides no answer in the supplied text. No safety findings, threshold-crossing incidents, or hedging language about residual risk appear in this document.

05

Mitigations

Custom Guardrails apply input moderation only: they run before the request reaches the model. When a guardrail is triggered, the request is blocked and a 403 error is returned with per-category scores, thresholds, and violation flags. Operators configure per-category thresholds (0–1 scale; setting a category to 1 disables it) and can set block_on_error: true to block requests when the moderation API itself fails. Multiple guardrail objects can be stacked per request; any single triggered guardrail blocks the request. Agent-level guardrails can be attached at agent creation time and are inherited by all subsequent conversations, with per-request override available.
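The blocking rules described above can be modeled locally as a small decision function. This is a sketch of the described behavior, not Mistral's implementation or request format: the `Guardrail` and `Decision` classes, the `evaluate` function, the `reason` strings, and the injected `moderate` callable are hypothetical. Two further assumptions are labeled in the comments: a score triggers when it is at or above the threshold, and a moderation failure lets the request through unless some guardrail sets `block_on_error`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Guardrail:
    """Hypothetical local model of one guardrail object."""
    thresholds: Dict[str, float]
    block_on_error: bool = False


@dataclass
class Decision:
    blocked: bool
    reason: str
    violations: List[Dict[str, bool]] = field(default_factory=list)


def evaluate(
    text: str,
    guardrails: List[Guardrail],
    moderate: Callable[[str], Dict[str, float]],
) -> Decision:
    """Decide whether an input is blocked before it would reach the model."""
    try:
        scores = moderate(text)
    except Exception:
        # Assumption: fail open unless a guardrail asks to block on errors.
        if any(g.block_on_error for g in guardrails):
            return Decision(True, "moderation_error")
        return Decision(False, "moderation_error_ignored")

    violations = []
    for g in guardrails:
        # A threshold of 1 disables the category; otherwise score >= threshold
        # counts as a violation (assumed comparison rule).
        flags = {
            cat: thr < 1.0 and scores.get(cat, 0.0) >= thr
            for cat, thr in g.thresholds.items()
        }
        violations.append(flags)

    # Any single triggered guardrail blocks the request.
    blocked = any(any(flags.values()) for flags in violations)
    return Decision(blocked, "guardrail_triggered" if blocked else "allowed", violations)
```

In this model, a blocked decision with reason `guardrail_triggered` corresponds to the case where the documentation says a 403 error is returned, and the `violations` list plays the role of the per-guardrail violation flags.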

06

Deployment and access

The moderation and guardrailing system is accessible via Mistral's API across three surfaces: POST /v1/chat/completions, POST /v1/conversations, and agent creation endpoints. The mistral-moderation-2411 model was deprecated on March 31, 2026; mistral-moderation-2603 is the current default. The document does not specify licensing terms, pricing, rate limits, or geographic restrictions.

07

Limitations

Mistral notes that "the policy threshold is determined based on the optimal performance of our internal test set" and that operators "can use the raw score or adjust the threshold according to your specific use cases," implying the defaults may not suit all deployments. The documentation explicitly warns: "We intend to continually improve the underlying model of the moderation endpoint. Custom policies that depend on category_scores can require recalibration." Guardrails apply to inputs only; output moderation is not covered by the Custom Guardrails path. No claims are made about recall, precision, or cross-lingual performance.
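The recalibration warning can be made concrete with a small threshold search over an operator's own labeled samples. This is a minimal sketch under stated assumptions: the source does not say which metric defines "optimal performance", so F1 is used here purely as an example, and the function names `f1_at` and `recalibrate` are hypothetical. The scores would come from the moderation endpoint for one category; the labels are the operator's own judgments.

```python
from typing import Sequence, Tuple


def f1_at(threshold: float, scores: Sequence[float], labels: Sequence[bool]) -> float:
    """F1 of the rule 'flag when score >= threshold' against the labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def recalibrate(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Pick the observed score that maximizes F1 when used as the threshold.

    Ties are broken in favor of the higher threshold (an arbitrary choice).
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: (f1_at(t, scores, labels), t))
    return best, f1_at(best, scores, labels)


# Hypothetical validation data for a single category.
sample_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
sample_labels = [False, False, True, True, True]
best_threshold, best_f1 = recalibrate(sample_scores, sample_labels)
```

Rerunning a search like this after a moderation model update would show whether a previously chosen threshold still separates the operator's positive and negative examples in the same way.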

08

What's new

The primary version change documented is the deprecation of mistral-moderation-2411 on March 31, 2026, and its replacement by mistral-moderation-2603. The moderation_llm_v2 config type is the current guardrail configuration format, with a v1 legacy format still referenced in code examples. No further changelog or version delta information is provided in this document.

Generated by Claude Sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original: original document · source SHA bc9e00babf1b.