Guardrailing Docs
What this is
This is Mistral AI's technical documentation for its moderation and guardrailing system, published April 23, 2026. It covers two deployment paths: Custom Guardrails, which embed moderation rules directly in API requests, and a standalone Moderation API that returns raw category scores. The document is a developer reference, not a safety card; it describes configuration and behavior rather than model training or evaluation outcomes.
Capabilities
The moderation service, powered by mistral-moderation-2603, classifies text across 11 named policy categories: Sexual, Hate and Discrimination, Violence and Threats, Dangerous, Criminal, Self-Harm, Health, Financial, Law, PII, and Jailbreaking. Two endpoints are available: one for raw text (single strings or small batches) and one for conversational content. Each category returns a continuous score on a 0–1 scale, and per-category thresholds are configurable. The document does not disclose context window size, parameter count, or benchmark scores for the moderation model.
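As an illustration of how per-category scores and configurable thresholds interact, here is a minimal local sketch in Python. It does not call the Mistral API. The function name flag_categories, the use of the display names as dictionary keys, the strict greater-than comparison, and the sample scores are all assumptions made for this example, not details taken from the documentation.

```python
from typing import Dict


def flag_categories(scores: Dict[str, float],
                    thresholds: Dict[str, float]) -> Dict[str, bool]:
    """Return a per-category flag: True when the score exceeds its threshold.

    Assumption: a strict '>' comparison. The source does not state the
    operator; with '>' a threshold of 1.0 can never be exceeded by a
    score on the 0-1 scale.
    """
    flags = {}
    for category, score in scores.items():
        # Assumption: a category without a configured threshold is not flagged.
        threshold = thresholds.get(category, 1.0)
        flags[category] = score > threshold
    return flags


# Hypothetical scores and thresholds, invented for illustration only.
scores = {"Violence and Threats": 0.82, "PII": 0.40, "Jailbreaking": 0.95}
thresholds = {"Violence and Threats": 0.5, "PII": 0.6, "Jailbreaking": 1.0}

flags = flag_categories(scores, thresholds)
print(flags)
```

Keeping the raw scores separate from the thresholding step mirrors the split the documentation describes between the Moderation API, which returns scores, and the policy decision, which the operator can tune.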
Evaluation methodology
Mistral states that policy thresholds are "determined based on the optimal performance of our internal test set." No further details about test set composition, size, contamination controls, or external evaluators are disclosed in this document.
Safety testing
The document does not discuss red-teaming, adversarial probing, or catastrophic-risk evaluations (CBRN, cyber, autonomy). The FAQ section poses a question about the distribution of false positives and false negatives, but the supplied text provides no answer. The document reports no safety findings or threshold-crossing incidents and contains no hedged language about residual risk.
Mitigations
Custom Guardrails apply input moderation only: they run before the request reaches the model. When a guardrail is triggered, the request is blocked and a 403 error is returned with per-category scores, thresholds, and violation flags. Operators configure per-category thresholds on a 0–1 scale (setting a category's threshold to 1 disables that category) and can set block_on_error: true to block requests when the moderation API itself fails. Multiple guardrail objects can be stacked per request; if any one of them is triggered, the request is blocked. Agent-level guardrails can be attached at agent creation time and are inherited by all subsequent conversations, with a per-request override available.
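The blocking rules described above can be sketched as a small local decision function in Python. This models the described behavior only and does not reproduce Mistral's request schema. The Guardrail and Decision classes, the evaluate function, the reason strings, and the use of a single shared score dictionary for all stacked guardrails are hypothetical simplifications. The source also does not say what status code is returned when the moderation call fails, so the sketch records only whether the request is blocked.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Guardrail:
    """Hypothetical stand-in for one guardrail object."""
    thresholds: Dict[str, float]
    block_on_error: bool = False


@dataclass
class Decision:
    blocked: bool
    reason: str
    violations: Dict[str, bool] = field(default_factory=dict)


def evaluate(guardrails: List[Guardrail],
             scores: Optional[Dict[str, float]]) -> Decision:
    """Decide whether an input is blocked before it reaches the model.

    scores is None when the moderation call itself failed (assumption:
    one moderation result is shared by all stacked guardrails).
    """
    for guardrail in guardrails:
        if scores is None:
            # Moderation failed: block only if this guardrail opts in.
            if guardrail.block_on_error:
                return Decision(True, "moderation_error")
            continue
        # Assumption: strict '>' so a threshold of 1.0 disables the category.
        violations = {
            category: scores.get(category, 0.0) > threshold
            for category, threshold in guardrail.thresholds.items()
        }
        if any(violations.values()):
            # Any single triggered guardrail blocks the request.
            return Decision(True, "guardrail_triggered", violations)
    return Decision(False, "allowed")


# Hypothetical stacked configuration and scores, for illustration only.
pii_guardrail = Guardrail({"PII": 0.5})
violence_guardrail = Guardrail({"Violence and Threats": 0.3}, block_on_error=True)
decision = evaluate(
    [pii_guardrail, violence_guardrail],
    {"PII": 0.2, "Violence and Threats": 0.9},
)
print(decision)
```

In this sketch the first triggered guardrail short-circuits the loop, which matches the "any single triggered guardrail blocks the request" rule without needing to evaluate the rest of the stack.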
Deployment and access
The moderation and guardrailing system is accessible via Mistral's API across three surfaces: POST /v1/chat/completions, POST /v1/conversations, and agent creation endpoints. mistral-moderation-2411 was deprecated on March 31, 2026; mistral-moderation-2603 is the current default. The document does not specify licensing terms, pricing, rate limits, or geographic restrictions.
Limitations
Mistral notes that "the policy threshold is determined based on the optimal performance of our internal test set" and that operators "can use the raw score or adjust the threshold according to your specific use cases," implying the defaults may not suit all deployments. The documentation explicitly warns: "We intend to continually improve the underlying model of the moderation endpoint. Custom policies that depend on category_scores can require recalibration." Guardrails apply to inputs only; output moderation is not covered by the Custom Guardrails path. No claims are made about recall, precision, or cross-lingual performance.
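One way an operator might recalibrate a custom threshold after a model update is to re-score a labeled validation set and pick the threshold that performs best on it. The Python sketch below assumes F1 as the selection metric and a strict greater-than comparison; the source names neither, and the function names and sample data are hypothetical.

```python
from typing import List, Optional, Sequence


def f1_at(threshold: float, scores: Sequence[float],
          labels: Sequence[bool]) -> float:
    """F1 score when inputs with score > threshold are treated as violations."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def recalibrate(scores: Sequence[float], labels: Sequence[bool],
                candidates: Optional[List[float]] = None) -> float:
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        # Grid of 0.00 .. 0.99; 1.0 is excluded because it disables the category.
        candidates = [i / 100 for i in range(100)]
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Hypothetical validation data: scores from the updated model, human labels.
new_scores = [0.1, 0.2, 0.8, 0.9]
labels = [False, False, True, True]
best = recalibrate(new_scores, labels)
print(best, f1_at(best, new_scores, labels))
```

A deployment that cares more about missed violations than false alarms could swap F1 for a recall-weighted metric; the grid search itself stays the same.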
What's new
The main version change documented is the deprecation of mistral-moderation-2411 on March 31, 2026, in favor of mistral-moderation-2603. The moderation_llm_v2 config type is the current guardrail configuration format; a legacy v1 format is still referenced in code examples. The document provides no further changelog or version-to-version comparison.