Model Cards / Mistral AI

Moderation API

constitution313 words·1 min read·Apr 8, 2026·Source
Summary

Moderation API

A 293-word brief of a 313-word document. Published by Mistral AI. Version dated Apr 8, 2026.
01

What this is

Mistral AI's Content Moderation API is a standalone classification service released on November 7, 2024. It is the same system that powers moderation in Le Chat, now made available to external developers. The service is an LLM-based classifier that assigns text inputs to 9 policy categories.

02

Capabilities

The classifier offers two endpoints: one for raw text and one for conversational content. In conversational mode it classifies the last message within its conversational context. The model is natively multilingual, trained on Arabic, Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. Policy categories include model-generated harms such as unqualified advice and PII, among others defined in the technical documentation.

03

Evaluation methodology

Performance is reported as AUC PR across policy categories on an internal testset. The document references benchmark charts but provides no numeric scores in the text. No details on testset size, construction, or contamination controls are disclosed.

04

Safety testing

The card does not discuss red-team exercises, adversarial probing, or catastrophic-risk evaluations. No findings related to CBRN, cyber, or autonomy risks are disclosed.

05

Mitigations

The API itself is positioned as a system-level guardrail for downstream deployments. Users can tailor it to their specific applications and safety standards. No classifier thresholds, refusal-training details, or ASL/FSF tier designations are disclosed.

06

Deployment and access

The API is available to Mistral AI users via two documented endpoints. Full policy definitions and setup instructions are in the technical documentation at docs.mistral.ai. No license terms, pricing, or access restrictions are stated in this document.

07

Limitations

The document notes that "undesirable content is very specific to a given context" as a motivating design consideration, but lists no specific known failure modes, edge cases, or unsolved problems. No limitations are formally disclosed.

08

What's new

This is the initial public release of the Mistral Moderation API. No prior version or changelog is referenced.

Generated by Claude sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original: original document · source SHA 41a9fd11abc4.