Moderation API
What this is
Mistral AI's Content Moderation API is a standalone classification service released on November 7, 2024. It is the same system that powers moderation in Le Chat, now made available to external developers. The service is an LLM-based classifier that assigns text inputs to 9 policy categories.
Capabilities
The classifier offers two endpoints: one for raw text and one for conversational content. In conversational mode it classifies the last message within its conversational context. The model is natively multilingual, trained on Arabic, Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. Policy categories include model-generated harms such as unqualified advice and PII, among others defined in the technical documentation.
Evaluation methodology
Performance is reported as AUC PR across policy categories on an internal testset. The document references benchmark charts but provides no numeric scores in the text. No details on testset size, construction, or contamination controls are disclosed.
Safety testing
The card does not discuss red-team exercises, adversarial probing, or catastrophic-risk evaluations. No findings related to CBRN, cyber, or autonomy risks are disclosed.
Mitigations
The API itself is positioned as a system-level guardrail for downstream deployments. Users can tailor it to their specific applications and safety standards. No classifier thresholds, refusal-training details, or ASL/FSF tier designations are disclosed.
Deployment and access
The API is available to Mistral AI users via two documented endpoints. Full policy definitions and setup instructions are in the technical documentation at docs.mistral.ai. No license terms, pricing, or access restrictions are stated in this document.
Limitations
The document notes that "undesirable content is very specific to a given context" as a motivating design consideration, but lists no specific known failure modes, edge cases, or unsolved problems. No limitations are formally disclosed.
What's new
This is the initial public release of the Mistral Moderation API. No prior version or changelog is referenced.