Model Cards / Mistral AI

Mistral Small 24B Instruct 2501 Model Card

model card1,740 words·8 min read·Apr 17, 2026·Source
Summary

Mistral Small 24B Instruct 2501 Model Card

A 371-word brief of a 1,740-word document. Published by Mistral AI. Version dated Apr 17, 2026.
01

What this is

Mistral-Small-24B-Instruct-2501, marketed as "Mistral Small 3," is a 24-billion-parameter instruction-fine-tuned language model released by Mistral AI in January 2025. It is built on the Mistral-Small-24B-Base-2501 base model and positioned as state-of-the-art in the sub-70B category. Its stated design goals are local deployment, fast conversational response, low-latency function calling, and fine-tuning for domain experts.

02

Capabilities

The model scores 0.663 on MMLU Pro 5-shot CoT, 0.453 on GPQA Main 5-shot CoT, 0.848 on HumanEval pass@1, and 0.706 on MATH instruct. Instruction-following benchmarks show 8.35 on MT-Bench, 52.27 on WildBench, and 0.873 on Arena Hard. The model supports a 32k context window, eleven or more languages including English, French, German, Spanish, Italian, Chinese, Japanese, and Korean, and provides native function calling and JSON output via a Tekken tokenizer with a 131k vocabulary.

03

Evaluation methodology

Public benchmarks were run through a single internal evaluation pipeline; Mistral notes that numbers "may vary slightly from previously reported performance" for comparison models. Judge-based evals — WildBench, Arena Hard, and MT-Bench — used GPT-4o-2024-05-13 as the judge. Human preference evaluations were conducted side-by-side with an external third-party vendor on over 1,000 proprietary coding and generalist prompts, with model identity anonymized; Mistral states it took "extra caution in verifying a fair evaluation" and is "confident that the above benchmarks are valid."

04

Safety testing

Not disclosed in this document.

05

Mitigations

Not disclosed in this document.

06

Deployment and access

The model is released under the Apache 2.0 license, permitting commercial and non-commercial use and modification. It is available via vLLM (requiring approximately 55 GB GPU RAM in bf16 or fp16), Hugging Face Transformers, and Ollama for local inference. Quantized versions (4-bit, 8-bit) can run on a single RTX 4090 or a 32 GB RAM MacBook. The card notes that enterprises needing extended context, specific modalities, or domain knowledge will be served by separate commercial models beyond this open release.

07

Limitations

The model's knowledge base has a cutoff of October 1, 2023, as stated in the recommended system prompt. The card includes no explicit limitations section and does not discuss failure modes, bias evaluations, or out-of-distribution behavior.

08

What's new

This is the January 2025 release of Mistral Small, designated "Mistral Small 3 (2501)," superseding prior Mistral Small releases. No version changelog or explicit list of changes from a prior version is included in the card.

Generated by Claude sonnet from the cleaned source on Apr 23, 2026. Passages in double quotes are verbatim from the source; other text is neutral paraphrase. For citation, use the original: original document · source SHA da1a4bbc9b6d.

Extracted Evaluations(44 results)

Sort by:44 evals
BenchmarkCategoryStateScoreVariantSource
HumanEvalcodingscored90.9pass@1self-reported
HumanEvalcodingscored89.0pass@1self-reported
HumanEvalcodingscored85.4pass@1self-reported
HumanEvalcodingscored84.8pass@1self-reported
HumanEvalcodingscored73.2pass@1self-reported
MMLU-Progeneral_knowledgescored68.35-shot CoTself-reported
MMLU-Progeneral_knowledgescored66.65-shot CoTself-reported
MMLU-Progeneral_knowledgescored66.35-shot CoTself-reported
MMLU-Progeneral_knowledgescored61.75-shot CoTself-reported
MMLU-Progeneral_knowledgescored53.65-shot CoTself-reported
MATHmathscored81.9-self-reported
MATHmathscored76.1-self-reported
MATHmathscored74.3-self-reported
MATHmathscored70.6-self-reported
MATHmathscored53.5-self-reported
WildBenchotherscored56.1-self-reported
WildBenchotherscored52.7-self-reported
WildBenchotherscored52.3-self-reported
WildBenchotherscored50.0-self-reported
WildBenchotherscored48.2-self-reported
MT-Benchotherscored8.3-self-reported
MT-Benchotherscored8.3-self-reported
MT-Benchotherscored8.3-self-reported
MT-Benchotherscored8.0-self-reported
MT-Benchotherscored7.9-self-reported
Arena Hardotherscored0.9-self-reported
Arena Hardotherscored0.9-self-reported
Arena Hardotherscored0.9-self-reported
Arena Hardotherscored0.8-self-reported
Arena Hardotherscored0.8-self-reported
Human Evaluation (vs Gemma-2-27B)otherscored0.5side-by-sideself-reported
Human Evaluation (vs Qwen-2.5-32B)otherscored0.5side-by-sideself-reported
Human Evaluation (vs GPT-4o-mini)otherscored0.2side-by-sideself-reported
Human Evaluation (vs Llama-3.3-70B)otherscored0.2side-by-sideself-reported
IFEvalreasoningscored88.3-self-reported
IFEvalreasoningscored85.0-self-reported
IFEvalreasoningscored84.0-self-reported
IFEvalreasoningscored82.9-self-reported
IFEvalreasoningscored80.7-self-reported
GPQAreasoningscored53.15-shot CoTself-reported
GPQAreasoningscored45.35-shot CoTself-reported
GPQAreasoningscored40.45-shot CoTself-reported
GPQAreasoningscored37.75-shot CoTself-reported
GPQAreasoningscored34.45-shot CoTself-reported