Voxtral TTS is an open-weights, 4-billion-parameter text-to-speech (TTS) model released by Mistral AI in March 2026. It is designed for expressive, natural-sounding speech synthesis and serves as the generative counterpart to Mistral’s earlier “Voxtral” speech understanding models. Its relatively small footprint targets fast, low-latency performance in enterprise applications such as voice agents, local inference, and production pipelines.
Architecture: A Hybrid Approach
Voxtral TTS owes its naturalness and latency profile to a hybrid architecture that combines two distinct generative paradigms:
- Auto-regressive Generation for Semantic Tokens: The model first generates semantic speech tokens auto-regressively. This step is responsible for linguistic comprehension, prosody planning, and capturing the emotional and semantic context of the input text.
- Flow-matching for Acoustic Tokens: After semantic generation, a flow-matching model translates the semantic representations into high-fidelity acoustic tokens. Flow matching supports highly parallelized inference in a single pass or a few integration steps, avoiding the many iterative denoising steps of standard diffusion models.
Figure: Text Input → Text Encoder → Auto-regressive Semantic Token Generation → Flow-matching Acoustic Decoder → Waveform Generation → 24 kHz High-Fidelity Audio Output
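The two stages described above can be illustrated with a toy sketch in NumPy. Nothing here is Voxtral TTS code: the function names, the random lookup tables standing in for trained networks, and the closed-form velocity field are all assumptions chosen for illustration. Stage one is a greedy auto-regressive loop; stage two integrates a flow-matching ODE with a handful of Euler steps, updating every frame at once.

```python
import numpy as np

VOCAB_SIZE = 16  # hypothetical semantic vocabulary size


def generate_semantic_tokens(text_ids, max_len=8, seed=0):
    """Toy auto-regressive stage: each token is predicted from the previous one."""
    rng = np.random.default_rng(seed)
    # Random table standing in for a trained decoder's next-token logits.
    next_token_logits = rng.normal(size=(VOCAB_SIZE, VOCAB_SIZE))
    tokens = [sum(text_ids) % VOCAB_SIZE]  # crude conditioning on the text
    while len(tokens) < max_len:
        # Greedy decoding: pick the most likely token given the last one.
        tokens.append(int(np.argmax(next_token_logits[tokens[-1]])))
    return tokens


def flow_matching_decode(tokens, dim=4, num_steps=4, seed=1):
    """Toy flow-matching stage: few-step Euler integration from noise to targets."""
    rng = np.random.default_rng(seed)
    # Random codebook standing in for the acoustic frame each token maps to.
    codebook = rng.normal(size=(VOCAB_SIZE, dim))
    targets = codebook[np.asarray(tokens)]  # shape (frames, dim)
    x = rng.normal(size=targets.shape)      # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        # Velocity of the straight-line path x_t = (1 - t) * noise + t * target.
        velocity = (targets - x) / (1.0 - t)
        x = x + dt * velocity               # all frames are updated together
    return x, targets


tokens = generate_semantic_tokens(text_ids=[3, 1, 4, 1, 5])
frames, targets = flow_matching_decode(tokens)
print(tokens)
print(np.allclose(frames, targets))
```

Because the toy velocity field follows straight lines, Euler integration reaches the targets in only a few steps. In a real model the velocity would come from a trained network conditioned on the semantic tokens, so the step count becomes a quality and latency trade-off.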
Key Features and Capabilities
- Multilingual Support: Voxtral TTS natively supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
- Zero-shot Voice Cloning: The model can clone a target speaker’s voice without fine-tuning, using as little as a 3-second reference audio clip.
- Built-in Personas: For users who do not require voice cloning, the model includes 20 pre-built, highly expressive voice presets.
- Low Latency: The model is engineered for real-time and interactive applications, and its flow-matching component helps keep Time-To-First-Audio (TTFA) low.
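Time-To-First-Audio can be measured as the delay between requesting synthesis and receiving the first audio chunk. The sketch below assumes a streaming interface that yields chunks one at a time; `fake_tts_stream` is a hypothetical stand-in with simulated compute time, not a Voxtral TTS API.

```python
import time


def measure_ttfa(chunk_stream):
    """Return (seconds until the first chunk arrives, the first chunk itself)."""
    start = time.perf_counter()
    first_chunk = next(iter(chunk_stream))  # blocks until the first chunk is ready
    return time.perf_counter() - start, first_chunk


def fake_tts_stream(num_chunks=3, delay_s=0.05):
    """Hypothetical streaming synthesizer that yields silent audio chunks."""
    for _ in range(num_chunks):
        time.sleep(delay_s)   # simulated per-chunk compute time
        yield b"\x00" * 480   # placeholder bytes, not real audio


ttfa_s, first_chunk = measure_ttfa(fake_tts_stream())
print(f"TTFA: {ttfa_s * 1000:.1f} ms for a {len(first_chunk)}-byte chunk")
```

Only the first chunk is consumed, so the measurement reflects perceived responsiveness rather than total synthesis time.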
Evaluation Metrics
To evaluate how natural the generated speech sounds and how closely it resembles human speech, Mistral AI relies on both objective and subjective metrics. The primary subjective metric is the Mean Opinion Score (MOS), in which human evaluators rate audio clips on a scale from 1 (poor) to 5 (excellent).
Figure: Comparative Mean Opinion Score (MOS) for naturalness, on a scale from 1 (poor) to 5 (excellent).
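A MOS is simply the mean of the listener ratings, usually reported with a confidence interval. The following sketch uses a normal-approximation 95% interval; the ratings are made-up values for illustration, not measurements of any model.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.fmean(ratings)
    # Standard error of the mean, scaled by the normal quantile z.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


ratings = [4, 5, 4, 3, 5, 4, 4, 5]  # made-up listener ratings, not measured data
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few raters the interval is wide, which is why MOS differences between systems are only meaningful when the intervals do not overlap substantially.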
Another critical metric for voice cloning is the Speaker Similarity Score (SIM): the cosine similarity between speaker embeddings of the reference audio and the generated audio, extracted by a speaker verification model (e.g., ECAPA-TDNN).
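Given two embedding vectors, the similarity itself is a one-line computation. The sketch below assumes the embeddings have already been extracted; the random vectors are placeholders for what a speaker verification model would produce, and the 192-dimensional size is an arbitrary choice for the example.

```python
import numpy as np


def speaker_similarity(reference_embedding, generated_embedding):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    ref = np.asarray(reference_embedding, dtype=np.float64)
    gen = np.asarray(generated_embedding, dtype=np.float64)
    denom = np.linalg.norm(ref) * np.linalg.norm(gen)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero")
    return float(np.dot(ref, gen) / denom)


rng = np.random.default_rng(0)
reference = rng.normal(size=192)                     # placeholder embedding
generated = reference + 0.1 * rng.normal(size=192)   # slightly perturbed copy
print(f"SIM = {speaker_similarity(reference, generated):.3f}")
```

Values near 1 indicate that the verification model places the two clips close together in speaker space; the score depends on the embedding model used, so scores from different extractors are not directly comparable.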
Official Implementation Snippet
Deploying Voxtral TTS locally with the Hugging Face transformers library is straightforward. Below is a code snippet, based on the official documentation, that demonstrates how to load the model and perform zero-shot voice cloning:
import soundfile as sf
import torch
from transformers import AutoProcessor, AutoModelForTextToWaveform

# Load the processor and model from the official Mistral AI repository
processor = AutoProcessor.from_pretrained("mistralai/Voxtral-TTS-4B")
model = AutoModelForTextToWaveform.from_pretrained("mistralai/Voxtral-TTS-4B", torch_dtype=torch.float16)
model.to("cuda")

# Prepare text and a 3-second reference audio for voice cloning
text = "Welcome to the future of open-weights voice AI."
reference_audio_path = "path/to/3_second_reference.wav"
inputs = processor(
    text=text,
    audios=reference_audio_path,
    return_tensors="pt"
).to("cuda")

# Generate the waveform without tracking gradients
with torch.no_grad():
    output_waveform = model.generate(**inputs)

# Save the generated audio at 24 kHz; cast to float32 first because
# soundfile cannot write float16 arrays
sf.write("output.wav", output_waveform.float().cpu().numpy().squeeze(), samplerate=24000)
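Before passing a reference clip to the model, it can help to confirm that the clip is long enough. The sketch below uses only Python's standard `wave` module and works for PCM WAV files. The 3-second threshold comes from the figure quoted in this article; whether the model enforces any minimum is not something this check assumes. The helper names are hypothetical.

```python
import os
import tempfile
import wave


def check_reference_clip(path, min_seconds=3.0):
    """Return (duration in seconds, sample rate, whether the clip is long enough)."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        duration_s = wav_file.getnframes() / sample_rate
    return duration_s, sample_rate, duration_s >= min_seconds


def write_silence(path, seconds, sample_rate=24000):
    """Write a mono 16-bit silent WAV file, used here only as test input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp_dir:
    clip_path = os.path.join(tmp_dir, "reference.wav")
    write_silence(clip_path, seconds=3.0)
    duration_s, sample_rate, long_enough = check_reference_clip(clip_path)
    print(duration_s, sample_rate, long_enough)
```

Failing fast on a clip that is too short gives a clearer error than a degraded clone discovered after synthesis.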
Context and Limitations
It is important to distinguish Voxtral TTS from the original “Voxtral” models released in 2025. The 2025 models were primarily focused on speech understanding (such as speech-to-text, audio translation, and reasoning over audio inputs). Voxtral TTS completes the ecosystem by providing the generation/output capability.
While the model weights are open, researchers have noted that specific internal components (such as parts of the reference encoder) were truncated in the public release for safety and alignment reasons. This slightly restricts the model’s ability to perfectly clone highly idiosyncratic non-speech vocalizations (like distinct laughs or coughs) compared to the unconstrained internal research versions.