What is VoxCPM?
VoxCPM is an open-source, tokenizer-free text-to-speech (TTS) system from OpenBMB; the current release is VoxCPM2. Unlike conventional TTS pipelines that tokenize speech into discrete units, VoxCPM uses a diffusion autoregressive architecture to generate continuous speech representations end-to-end. The result is natural, expressive synthesis, particularly for multilingual content, without the quality ceiling imposed by tokenization artifacts. VoxCPM2 is a 2-billion-parameter model trained on over 2 million hours of multilingual speech data.
Who it's for
VoxCPM is designed for AI researchers, developers building multilingual voice applications, and teams that need professional-quality TTS without per-character cloud billing. It is especially useful for applications requiring non-English voice synthesis, voice cloning with style control, or high-quality audio output at 48 kHz. The Apache-2.0 license permits commercial use, subject to its attribution and notice requirements.
Key capabilities
- 30-language multilingual support: Arabic, Chinese, English, French, German, Japanese, Korean, Spanish, and 22 more — no language tags required
- Voice Design: generate a brand-new voice from a natural-language description (gender, age, tone, emotion, pace)
- Controllable Voice Cloning: clone any voice from a short reference clip with optional style guidance
- 48 kHz studio-quality audio: built-in super-resolution via asymmetric AudioVAE V2, so no external upsampler is needed
- Real-time streaming: real-time factor (RTF) as low as ~0.13 on an NVIDIA RTX 4090 with vLLM-Omni serving
- OpenAI-compatible API via vLLM integration for easy drop-in deployment
- Fully open-source weights and code under the Apache-2.0 license
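To make the RTF figure in the list above concrete: real-time factor is the time spent synthesizing divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The following is a minimal sketch of that arithmetic only. The function names and the timing numbers are illustrative assumptions, not part of VoxCPM's API or a measured benchmark.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def audio_seconds_per_wall_second(rtf: float) -> float:
    """Return how many seconds of audio one second of compute yields."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timings chosen to reproduce an RTF of about 0.13:
# 1.3 s of compute for a 10 s clip.
rtf = real_time_factor(synthesis_seconds=1.3, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
print(f"Audio generated per wall-clock second: {audio_seconds_per_wall_second(rtf):.1f} s")
```

At an RTF of roughly 0.13, one second of compute yields several seconds of audio, which is what leaves headroom for streaming playback.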
Why choose it over ElevenLabs or Azure TTS?
ElevenLabs, Google Cloud TTS, and Azure Cognitive Services Text-to-Speech are powerful cloud services, but they charge per character, impose rate limits, and process your text and audio on their servers. VoxCPM runs entirely on your own hardware: a single GPU workstation can serve real-time multilingual speech synthesis with no per-character fees. For teams processing high volumes of text-to-speech in non-English languages, or building voice products where latency and data privacy are critical, VoxCPM offers a compelling self-hosted path with state-of-the-art multilingual quality.
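Whether self-hosting pays off depends on volume. As a rough sketch, the break-even point is the number of characters at which per-character billing equals a fixed self-hosting cost. Both inputs below are placeholder numbers for illustration, not any vendor's actual price or a real hardware quote, and the function name is hypothetical.

```python
def break_even_characters(fixed_cost: float, price_per_million_chars: float) -> float:
    """Return the character volume at which per-character billing
    equals a fixed self-hosting cost over the same period."""
    if price_per_million_chars <= 0:
        raise ValueError("price_per_million_chars must be positive")
    return fixed_cost / price_per_million_chars * 1_000_000


# Placeholder inputs (assumptions, not real prices):
monthly_self_hosting_cost = 300.0  # amortized GPU, power, and upkeep per month
cloud_price_per_million = 15.0     # cloud TTS price per 1M characters

chars = break_even_characters(monthly_self_hosting_cost, cloud_price_per_million)
print(f"Break-even volume: {chars:,.0f} characters per month")
```

Above that monthly volume the self-hosted setup costs less under these assumptions; below it, per-character billing is cheaper.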

