OpenBMB Open-Sources VoxCPM2: High-Quality Voice Cloning No Longer Closed-Source
OpenBMB released VoxCPM2 this week: a 2-billion-parameter, open-source TTS (text-to-speech) model supporting 30 languages, drastically lowering the barrier to high-quality voice cloning.
What this is
VoxCPM2 is the latest speech synthesis model jointly launched by OpenBMB and Tsinghua University. Traditional TTS usually chops a voice into discrete codes (via discrete audio tokenizers) and then pieces them back together, a process prone to losing details like breath and emotion. VoxCPM2 instead adopts continuous speech representation, modeling and generating directly in the continuous waveform space: rather than chopping the sound up, it draws the complete sound curve in one pass, which makes the generated speech more natural. Its core selling points:
First, support for 30 languages and 9 Chinese dialects, synthesized directly from text input.
Second, timbre design and controllable cloning: you can "sculpt" a brand-new voice with natural language, clone a voice from a short audio snippet, and steer the cloned voice's emotion and speaking rate with text instructions.
Third, open source under the Apache-2.0 license, with native 48 kHz high-fidelity output and free commercial use.
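The difference between discrete tokenization and continuous representation can be illustrated with a toy sketch. This is not VoxCPM2's actual pipeline; it is a minimal stand-alone example showing why snapping a waveform to a small codebook erases faint detail (the "breath and emotion" the article mentions), while a near-continuous representation preserves it:

```python
import math

def quantize(samples, levels):
    """Snap each sample to the nearest of `levels` evenly spaced values
    in [-1, 1], crudely mimicking a discrete audio tokenizer's codebook."""
    step = 2.0 / (levels - 1)
    return [round((s + 1.0) / step) * step - 1.0 for s in samples]

def rmse(a, b):
    """Root-mean-square error between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# A slow sine wave carrying a faint high-frequency "breath-like" detail
# of amplitude 0.02 riding on top of it.
n = 1000
wave = [math.sin(2 * math.pi * 5 * t / n)
        + 0.02 * math.sin(2 * math.pi * 97 * t / n)
        for t in range(n)]

coarse = quantize(wave, 8)     # tiny codebook: quantization error swamps the detail
fine = quantize(wave, 4096)    # near-continuous: detail survives almost intact

print(f"8-level codebook RMSE:    {rmse(wave, coarse):.4f}")
print(f"4096-level codebook RMSE: {rmse(wave, fine):.6f}")
```

With 8 levels the quantization error is larger than the 0.02-amplitude detail, so that detail is unrecoverable; with 4096 levels the error is orders of magnitude smaller. Continuous-space modeling sidesteps this rounding loss entirely.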
Industry view
We note that open-source voice models are rapidly closing the performance gap with closed-source products. VoxCPM2 has matched closed-source models such as MiniMax in public evaluations, which is good news for SMEs: they now have a low-cost, locally deployable voice option and are no longer constrained by big-tech API costs and privacy restrictions. The bigger concerns, however, are deployment cost and risk. Although the model is open source, 2 billion parameters plus high-fidelity output mean inference compute is not cheap (official tests report a real-time factor of about 0.3 on an RTX 4090, i.e., each second of audio takes roughly 0.3 seconds to generate); SMEs must weigh the hardware investment before running it in production. More troubling still, as the cloning barrier drops to a few seconds of audio, the risk of deepfake voice fraud is rising sharply, and the industry still lacks effective voice-provenance and anti-spoofing mechanisms.
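The real-time factor figure translates directly into capacity planning. A minimal sketch, using the standard definition of RTF (generation time divided by audio duration) and the 0.3 value from the cited RTX 4090 test:

```python
def generation_time(audio_seconds, rtf=0.3):
    """Wall-clock time to synthesize `audio_seconds` of speech.
    RTF 0.3 means each second of audio takes 0.3 s to generate."""
    return audio_seconds * rtf

# A 10-minute audiobook chapter at the reported RTF:
chapter = 10 * 60  # 600 s of audio
print(f"{chapter} s of audio in ~{generation_time(chapter):.0f} s")  # ~180 s

# Rough upper bound for one GPU running around the clock:
seconds_per_day = 24 * 3600
hours_of_audio = seconds_per_day / 0.3 / 3600
print(f"~{hours_of_audio:.0f} hours of audio per GPU-day")  # ~80 hours
```

Real throughput will be lower once batching overhead, model loading, and text preprocessing are counted, but the order of magnitude shows why a single consumer GPU can already cover an SME's dubbing or customer-service workload.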
Impact on regular people
For enterprise IT: customer service systems and audio content production can deploy the open-source model locally, shedding dependence on external APIs and cutting long-term operating costs.
For individual workers: pure voice-delivery roles such as junior dubbing and audiobook narration will face automation pressure, and the licensing and ownership of personal "voice assets" will become a real issue.
For the consumer market: we will hear more natural AI voices, including dialect accents, making digital companions and in-car assistants feel noticeably more lifelike.