MiMo-V2.5 Voice

Bilingual ASR for dialects, code-switching, and songs

Transcription
Founded 2026
8B open-source speech recognition model
Eight Chinese dialects natively supported (including Wu, Cantonese, Hokkien, Sichuanese)
Chinese-English code-switching with no language tags
Lyrics transcription under accompaniment and pitch variation
Multi-speaker and noisy environment robustness
Native punctuation, no post-processing needed
MIT license, Python API, Gradio demo, self-hostable

Our Take

Xiaomi just dropped an 8B open-source speech model that actually competes with Whisper on accuracy. MiMo-V2.5-ASR handles eight Chinese dialects, code-switched Chinese-English speech, AND song lyrics, with no language tags and no post-processing required. The numbers back it up: 5.73% WER on English versus Whisper's 7.44%, 19.55% on Wu dialect versus FunASR's 29.08%, and 3.95% on lyrics. It's MIT licensed, free, and self-hostable. It also addresses what the benchmarks won't tell you: most ASR models look amazing on clean studio data and then quietly fail in production, where audio is noisy, speakers overlap, and people switch languages mid-sentence. This is the move for voice product teams building bilingual or Chinese-language pipelines who need accuracy that actually holds up outside the lab.

MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi that transcribes Mandarin, English, eight Chinese dialects, code-switched speech, and song lyrics. Built for ML engineers, researchers, and developers building real-world voice applications.

Problem It Solves
Most ASR models are benchmarked on clean studio data and deployed into the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.
Target Customer
ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines who need accuracy that holds up outside the lab.
Use Cases
Bilingual Chinese-English transcription, Regional dialect transcription, Song lyrics transcription, Voice applications for multilingual environments, Tourism audio guides
Pricing Details
MIT licensed, open-source, self-hostable
Free Tier
Yes
Differentiator
On the Open ASR Leaderboard: 5.73% WER on English vs Whisper large-v3 at 7.44%, 19.55% on Wu dialect vs FunASR-1.5 at 29.08%, and 3.95% on lyrics vs Gemini 2.5 Pro at 4.25%. Training is staged, combining mid-training, supervised fine-tuning, and reinforcement learning, and specifically targets real-world scenarios.
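For anyone who wants to sanity-check numbers like these on their own audio, here is a minimal sketch of how a word error rate is typically computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function names below are our own illustration, not part of the MiMo Python API, and the whitespace tokenization is an assumption that suits English. Chinese error rates are usually computed per character instead, so treat this as a sketch rather than the leaderboard's exact scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are whitespace-separated, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline's error rate that the candidate removes."""
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Error rates (in percent) as quoted in the listing: (baseline, MiMo).
    quoted = {
        "English": (7.44, 5.73),
        "Wu dialect": (29.08, 19.55),
        "Lyrics": (4.25, 3.95),
    }
    for name, (baseline, mimo) in quoted.items():
        print(f"{name}: {relative_reduction(baseline, mimo):.1%} relative error reduction")
```

The relative reduction is worth computing because the absolute gaps are hard to compare across tasks: a gap of under two points on English and a gap of nearly ten points on Wu dialect are very different in proportion to their baselines.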
Why Now
Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR demonstrates the gap is now very small, and in some scenarios gone.
Traction
Notable Metrics: 110 followers, 114 points, Day Rank #7

Key Facts

Category
Transcription
Location
China
Founded
2026
Pricing
Free
Discovered via
Product Hunt
