MiMo-V2.5 Voice

Bilingual ASR for dialects, code-switching, and songs

Transcription
Founded 2026
Eight Chinese dialects natively supported (including Wu, Cantonese, Hokkien, and Sichuanese)
Chinese-English code-switching with no language tags
Lyrics transcription under accompaniment and pitch variation
Multi-speaker and noisy environment robustness
Native punctuation, no post-processing needed
MIT license, Python API, Gradio demo, self-hostable
8B open-source speech recognition model

Our Take

Xiaomi just dropped an 8B open-source speech model that actually competes with Whisper on accuracy. MiMo-V2.5-ASR handles eight Chinese dialects, code-switched Chinese-English speech, and song lyrics, with no language tags required. The numbers back it up: 5.73% WER on English versus Whisper large-v3's 7.44%, 19.55% on Wu dialect versus FunASR-1.5's 29.08%, and 3.95% on lyrics versus Gemini 2.5 Pro's 4.25%. It's MIT licensed, free, and self-hostable, and it addresses what the benchmarks won't tell you: most ASR models look amazing on clean studio data and then quietly fail in production, where audio is noisy, speakers overlap, and people switch languages mid-sentence. This is the move for voice product teams building bilingual or Chinese-language pipelines who need accuracy that actually holds up outside the lab.

MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi that transcribes Mandarin, English, eight Chinese dialects, code-switched speech, and song lyrics. Built for ML engineers, researchers, and developers building real-world voice applications.

Problem It Solves

Most ASR models are benchmarked on clean studio data and deployed into the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.

Target Customer

ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines who need accuracy that holds up outside the lab.

Use Cases

Bilingual Chinese-English transcription, Regional dialect transcription, Song lyrics transcription, Voice applications for multilingual environments, Tourism audio guides

Pricing Details

MIT licensed, open-source, self-hostable

Free Tier

Yes

Differentiator

On the Open ASR Leaderboard: 5.73% WER on English vs Whisper large-v3 at 7.44%, 19.55% on Wu dialect vs FunASR-1.5 at 29.08%, 3.95% on lyrics vs Gemini 2.5 Pro at 4.25%. The model uses staged training that combines mid-training, supervised fine-tuning, and reinforcement learning, specifically targeting real-world scenarios.
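For readers who want to sanity-check figures like these, here is a minimal sketch of how word error rate (WER) is typically computed, plus the relative error reduction implied by the numbers quoted above. The function names `word_error_rate` and `relative_reduction` are illustrative, not part of any MiMo API. The sketch assumes plain whitespace tokenization with no text normalization, which real leaderboards usually apply. Chinese benchmarks often report character error rate instead, and the listing does not say which metric the Wu and lyrics figures use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline error rate removed by the candidate."""
    return (baseline - candidate) / baseline


# One deletion ("the") and one substitution ("off" -> "of") over 4 words.
print(word_error_rate("turn the lights off", "turn lights of"))  # 0.5

# Error rates as quoted in the listing (percent): (baseline, MiMo-V2.5-ASR).
quoted = {
    "English vs Whisper large-v3": (7.44, 5.73),
    "Wu dialect vs FunASR-1.5": (29.08, 19.55),
    "Lyrics vs Gemini 2.5 Pro": (4.25, 3.95),
}
for name, (baseline, candidate) in quoted.items():
    print(f"{name}: {relative_reduction(baseline, candidate):.1%} fewer errors")
```

On the quoted figures this works out to roughly 23% fewer errors on English, about 33% on Wu dialect, and about 7% on lyrics, so the lyrics lead is much narrower than the other two.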

Why Now

Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR demonstrates the gap is now very small, and in some scenarios gone.

Traction

Notable Metrics: 110 followers, 114 points, Day Rank #7

Links

Website, GitHub
Source: Product Hunt
