Accuracy benchmarks

Transcription accuracy by language

Published word error rates for the models transcribe.so ships, next to Whisper large-v3 and other models where the vendors publish comparable numbers. Every figure links to its source. Nothing on this page is estimated: where no figure has been published, the cell reads "not published" rather than showing a guess.

Last updated 2026-06-12.

How to read these numbers

WER (word error rate) is the standard accuracy metric for speech-to-text. It counts wrong, missing, and extra words against a human-verified reference transcript. A WER of 3% means about 3 mistakes per 100 words. Lower is better. For languages written without spaces, such as Chinese and Japanese, the same count is done per character. Percentages are comparable between models on the same language and test set, not across languages.
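As a rough illustration of the counting described above, the following sketch computes WER and its per-character counterpart with a standard edit-distance alignment. It is a minimal sketch, not the scoring code used by any source cited on this page; the function names and the example sentences are our own assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Fewest substitutions, deletions, and insertions that turn ref into hyp."""
    # prev[j] is the distance between the ref tokens seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: a reference token is missing
                curr[j - 1] + 1,         # insertion: an extra token in the output
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level errors divided by reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, for scripts without whitespace word boundaries."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# One substituted word and one missing word against a 6-word reference.
print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.1%}")  # 33.3%
# One substituted character and one missing character against 7 characters.
print(f"{cer('今日は晴れです', '今日は雨です'):.1%}")  # 28.6%
```

Published benchmarks also normalize both texts before counting, so numbers from this sketch are not comparable to the figures on this page.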

WER by language (FLEURS)

FLEURS is a public multilingual benchmark of read speech built by Google Research. The table shows every language for which Qwen3-ASR-Flash, the default high-accuracy model on transcribe.so, has a published score, sorted from lowest to highest Qwen3-ASR-Flash error rate. Other columns show published figures only where they exist for the same benchmark.

| Language | Qwen3-ASR-Flash (ships on transcribe.so) | Whisper large-v3 (same-split comparison) | GPT-4o Transcribe (ships on transcribe.so) | Voxtral Small (Mistral published) |
| --- | --- | --- | --- | --- |
| Italian (Italiano) | 1.60% | not published | not published | 2.31% |
| Korean (한국어) | 2.07% | not published | not published | not published |
| Chinese (Mandarin) (中文) | 2.38% | 4.09% | 2.44% | not published |
| Spanish (Español) | 2.68% | not published | not published | 2.75% |
| English | 2.72% | 4.08% | 2.40% | 3.61% |
| German (Deutsch) | 3.03% | not published | not published | 3.54% |
| Japanese (日本語) | 3.09% | not published | not published | not published |
| Portuguese (Português) | 3.18% | not published | not published | 3.57% |
| French (Français) | 3.44% | not published | not published | 4.22% |
| Cantonese (粵語) | 3.50% | 9.18% | 4.98% | not published |
| Vietnamese (Tiếng Việt) | 3.64% | not published | not published | not published |
| Indonesian (Bahasa Indonesia) | 3.65% | not published | not published | not published |
| Dutch (Nederlands) | 4.35% | not published | not published | 4.89% |
| Russian (Русский) | 4.81% | not published | not published | not published |
| Thai (ภาษาไทย) | 5.53% | not published | not published | not published |
| Turkish (Türkçe) | 6.13% | not published | not published | not published |
| Polish (Polski) | 7.24% | not published | not published | not published |
| Romanian (Română) | 10.45% | not published | not published | not published |
| Malay (Bahasa Melayu) | 11.37% | not published | not published | not published |
| Danish (Dansk) | 11.85% | not published | not published | not published |
| Finnish (Suomi) | 12.21% | not published | not published | not published |
| Hindi (हिन्दी) | 13.77% | not published | not published | 10.32% |
| Greek (Ελληνικά) | 13.85% | not published | not published | not published |
| Arabic (العربية) | 14.78% | not published | not published | 14.64% |
| Swedish (Svenska) | 15.02% | not published | not published | not published |
| Persian (فارسی) | 18.37% | not published | not published | not published |
| Czech (Čeština) | 18.68% | not published | not published | not published |
| Filipino (Filipino) | 19.17% | not published | not published | not published |
| Hungarian (Magyar) | 21.77% | not published | not published | not published |

Sources: Qwen3-ASR-Flash from the Qwen3-ASR technical report, Table A.2(b). Whisper large-v3 from the Qwen3-ASR-1.7B model card, which reports Whisper on the same FLEURS splits (published for English, Mandarin, and Cantonese only). GPT-4o Transcribe from the Qwen3-ASR technical report, Table 3. Voxtral Small from the Voxtral paper, Table 4 (Mistral). Benchmark: FLEURS (Google Research).
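For readers who want to work with these figures programmatically, one way to keep unpublished scores from being mistaken for zeros is to store them as None and compare models only where both scores exist. The sketch below is an assumption about how one might model the table, not how transcribe.so stores it; the Row type and helper names are hypothetical, and only five of the rows are copied in.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    language: str
    qwen_flash: float
    whisper_v3: Optional[float]     # None means "not published"
    gpt4o: Optional[float]
    voxtral_small: Optional[float]


# A subset of the FLEURS table on this page, values in percent.
ROWS = [
    Row("Italian", 1.60, None, None, 2.31),
    Row("Chinese (Mandarin)", 2.38, 4.09, 2.44, None),
    Row("English", 2.72, 4.08, 2.40, 3.61),
    Row("Cantonese", 3.50, 9.18, 4.98, None),
    Row("Hindi", 13.77, None, None, 10.32),
]


def comparable(rows, column):
    """Return (language, flash, other) only where the other score is published."""
    return [
        (r.language, r.qwen_flash, getattr(r, column))
        for r in rows
        if getattr(r, column) is not None
    ]


def fmt(value):
    """Render a cell the way the table does."""
    return "not published" if value is None else f"{value:.2f}%"


for language, flash, whisper in comparable(ROWS, "whisper_v3"):
    print(f"{language}: Flash {fmt(flash)} vs Whisper large-v3 {fmt(whisper)}")
```

With this subset, only Mandarin, English, and Cantonese come back for the Whisper column, matching the three same-split figures the model card publishes.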

English, multi-domain: Open ASR Leaderboard

The HuggingFace Open ASR Leaderboard is the cleanest apples-to-apples comparison available: an independent third party re-runs every submitted model on the same 8 English datasets with the same text normalization and open-source scripts. It covers harder, real-world audio than FLEURS: AMI (meetings), Earnings22 (earnings calls), GigaSpeech (web audio and podcasts), LibriSpeech clean and LibriSpeech other (audiobooks, counted as two datasets), SPGISpeech (financial calls), TED-LIUM (talks), and VoxPopuli (parliament speech).

| Rank | Model | Provider | Avg WER (8 datasets) |
| --- | --- | --- | --- |
| #4 | Qwen3-ASR-1.7B (transcribe.so family) | Alibaba (Qwen) | 6.37% |
| #8 | Scribe v2 | ElevenLabs | 6.64% |
| #9 | Universal-3 Pro | AssemblyAI | 6.80% |
| #15 | Parakeet-TDT-0.6B-v3 | NVIDIA | 7.11% |
| #19 | Chirp 3 | Google | 7.37% |
| #34 | Whisper large-v3 | OpenAI | 8.13% |
| #43 | Whisper large-v3-turbo | OpenAI | 8.51% |

Selected rows from the HuggingFace Open ASR Leaderboard, snapshot 2026-05-05, out of 80+ ranked models. The leaderboard evaluates the open-weights Qwen3-ASR-1.7B; transcribe.so ships Qwen3-ASR-Flash, the hosted top tier of the same family, which is not separately ranked. Methodology: Open ASR Leaderboard methodology paper (Srivastav et al.).

Language details

Korean (한국어)

Qwen3-ASR-Flash: 2.07%
Qwen3-ASR-1.7B: 2.57%

Korean has the second-lowest error rate in the entire Qwen3-ASR-Flash FLEURS table at 2.07% WER, behind only Italian. Roughly 2 words in 100 are wrong on benchmark audio, a level at which a transcript is often usable without a correction pass.

OpenAI has not published per-language FLEURS results for Whisper large-v3, so we show no Whisper number for Korean rather than estimating one. Korean services built on Whisper-family models inherit whatever Whisper does on Korean, and none of the Korean transcription products we are aware of publish a measured WER at all.

On transcribe.so, Korean audio runs through Qwen3-ASR-Flash by default. Word-level timestamps are supported, and diarization is included in the base price.

Japanese (日本語)

Qwen3-ASR-Flash: 3.09%
Qwen3-ASR-1.7B: 5.20%

Japanese scores 3.09% on FLEURS with Qwen3-ASR-Flash. The gap between Flash (3.09%) and the open-weights 1.7B variant (5.20%), about 2.1 percentage points, is larger for Japanese than for most languages. For Japanese specifically, the hosted model that transcribe.so ships is therefore meaningfully stronger than its open-weights sibling, the model against which self-hosted Whisper alternatives are often compared.
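To put the Flash versus 1.7B gap in context, the sketch below computes the difference in percentage points for the five languages where this page quotes both figures. It is only an illustration over those five pairs, not the full per-language comparison in the technical report, and the variable names are our own.

```python
# (Flash, 1.7B) FLEURS error rates in percent, as quoted on this page.
SCORES = {
    "Korean": (2.07, 2.57),
    "Japanese": (3.09, 5.20),
    "Chinese (Mandarin)": (2.38, 2.41),
    "Cantonese": (3.50, 3.98),
    "English": (2.72, 3.35),
}

# Gap in percentage points: how much higher the 1.7B error rate is than Flash.
gaps = {lang: round(open_17b - flash, 2) for lang, (flash, open_17b) in SCORES.items()}

# Largest gap first.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for lang, gap in ranked:
    print(f"{lang}: {gap:.2f} points")
```

Among these five, Japanese has the widest gap at 2.11 points and Mandarin the narrowest at 0.03.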

Japanese error rates are computed at character level because Japanese text has no whitespace word boundaries. Character-level and word-level percentages are not directly comparable across languages; compare models within a language instead.

As with Korean, OpenAI publishes no per-language FLEURS figure for Whisper large-v3 on Japanese, so no Whisper column is shown. We do not estimate numbers a vendor has not published.

Chinese (Mandarin) (中文)

Qwen3-ASR-Flash: 2.38%
Qwen3-ASR-1.7B: 2.41%
Whisper large-v3: 4.09%
GPT-4o Transcribe: 2.44%

Chinese is the one CJK language with a clean same-split Whisper comparison. On the Qwen3-ASR-1.7B model card, the open-weights 1.7B model scores 2.41% against Whisper large-v3 at 4.09% on the same FLEURS splits, a 41% relative reduction in errors. The Flash model transcribe.so ships scores 2.38% in the technical report, and GPT-4o Transcribe reports 2.44%.

The gap widens dramatically for Cantonese: 3.50% for Flash and 3.98% for the 1.7B model against 9.18% for Whisper large-v3 on the same splits. On Cantonese benchmark audio, Whisper large-v3 makes roughly 2.3 times as many errors as the 1.7B model and roughly 2.6 times as many as Flash.
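The relative figures in this section follow from simple arithmetic on the published error rates. The sketch below shows one way to reproduce them; the function names are illustrative and do not come from any cited source.

```python
def relative_error_reduction(baseline: float, model: float) -> float:
    """Fraction of the baseline's errors that the model avoids."""
    return (baseline - model) / baseline


def error_ratio(baseline: float, model: float) -> float:
    """How many times as many errors the baseline makes as the model."""
    return baseline / model


# Mandarin, same FLEURS splits: Whisper large-v3 4.09% vs Qwen3-ASR-1.7B 2.41%.
print(f"{relative_error_reduction(4.09, 2.41):.1%}")  # 41.1%

# Cantonese: Whisper large-v3 9.18% vs the 1.7B model (3.98%) and Flash (3.50%).
print(f"{error_ratio(9.18, 3.98):.1f}x")  # 2.3x
print(f"{error_ratio(9.18, 3.50):.1f}x")  # 2.6x
```

Both measures only make sense when the two error rates come from the same test set and scoring rules, which is why they are applied here to same-split figures only.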

The Qwen3-ASR family also supports 22 Chinese dialect and accent variants, which, to our knowledge, no other general-purpose ASR model currently matches.

English

Qwen3-ASR-Flash: 2.72%
Qwen3-ASR-1.7B: 3.35%
Whisper large-v3: 4.08%
GPT-4o Transcribe: 2.40%

English has the most independent evidence. On the HuggingFace Open ASR Leaderboard, where a third party re-runs every model on the same 8 datasets with the same normalization, Qwen3-ASR-1.7B averages 6.37% WER against 8.13% for Whisper large-v3. That is 21.6% fewer errors at near-identical throughput, measured by neither Alibaba nor us.

On FLEURS read speech, Flash scores 2.72% and GPT-4o Transcribe, which transcribe.so also ships for diarized English transcription, reports 2.40%, against 4.08% for Whisper large-v3 on the same splits per the Qwen3-ASR-1.7B model card.

The leaderboard datasets are harder than FLEURS because they include meetings, earnings calls, and noisy web audio rather than read sentences, so the leaderboard average runs higher. Expect absolute error rates on real-world recordings to look more like the leaderboard numbers than the FLEURS numbers, for every model.

Methodology

Where the numbers come from. We did not run these benchmarks ourselves. Every figure is a published number from one of the sources below, reproduced exactly. When a vendor has not published a number for a model and language, the cell says so. We never estimate a competitor's accuracy, and we do not compare numbers across different benchmarks.

FLEURS. A public benchmark from Google Research covering 102 languages. Each language's test set is read speech: sentences from the FLoRes translation corpus recorded by native speakers, a few hours of audio per language. Per-language figures in the table above are word error rates, except for languages without whitespace word boundaries (Chinese, Japanese, Cantonese), which are scored per character, as is standard practice in the cited sources.

Open ASR Leaderboard. Maintained by HuggingFace, with open-source evaluation scripts. Every submitted model is re-run by the leaderboard itself on 8 English datasets spanning meetings, earnings calls, audiobooks, talks, and parliament speech. Before scoring, both the model output and the reference pass through the Whisper text normalizer: lowercasing, punctuation removal, number and abbreviation expansion. This matters because normalization differences alone can move WER by whole percentage points, which is one reason vendor-published accuracy claims without stated normalization rules are not comparable. The reported figure is the macro-average across the 8 datasets. Snapshot used here: 2026-05-05.
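To illustrate why normalization and averaging choices matter, here is a small sketch. The normalizer is a deliberately simplified stand-in (lowercasing and punctuation removal only), not the Whisper text normalizer the leaderboard uses, and the per-dataset error counts are hypothetical values chosen for the example, not leaderboard data.

```python
import re
from statistics import mean


def word_errors(ref, hyp):
    """Word-level edit distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    ref = reference.split()
    return word_errors(ref, hypothesis.split()) / len(ref)


def simple_normalize(text):
    """Simplified stand-in: lowercase, drop punctuation except apostrophes."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


reference = "Hello, world. It's nine o'clock."
hypothesis = "hello world it's nine o'clock"

raw = wer(reference, hypothesis)
normalized = wer(simple_normalize(reference), simple_normalize(hypothesis))
print(f"raw WER {raw:.0%}, normalized WER {normalized:.0%}")  # raw WER 80%, normalized WER 0%

# Hypothetical (errors, reference words) per dataset, for illustration only.
datasets = {"dataset_a": (160, 1000), "dataset_b": (60, 3000)}

# Macro-average: mean of per-dataset WERs, so every dataset counts equally.
macro = mean(errors / words for errors, words in datasets.values())
# Pooled average: total errors over total words, so larger datasets dominate.
pooled = sum(e for e, _ in datasets.values()) / sum(w for _, w in datasets.values())
print(f"macro {macro:.1%}, pooled {pooled:.1%}")  # macro 9.0%, pooled 5.5%
```

The same transcript scores 80% or 0% depending only on normalization, and the same error counts average to 9.0% or 5.5% depending on the averaging rule, which is why both need to be stated before two figures can be compared.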

Model versions. Qwen3-ASR-Flash refers to Qwen3-ASR-Flash-1208 as reported in the Qwen3-ASR technical report. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are the open-weights variants. Whisper large-v3 is OpenAI's strongest open Whisper release. GPT-4o Transcribe figures are from the Qwen3-ASR technical report (Table 3). Voxtral Small figures are Mistral's own, from the Voxtral paper (Table 4).

What transcribe.so ships. The default high-accuracy pipeline runs Qwen3-ASR-Flash. GPT-4o Transcribe Diarize and Voxtral Mini Transcribe are also available, and the platform picks the best model per file based on language, length, and whether speaker labels are needed. Diarization is included in the base price, not an add-on.

Honest limitations

  • FLEURS is read speech recorded in quiet conditions. Real meetings, lectures, and podcasts are noisier and more spontaneous, so absolute error rates on your audio will be higher for every model on this page. The relative ordering between models is the informative part.
  • The per-language table leans on one corpus. A model can rank differently on other test sets, accents, or domains. Where a multi-domain comparison exists (English, via the Open ASR Leaderboard), we show it separately.
  • Whisper large-v3 has no vendor-published per-language FLEURS results for most languages, including Korean and Japanese. We mark those cells "not published" instead of estimating, so for most languages the table offers no Whisper comparison at all.
  • The Open ASR Leaderboard ranks the open-weights Qwen3-ASR-1.7B, not the hosted Qwen3-ASR-Flash that transcribe.so ships. Qwen positions Flash as the stronger top tier of the same family, but Flash itself is not independently ranked there.
  • Benchmarks do not cover heavily accented speech, domain jargon (medical, legal), overlapping speakers, or very low-quality recordings. No public benchmark we are aware of does this well across languages.
  • Accuracy claims you may see elsewhere, such as "98.86% accurate", are usually published without a test set, normalization rules, or model version, and cannot be compared to anything on this page.

Test it on your own audio

The only benchmark that matters is your recording. Free credits included, no credit card required, and you see the exact price before you confirm.

Or start from a link: YouTube to transcript, podcast transcription, meeting transcription.