What is WER (word error rate)?

WER counts the minimum number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. A WER of 3% means roughly 3 mistakes per 100 words. Lower is better. For languages without spaces between words, such as Chinese and Japanese, the same calculation is done per character instead.

Did transcribe.so measure these numbers?

No. Every number on this page is a published figure from the HuggingFace Open ASR Leaderboard, the Qwen3-ASR technical report and model card, or the Voxtral paper, with links to each source. We never estimate a number a vendor has not published, and we mark the cell "not published" instead of guessing.

Why is there no Whisper number for Korean or Japanese?

OpenAI publishes FLEURS results for Whisper large-v3 only in aggregate form, not per language, and the same-split comparisons on the Qwen3-ASR-1.7B model card cover English, Mandarin, and Cantonese only. Rather than estimating, we mark the cell "not published". Where same-split numbers exist, Whisper large-v3 makes roughly 1.5x to 2.6x as many errors as Qwen3-ASR-Flash (4.08% vs 2.72% on English, 9.18% vs 3.50% on Cantonese).

Accuracy benchmarks

Transcription accuracy by language

Published word error rates for the models transcribe.so ships, next to Whisper large-v3 and other models where the vendors publish comparable numbers. Every figure links to its source. Nothing on this page is estimated, and a cell reads "not published" rather than showing a guess.

Last updated 2026-06-12.

How to read these numbers

WER (word error rate) is the standard accuracy metric for speech-to-text. It counts wrong, missing, and extra words against a human-verified reference transcript. A WER of 3% means about 3 mistakes per 100 words. Lower is better. For languages written without spaces, such as Chinese and Japanese, the same count is done per character. Percentages are comparable between models on the same language and test set, not across languages.
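The definition above can be made concrete with a short sketch. It assumes plain whitespace tokenization and no text normalization, which is simpler than what the cited benchmarks do; the function names `edit_distance` and `error_rate` are illustrative and not taken from any of the sources.

```python
def edit_distance(ref, hyp):
    """Minimum substitutions + deletions + insertions to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # an empty hypothesis needs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference token missing from the hypothesis
                curr[j - 1] + 1,     # extra token in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, per_character=False):
    """Word error rate, or character error rate when per_character=True."""
    if per_character:
        # Used for text without whitespace word boundaries.
        ref = [c for c in reference if not c.isspace()]
        hyp = [c for c in hypothesis if not c.isspace()]
    else:
        ref = reference.split()
        hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# One substitution (sat -> sit) and one missing word (the): 2 errors / 6 words.
print(f"{error_rate('the cat sat on the mat', 'the cat sit on mat'):.1%}")  # 33.3%

# One missing character out of 7, scored per character.
print(f"{error_rate('今日は晴れです', '今日は晴です', per_character=True):.1%}")  # 14.3%
```

Because the denominator is the reference length, a hypothesis with many extra words can score above 100%.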

WER by language (FLEURS)

FLEURS is a public multilingual benchmark of read speech built by Google Research. The table shows every language for which Qwen3-ASR-Flash, the default high-accuracy model on transcribe.so, has a published score, sorted by accuracy. Other columns show published figures only where they exist for the same benchmark.

Language	Qwen3-ASR-Flash (ships on transcribe.so)	Whisper large-v3 (same-split comparison)	GPT-4o Transcribe (ships on transcribe.so)	Voxtral Small (Mistral published)
Italian (Italiano)	1.60%	not published	not published	2.31%
Korean (한국어)	2.07%	not published	not published	not published
Chinese (Mandarin, 中文)	2.38%	4.09%	2.44%	not published
Spanish (Español)	2.68%	not published	not published	2.75%
English	2.72%	4.08%	2.40%	3.61%
German (Deutsch)	3.03%	not published	not published	3.54%
Japanese (日本語)	3.09%	not published	not published	not published
Portuguese (Português)	3.18%	not published	not published	3.57%
French (Français)	3.44%	not published	not published	4.22%
Cantonese (粵語)	3.50%	9.18%	4.98%	not published
Vietnamese (Tiếng Việt)	3.64%	not published	not published	not published
Indonesian (Bahasa Indonesia)	3.65%	not published	not published	not published
Dutch (Nederlands)	4.35%	not published	not published	4.89%
Russian (Русский)	4.81%	not published	not published	not published
Thai (ภาษาไทย)	5.53%	not published	not published	not published
Turkish (Türkçe)	6.13%	not published	not published	not published
Polish (Polski)	7.24%	not published	not published	not published
Romanian (Română)	10.45%	not published	not published	not published
Malay (Bahasa Melayu)	11.37%	not published	not published	not published
Danish (Dansk)	11.85%	not published	not published	not published
Finnish (Suomi)	12.21%	not published	not published	not published
Hindi (हिन्दी)	13.77%	not published	not published	10.32%
Greek (Ελληνικά)	13.85%	not published	not published	not published
Arabic (العربية)	14.78%	not published	not published	14.64%
Swedish (Svenska)	15.02%	not published	not published	not published
Persian (فارسی)	18.37%	not published	not published	not published
Czech (Čeština)	18.68%	not published	not published	not published
Filipino	19.17%	not published	not published	not published
Hungarian (Magyar)	21.77%	not published	not published	not published

Sources: Qwen3-ASR-Flash from the Qwen3-ASR technical report, Table A.2(b). Whisper large-v3 from the Qwen3-ASR-1.7B model card, which reports Whisper on the same FLEURS splits (published for English, Mandarin, and Cantonese only). GPT-4o Transcribe from the same technical report, Table 3. Voxtral Small from the Voxtral paper, Table 4 (Mistral). Benchmark: FLEURS (Google Research).

English, multi-domain: Open ASR Leaderboard

The HuggingFace Open ASR Leaderboard is the cleanest apples-to-apples comparison available: an independent third party re-runs every submitted model on the same 8 English datasets with the same text normalization and open-source scripts. It covers harder, real-world audio than FLEURS: AMI (meetings), Earnings22 (earnings calls), GigaSpeech (web audio and podcasts), LibriSpeech clean and other (audiobooks), SPGISpeech (financial calls), TED-LIUM (talks), VoxPopuli (parliament speech).

Rank	Model	Provider	Avg WER (8 datasets)
#4	Qwen3-ASR-1.7B (transcribe.so family)	Alibaba (Qwen)	6.37%
#8	Scribe v2	ElevenLabs	6.64%
#9	Universal-3 Pro	AssemblyAI	6.80%
#15	Parakeet-TDT-0.6B-v3	NVIDIA	7.11%
#19	Chirp 3	Google	7.37%
#34	Whisper large-v3	OpenAI	8.13%
#43	Whisper large-v3-turbo	OpenAI	8.51%

Selected rows from the HuggingFace Open ASR Leaderboard, snapshot 2026-05-05, out of 80+ ranked models. The leaderboard evaluates the open-weights Qwen3-ASR-1.7B; transcribe.so ships Qwen3-ASR-Flash, the hosted top tier of the same family, which is not separately ranked. Methodology: Open ASR Leaderboard methodology paper (Srivastav et al.).

Language details

Korean (한국어)

Qwen3-ASR-Flash: 2.07%
Qwen3-ASR-1.7B: 2.57%

Korean is the second-best language in the entire Qwen3-ASR-Flash FLEURS table at 2.07% WER, behind only Italian. Roughly 2 words in 100 are wrong on benchmark audio, which is the territory where a transcript is usable without a correction pass.

OpenAI has not published per-language FLEURS results for Whisper large-v3, so we show no Whisper number for Korean rather than estimating one. Korean services built on Whisper-family models inherit whatever Whisper does on Korean, and none of the Korean transcription products we are aware of publish a measured WER at all.

On transcribe.so, Korean audio runs through Qwen3-ASR-Flash by default. Word-level timestamps are supported, and diarization is included in the base price.

Japanese (日本語)

Qwen3-ASR-Flash: 3.09%
Qwen3-ASR-1.7B: 5.20%

Japanese scores 3.09% on FLEURS with Qwen3-ASR-Flash. The gap between Flash (3.09%) and the open-weights 1.7B variant (5.20%) is larger for Japanese than for most languages, so for Japanese specifically the hosted model transcribe.so ships is meaningfully stronger than the open-weights sibling that self-hosted Whisper alternatives are often compared against.

Japanese error rates are computed at character level because Japanese text has no whitespace word boundaries. Character-level and word-level percentages are not directly comparable across languages; compare models within a language instead.

As with Korean, OpenAI publishes no per-language FLEURS figure for Whisper large-v3 on Japanese, so no Whisper figure is shown. We do not estimate numbers a vendor has not published.

Chinese (Mandarin, 中文)

Qwen3-ASR-Flash: 2.38%
Qwen3-ASR-1.7B: 2.41%
Whisper large-v3: 4.09%
GPT-4o Transcribe: 2.44%

Chinese is the one CJK language with a clean same-split Whisper comparison. On the Qwen3-ASR-1.7B model card, the open-weights 1.7B model scores 2.41% against Whisper large-v3 at 4.09% on the same FLEURS splits, a 41% relative reduction in errors. The Flash model transcribe.so ships scores 2.38% in the technical report, and GPT-4o Transcribe reports 2.44%.

The gap widens for Cantonese: 3.50% for Flash and 3.98% for the 1.7B model against 9.18% for Whisper large-v3 on the same splits. On Cantonese benchmark audio, Whisper large-v3 makes roughly 2.3x as many errors as the 1.7B model and 2.6x as many as Flash.
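The relative comparisons in this section follow from simple arithmetic on the published error rates. The sketch below reproduces them; the helper names `relative_reduction` and `error_ratio` are illustrative, and the inputs are the same-split figures quoted above.

```python
def relative_reduction(baseline, candidate):
    """Fraction of the baseline's errors that the candidate avoids."""
    return (baseline - candidate) / baseline


def error_ratio(baseline, candidate):
    """How many times as many errors the baseline makes as the candidate."""
    return baseline / candidate


# Mandarin: Whisper large-v3 (4.09%) vs Qwen3-ASR-1.7B (2.41%).
print(f"{relative_reduction(4.09, 2.41):.1%}")  # 41.1%

# Cantonese: Whisper large-v3 (9.18%) vs the 1.7B model (3.98%) and Flash (3.50%).
print(f"{error_ratio(9.18, 3.98):.2f}x")  # 2.31x
print(f"{error_ratio(9.18, 3.50):.2f}x")  # 2.62x
```

A ratio of 2.31 means 2.31 times as many errors, which is 131% more errors rather than 231% more.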

The Qwen3-ASR family also supports 22 Chinese dialect and accent variants, which no other general-purpose ASR model currently matches.

English

Qwen3-ASR-Flash: 2.72%
Qwen3-ASR-1.7B: 3.35%
Whisper large-v3: 4.08%
GPT-4o Transcribe: 2.40%

English has the most independent evidence. On the HuggingFace Open ASR Leaderboard, where a third party re-runs every model on the same 8 datasets with the same normalization, Qwen3-ASR-1.7B averages 6.37% WER against 8.13% for Whisper large-v3. That is 21.6% fewer errors at near-identical throughput, measured by neither Alibaba nor us.

On FLEURS read speech, Flash scores 2.72% and GPT-4o Transcribe, which transcribe.so also ships for diarized English transcription, reports 2.40%, against 4.08% for Whisper large-v3 on the same splits per the Qwen3-ASR-1.7B model card.

The leaderboard average is harder than FLEURS because it includes meetings, earnings calls, and noisy web audio rather than read sentences. Expect absolute error rates on real-world recordings to look more like the leaderboard numbers than the FLEURS numbers, for every model.

Methodology

Where the numbers come from. We did not run these benchmarks ourselves. Every figure is a published number from one of the sources below, reproduced exactly. When a vendor has not published a number for a model and language, the cell says so. We never estimate a competitor's accuracy, and we do not compare numbers across different benchmarks.

FLEURS. A public benchmark from Google Research covering 102 languages. Each language's test set is read speech: sentences from the FLoRes translation corpus recorded by native speakers, a few hours of audio per language. Per-language figures in the table above are word error rates, except for languages without whitespace word boundaries (Chinese, Japanese, Cantonese), which are scored per character, as is standard practice in the cited sources.

Open ASR Leaderboard. Maintained by HuggingFace, with open-source evaluation scripts. Every submitted model is re-run by the leaderboard itself on 8 English datasets spanning meetings, earnings calls, audiobooks, talks, and parliament speech. Before scoring, both the model output and the reference pass through the Whisper text normalizer: lowercasing, punctuation removal, number and abbreviation expansion. This matters because normalization differences alone can move WER by whole percentage points, which is one reason vendor-published accuracy claims without stated normalization rules are not comparable. The reported figure is the macro-average across the 8 datasets. Snapshot used here: 2026-05-05.
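Two points in the paragraph above are easy to see in a small sketch: normalization changes which tokens count as errors, and a macro-average weights every dataset equally regardless of size. The normalizer here is a deliberately simplified stand-in (lowercasing and punctuation removal only), not the Whisper text normalizer the leaderboard uses, and the dataset names and error counts are made-up values for illustration only.

```python
import re


def simple_normalize(text):
    """Toy normalizer: lowercase and drop punctuation (no number expansion)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    return text.split()


def macro_average(wer_by_dataset):
    """Mean of per-dataset WERs: every dataset counts equally."""
    return sum(wer_by_dataset.values()) / len(wer_by_dataset)


def micro_average(counts_by_dataset):
    """Pooled WER: total errors over total reference words."""
    errors = sum(e for e, _ in counts_by_dataset.values())
    words = sum(n for _, n in counts_by_dataset.values())
    return errors / words


reference = "Okay, let's begin."
hypothesis = "okay lets begin"

# Without normalization all three tokens differ; with it, none do.
print(reference.split() == hypothesis.split())                      # False
print(simple_normalize(reference) == simple_normalize(hypothesis))  # True

# Hypothetical (errors, reference words) per dataset, not real leaderboard data.
counts = {"short_set": (10, 100), "long_set": (20, 1000)}
wers = {name: e / n for name, (e, n) in counts.items()}
print(f"macro: {macro_average(wers):.4f}")    # macro: 0.0600
print(f"micro: {micro_average(counts):.4f}")  # micro: 0.0273
```

With a macro-average, a small but hard dataset moves the headline number as much as a large easy one, which is worth keeping in mind when comparing it with pooled figures published elsewhere.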

Model versions. Qwen3-ASR-Flash refers to Qwen3-ASR-Flash-1208 as reported in the Qwen3-ASR technical report. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are the open-weights variants. Whisper large-v3 is OpenAI's strongest open Whisper release. GPT-4o Transcribe figures are from the same technical report (Table 3). Voxtral Small figures are Mistral's own, from the Voxtral paper (Table 4).

What transcribe.so ships. The default high-accuracy pipeline runs Qwen3-ASR-Flash. GPT-4o Transcribe Diarize and Voxtral Mini Transcribe are also available, and the platform picks the best model per file based on language, length, and whether speaker labels are needed. Diarization is included in the base price, not an add-on.

Honest limitations

FLEURS is read speech recorded in quiet conditions. Real meetings, lectures, and podcasts are noisier and more spontaneous, so absolute error rates on your audio will be higher for every model on this page. The relative ordering between models is the informative part.
The per-language table leans on one corpus. A model can rank differently on other test sets, accents, or domains. Where a multi-domain comparison exists (English, via the Open ASR Leaderboard), we show it separately.
Whisper large-v3 has no vendor-published per-language FLEURS results for most languages, including Korean and Japanese. We mark those cells "not published" instead of estimating, so a same-split Whisper comparison exists for only 3 of the 29 languages in the table.
The Open ASR Leaderboard ranks the open-weights Qwen3-ASR-1.7B, not the hosted Qwen3-ASR-Flash that transcribe.so ships. Qwen positions Flash as the stronger top tier of the same family, but Flash itself is not independently ranked there.
Benchmarks do not cover heavily accented speech, domain jargon (medical, legal), overlapping speakers, or very low-quality recordings. No public benchmark we are aware of does this well across languages.
Accuracy claims you may see elsewhere, such as "98.86% accurate", are usually published without a test set, normalization rules, or model version, and cannot be compared to anything on this page.

Test it on your own audio

The only benchmark that matters is your recording. Free credits included, no credit card required, and you see the exact price before you confirm.
