Where the numbers come from. We did not run these benchmarks ourselves. Every figure is a published number from one of the sources below, reproduced exactly. When a vendor has not published a number for a model and language, the cell says so. We never estimate a competitor's accuracy, and we do not compare numbers across different benchmarks.
FLEURS. A public benchmark from Google Research covering 102 languages. Each language's test set is read speech: sentences from the FLoRes translation corpus recorded by native speakers, a few hours of audio per language. Per-language figures in the table above are word error rates (WER), except for languages without whitespace word boundaries (Chinese, Japanese, Cantonese), which are scored per character (character error rate, CER), as is standard practice in the cited sources.
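To make the word-versus-character distinction concrete, here is a minimal sketch of both metrics. It assumes the usual definition (edit distance divided by reference length) and is not the scoring code used by any of the cited sources; the function names and example sentences are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Errors per reference word; tokens are split on whitespace."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def char_error_rate(reference, hypothesis):
    """Errors per reference character; whitespace is ignored."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# One wrong word out of six.
english_wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")

# One wrong character out of six. Whitespace splitting would treat the whole
# sentence as a single "word", so WER would jump to 1.0 for the same mistake.
chinese_cer = char_error_rate("今天天气很好", "今天天汽很好")
chinese_wer = word_error_rate("今天天气很好", "今天天汽很好")
```

The last two lines show why unsegmented scripts are scored per character: a single wrong character counts as one error in six under CER, but as a completely wrong transcript under whitespace-based WER.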
Open ASR Leaderboard. Maintained by Hugging Face, with open-source evaluation scripts. Every submitted model is re-run by the leaderboard itself on 8 English datasets spanning meetings, earnings calls, audiobooks, talks, and parliamentary speech. Before scoring, both the model output and the reference pass through the Whisper text normalizer: lowercasing, punctuation removal, number and abbreviation expansion. This matters because normalization differences alone can move WER by whole percentage points, which is one reason vendor-published accuracy claims without stated normalization rules are not comparable. The reported figure is the macro-average across the 8 datasets, meaning each dataset's WER counts equally regardless of its size. Snapshot used here: 2026-05-05.
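The effect of normalization and the meaning of a macro-average can be sketched in a few lines. This is an illustration under stated assumptions, not the leaderboard's code: `simple_normalize` is a toy stand-in that only lowercases and strips ASCII punctuation (the real Whisper normalizer does considerably more), and the dataset names and WER values passed to `macro_average` are made up for the example.

```python
import string


def simple_normalize(text):
    """Toy normalizer (assumption): lowercase and strip ASCII punctuation only."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def word_errors(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.split()
    return word_errors(ref, hypothesis.split()) / len(ref)


def macro_average(per_dataset_wer):
    """Unweighted mean of per-dataset WERs: every dataset counts equally."""
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)


reference = "Hello, world. It costs ten dollars."
hypothesis = "hello world it costs ten dollars"

# Without normalization, casing and punctuation count as word errors.
raw_wer = wer(reference, hypothesis)

# With the same normalizer applied to both sides, the words match exactly.
normalized_wer = wer(simple_normalize(reference), simple_normalize(hypothesis))

# Hypothetical per-dataset scores, for illustration only.
average = macro_average({"dataset_a": 0.10, "dataset_b": 0.20, "dataset_c": 0.06})
```

In this toy case the transcript is perfect once both sides are normalized, yet four of the six reference words score as errors on the raw strings. That gap is the reason two WER figures are only comparable when the normalization rules are known to be the same.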
Model versions. Qwen3-ASR-Flash refers to Qwen3-ASR-Flash-1208 as reported in the Qwen3-ASR technical report. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are the open-weights variants. Whisper large-v3 is OpenAI's strongest open Whisper release. GPT-4o Transcribe figures are from the same Qwen3-ASR technical report (Table 3). Voxtral Small figures are Mistral's own, from the Voxtral paper (Table 4).
What transcribe.so ships. The default high-accuracy pipeline runs Qwen3-ASR-Flash. GPT-4o Transcribe Diarize and Voxtral Mini Transcribe are also available, and the platform picks the best model per file based on language, length, and whether speaker labels are needed. Diarization is included in the base price, not an add-on.