Every ASR Model on Transcribe.so: Benchmarks, Pricing, and When to Use Each
Why we support multiple ASR models
There is no single best transcription model. The right model depends on your content — how many speakers, what language, how long the recording, and whether you need word-level timestamps or speaker labels.
That's why Transcribe.so lets you choose your ASR pipeline per transcription. Today we support three world-class models, with three more coming soon. Every model feeds into the same downstream AI pipeline: topics, chapters, summaries, semantic search, Q&A with citations, and subtitle export.
Here's the full breakdown.
Currently supported models
GPT-4o Transcribe Diarize
Provider: OpenAI
Best for: Multi-speaker content where "who said what" matters
GPT-4o Transcribe Diarize is OpenAI's premium transcription model with built-in speaker identification: a single API call returns both the transcript and speaker labels, no separate diarization pass required. If your audio has multiple speakers, this is the model to use.
| Spec | Detail |
|---|---|
| Speaker diarization | Yes (automatic speaker labels) |
| Languages | 57 |
| Timestamp type | Segment-level with speaker attribution |
| Max audio duration | Unlimited (chunked processing) |
| Word-level timestamps | No (segment-level) |
| Emotion detection | No |
Pricing on Transcribe.so:
| Tier | Rate |
|---|---|
| Free | $3.88/hr |
| Basic ($12/mo) | $3.61/hr |
| Plus ($39/mo) | $3.45/hr |
| Pro ($99/mo) | $3.18/hr |
When to choose GPT-4o Diarize:
- Podcasts, interviews, meetings, panel discussions
- Any content where speaker labels are essential
- Multi-speaker audio where you need to know who said what
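To make the value of segment-level speaker attribution concrete, here is a minimal sketch of turning diarized segments into a readable "who said what" transcript. The segment shape (`speaker`, `start`, `end`, `text`) is an assumption for illustration, not the exact API response format.

```python
# Sketch: collapse consecutive same-speaker segments into readable turns.
# The dict shape below is an assumed example of diarized output.

def merge_turns(segments):
    """Merge consecutive segments by the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(dict(seg))
    return turns

segments = [
    {"speaker": "A", "start": 0.0, "end": 3.2, "text": "Welcome back."},
    {"speaker": "A", "start": 3.2, "end": 6.0, "text": "Today's guest is here."},
    {"speaker": "B", "start": 6.1, "end": 8.4, "text": "Thanks for having me."},
]

for t in merge_turns(segments):
    print(f'[{t["start"]:.1f}-{t["end"]:.1f}] Speaker {t["speaker"]}: {t["text"]}')
```

With speaker labels already in the output, producing meeting minutes or podcast show notes is a simple post-processing step like this rather than a separate diarization job.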
Qwen3-ASR-Flash
Provider: Alibaba Qwen
Best for: Maximum accuracy, word-level timestamps, long-form audio, Chinese dialects
Qwen3-ASR-Flash is ranked #1 on the HuggingFace Open ASR Leaderboard with a 4.25% average Word Error Rate — nearly 2x better than Whisper-large-v3.
| Spec | Detail |
|---|---|
| Speaker diarization | No |
| Languages | 52 + 22 Chinese dialects |
| Timestamp type | Sentence + word-level (10 languages) |
| Max audio duration | 12 hours native (no chunking) |
| Word-level timestamps | Yes |
| Emotion detection | Yes |
Pricing on Transcribe.so:
| Tier | Rate |
|---|---|
| Free | $1.71/hr |
| Basic ($12/mo) | $1.59/hr |
| Plus ($39/mo) | $1.52/hr |
| Pro ($99/mo) | $1.40/hr |
For a detailed deep-dive on Qwen3-ASR-Flash, see the launch announcement.
When to choose Qwen3-ASR-Flash:
- Single-speaker content (lectures, audiobooks, webinars)
- Subtitle generation (word-level timestamps enable precise cue boundaries)
- Long-form audio (3+ hours) — 12-hour native support means no chunking artifacts
- Chinese dialect content (Cantonese, Sichuanese, Fujian, and 19 more)
- When you want the lowest WER available
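Why do word-level timestamps matter for subtitles? Because cue boundaries can fall exactly on word edges instead of being interpolated from sentence timing. Here is a simplified sketch of the idea; the `(text, start, end)` word tuples are an assumed shape, and the real export on Transcribe.so is handled by the built-in subtitle exporter.

```python
# Sketch: pack word-level timestamps into SRT cues with a max line length.
# Word tuples (text, start_sec, end_sec) are an assumed input shape.

def srt_time(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    ms = round((s - int(s)) * 1000)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

def words_to_srt(words, max_chars=42):
    """Greedily fill cues up to max_chars, breaking on word boundaries."""
    cues, line = [], []
    for w in words:
        candidate = " ".join(x[0] for x in line + [w])
        if line and len(candidate) > max_chars:
            cues.append(line)
            line = [w]
        else:
            line.append(w)
    if line:
        cues.append(line)
    out = []
    for i, cue in enumerate(cues, 1):
        start, end = cue[0][1], cue[-1][2]
        text = " ".join(w[0] for w in cue)
        out.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}")
    return "\n\n".join(out)

words = [("Every", 0.0, 0.3), ("word", 0.35, 0.6), ("gets", 0.65, 0.8),
         ("its", 0.85, 0.95), ("own", 1.0, 1.2), ("timing", 1.25, 1.7)]
print(words_to_srt(words, max_chars=15))
```

Each cue's start and end come from the first and last word it contains, so timing stays accurate no matter where the line breaks land. With only sentence-level timestamps, a mid-sentence break would force you to guess.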
Benchmark comparison: Open ASR Leaderboard
The HuggingFace Open ASR Leaderboard is the most widely used community benchmark for speech-to-text models. It evaluates models across 9 diverse test sets and reports average Word Error Rate (WER). Lower is better.
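For readers new to the metric: WER is the word-level edit distance between the model's output and a reference transcript, divided by the number of reference words. A minimal sketch:

```python
# Sketch: Word Error Rate via word-level edit distance.
# WER = (substitutions + deletions + insertions) / reference word count.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

# One substituted word in a four-word reference -> 25% WER
print(wer("the cat sat down", "the cat sat dawn"))  # 0.25
```

So a 4.25% average WER means roughly one error per 24 words of reference text, averaged across the benchmark's test sets.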
Qwen3-ASR-Flash vs other top models
| Dataset | Qwen3-ASR-Flash | NVIDIA Canary-1B | Whisper-large-v3 | Whisper-large-v3-turbo |
|---|---|---|---|---|
| LibriSpeech Clean | 1.61% | ~2.5% | ~2.7% | ~3.0% |
| LibriSpeech Other | 2.88% | ~5.0% | ~5.5% | ~6.0% |
| SPGISpeech | 2.06% | ~3.5% | ~4.0% | ~4.2% |
| Tedlium | 3.20% | ~5.5% | ~4.5% | ~5.0% |
| VoxPopuli | 6.39% | ~7.0% | ~8.5% | ~9.0% |
| Common Voice 9 | 7.42% | ~9.0% | ~10.0% | ~11.0% |
| GigaSpeech | 8.88% | ~10.0% | ~11.0% | ~11.5% |
| Earnings22 | 10.68% | ~12.0% | ~14.0% | ~15.0% |
| AMI | 11.29% | ~15.0% | ~16.0% | ~17.0% |
| Average WER | 4.25% | ~7.5% | ~8.0% | ~8.5% |
Qwen3-ASR-Flash leads on all nine benchmark datasets (competitor figures above are approximate leaderboard values).
Artificial Analysis rankings (AA-WER v2.0)
Artificial Analysis uses a different benchmark methodology (AA-AgentTalk 50%, VoxPopuli-Cleaned-AA 25%, Earnings22-Cleaned-AA 25%) and ranks models independently.
| Rank | Model | Provider | AA-WER |
|---|---|---|---|
| 1 | Scribe v2 | ElevenLabs | 2.3% |
| 2 | Gemini 3 Pro | Google | 2.9% |
| 3 | Voxtral Small | Mistral | 3.0% |
| 4 | Gemini 2.5 Pro | Google | 3.1% |
| 5 | Gemini 3 Flash | Google | 3.1% |
A note on benchmark methodology: Qwen3-ASR-Flash is not yet listed on Artificial Analysis, and the two leaderboards use different test sets and scoring. Direct WER numbers aren't comparable across leaderboards — a model scoring 4.25% on the Open ASR Leaderboard's 9-dataset average isn't necessarily "worse" than one scoring 2.3% on Artificial Analysis's 3-dataset composite. What matters is that both leaderboards identify the top-performing models, and we plan to support the best from each.
Voxtral Mini Transcribe
Provider: Mistral AI
Best for: Word-level timestamps, subtitle generation, budget-friendly transcription
Voxtral Mini Transcribe is Mistral AI's dedicated transcription model with word-level timestamps and speaker diarization across 40 languages. At $0.003/min for transcription, it's the most cost-effective option with word-level precision.
| Spec | Detail |
|---|---|
| Speaker diarization | Yes |
| Languages | 40 |
| Timestamp type | Sentence + word-level (all languages) |
| Context biasing | Yes — up to 100 custom terms |
| Word-level timestamps | Yes |
| AA-WER | 3.0% (Voxtral Small) |
When to choose Voxtral Mini Transcribe:
- Subtitle generation where every word needs precise timing
- Budget-conscious transcription — lowest transcription cost per minute
- Content with proper nouns or technical terms (context biasing helps accuracy)
- Multi-speaker content requiring both diarization and word timestamps
Coming soon
We're adding three more ASR pipelines. Each will be available as an additional option in the pipeline selector, with the same downstream AI analysis (topics, chapters, search, Q&A, subtitles).
ElevenLabs Scribe v2
Provider: ElevenLabs
AA-WER: 2.3% — #1 on Artificial Analysis
| Spec | Detail |
|---|---|
| Speaker diarization | Yes |
| Languages | 99 |
| Timestamps | Word-level |
| Latency | Low (optimized for real-time) |
| Notable | Highest accuracy on Artificial Analysis, supports audio events and sound detection |
Why we're adding it: Scribe v2 tops the Artificial Analysis leaderboard with the lowest WER of any model tested. Combined with 99-language support and speaker diarization, it could be the best all-around option for many use cases.
Google Gemini
Provider: Google DeepMind
AA-WER: 2.9% (Gemini 3 Pro) / 3.1% (Gemini 3 Flash)
| Spec | Detail |
|---|---|
| Speaker diarization | Varies by model |
| Languages | 100+ |
| Timestamps | Varies |
| Context window | Up to 1M tokens (audio native) |
| Notable | Multimodal — can process audio natively alongside text and video |
Why we're adding it: Gemini's multimodal architecture processes audio natively rather than converting to text through a separate ASR pipeline. The long context window means entire recordings can be processed in a single pass, and Google's models consistently rank in the top 5 on Artificial Analysis.
Amazon Transcribe
Provider: AWS
| Spec | Detail |
|---|---|
| Speaker diarization | Yes |
| Languages | 100+ |
| Custom vocabulary | Yes (domain-specific terms) |
| Custom language models | Yes |
| Notable | Enterprise-grade with HIPAA eligibility, PCI DSS compliance, custom vocabulary for domain-specific accuracy |
Why we're adding it: Amazon Transcribe is the enterprise choice. Custom vocabulary support means medical, legal, and technical content gets domain-specific accuracy improvements that general models can't match. AWS compliance certifications make it suitable for regulated industries.
Model selection guide
Current models
| Use case | Recommended | Why |
|---|---|---|
| Multiple speakers (podcast, meeting, interview) | GPT-4o Diarize | Built-in speaker labels — see the podcast transcription guide for show notes best practices |
| Single speaker, maximum accuracy | Qwen3-ASR-Flash | #1 WER on Open ASR Leaderboard |
| Subtitle generation | Qwen3-ASR-Flash | Word-level timestamps for precise cue boundaries — see the subtitle export comparison |
| Chinese dialects | Qwen3-ASR-Flash | 22 dialect support |
| Long-form audio (3+ hours) | Qwen3-ASR-Flash | 12-hour native, no chunking. Longer audio also benefits from automatic chapter generation |
| Budget-conscious | Qwen3-ASR-Flash | $1.71/hr vs $3.88/hr |
| Meeting transcription with speaker IDs | GPT-4o Diarize | Automatic speaker identification |
When upcoming models arrive
| Use case | Recommended | Why |
|---|---|---|
| Best overall accuracy + diarization | ElevenLabs Scribe v2 | 2.3% WER + speaker labels + 99 languages |
| Multimodal / video+audio analysis | Google Gemini | Native audio understanding in multimodal context |
| Open-source preference | Mistral Voxtral | Best open-weight ASR (3.0% WER) |
| Enterprise / regulated industry | Amazon Transcribe | HIPAA, custom vocabulary, compliance certifications |
| Maximum language coverage | ElevenLabs Scribe v2 or Google Gemini | 99-100+ languages |
How pricing works
Every model on Transcribe.so follows the same pricing structure: pay-per-minute with no lock-in. Optional subscription tiers reduce the per-minute rate.
The transcription cost varies by model, but the downstream AI pipeline (GPT-4.1 for analysis, text-embedding-3-large for semantic search, infrastructure) is shared across all models.
| Component | GPT-4o Pipeline | Qwen3 Pipeline |
|---|---|---|
| Transcription API | $1.80/hr | $0.13/hr |
| LLM analysis (GPT-4.1) | $0.48/hr | $0.48/hr |
| Embeddings | $0.06/hr | $0.06/hr |
| Infrastructure | $1.00/hr | $1.00/hr |
| Provider total | $3.34/hr | $1.67/hr |
Upcoming models will have their own transcription API rates, but the shared pipeline cost stays the same.
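The cost structure in the table above can be expressed as a simple sum of components, with only the transcription API line varying by model:

```python
# Sketch: hourly provider cost per pipeline, using the component
# figures from the table above (all in $/hr).

COMPONENTS = {
    "gpt4o": {"transcription_api": 1.80, "llm_analysis": 0.48,
              "embeddings": 0.06, "infrastructure": 1.00},
    "qwen3": {"transcription_api": 0.13, "llm_analysis": 0.48,
              "embeddings": 0.06, "infrastructure": 1.00},
}

def cost_for(pipeline, hours):
    """Total provider cost for a recording of the given length."""
    hourly = sum(COMPONENTS[pipeline].values())
    return round(hourly * hours, 2)

print(cost_for("gpt4o", 1))  # 3.34
print(cost_for("qwen3", 1))  # 1.67
```

Note that only the first row differs between the two pipelines; the LLM analysis, embedding, and infrastructure costs are identical because the downstream pipeline is shared.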
What every model gets
Regardless of which ASR model you choose, every transcription on Transcribe.so gets the same AI enrichment:
- Topic detection and keyword extraction
- Chapter generation with titles and summaries
- Semantic search across your transcript library (3072-dimensional embeddings)
- AI Q&A with citations — ask questions, get answers with exact timestamps
- AI summary with takeaways, key quotes, and speaker profiles
- Subtitle export — SRT, VTT, karaoke VTT, and JSON with full constraint controls
The ASR model is the first step. Everything after it is the same pipeline.
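For the curious, semantic search in this kind of pipeline works by embedding both the query and each transcript segment as vectors, then ranking segments by cosine similarity. The toy 3-dimensional vectors below stand in for the real 3072-dimensional text-embedding-3-large vectors; the segment names and values are invented for illustration.

```python
# Sketch: semantic search in principle. Embed the query, rank transcript
# segments by cosine similarity. Vectors here are toy stand-ins for real
# 3072-dimensional embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

segments = {
    "pricing discussion": [0.9, 0.1, 0.0],
    "product roadmap":    [0.1, 0.8, 0.2],
    "closing remarks":    [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # stand-in embedding of "how much does it cost?"

ranked = sorted(segments, key=lambda k: cosine(query, segments[k]), reverse=True)
print(ranked[0])  # "pricing discussion" ranks first
```

Because similarity is computed on meaning-bearing vectors rather than keywords, a query like "how much does it cost?" can surface a segment that never uses the word "cost" at all.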
Benchmarks and leaderboards
Two independent leaderboards track ASR model performance. We reference both when evaluating models:
- HuggingFace Open ASR Leaderboard — Community benchmark using 9 diverse test sets (LibriSpeech, AMI, Earnings22, GigaSpeech, etc.). Reports average WER. Qwen3-ASR-Flash is #1.
- Artificial Analysis — Speech-to-Text — Independent benchmark using AA-WER v2.0 methodology (AA-AgentTalk, VoxPopuli-Cleaned-AA, Earnings22-Cleaned-AA). Includes speed and pricing comparisons. ElevenLabs Scribe v2 is #1.
Different methodologies, different rankings — both valuable. We aim to support the top models from each.
Related
- Choose Your ASR Model: One Platform, Every Top Speech-to-Text Model — why model choice matters and how the single-workflow approach works
- AI Transcription for Content Creators — subtitle and chapter workflows for YouTube, TikTok, and podcasts
- How to Import AI Subtitles into CapCut, Premiere Pro, DaVinci Resolve & Final Cut Pro — step-by-step import guide for each editor
Try it
Choose your model at transcribe.so/transcribe. Upload a file or paste a YouTube URL, pick your pipeline, and get results in minutes. All plans include every AI feature — no per-feature upsells.