Every ASR Model on Transcribe.so: Benchmarks, Pricing, and When to Use Each
Why we support multiple ASR models
There is no single best transcription model. The right model depends on your content — how many speakers, what language, how long the recording, and whether you need word-level timestamps or speaker labels.
That's why Transcribe.so lets you choose your ASR pipeline per transcription. Today we support three world-class models, with three more coming soon. Every model feeds into the same downstream AI pipeline: topics, chapters, summaries, semantic search, Q&A with citations, and subtitle export.
Here's the full breakdown.
Currently supported models
GPT-4o Transcribe Diarize
Provider: OpenAI
Best for: Multi-speaker content where "who said what" matters
GPT-4o Transcribe Diarize is OpenAI's premium transcription model with built-in speaker identification: a single API call returns both the transcript and speaker labels, no separate diarization pass required. If your audio has multiple speakers, this is the model to use.
| Spec | Detail |
|---|---|
| Speaker diarization | Yes (automatic speaker labels) |
| Languages | 57 |
| Timestamp type | Segment-level with speaker attribution |
| Max audio duration | Unlimited (chunked processing) |
| Word-level timestamps | No (segment-level) |
| Emotion detection | No |
Pricing on Transcribe.so:
| Tier | Rate |
|---|---|
| Free | $3.88/hr |
| Basic ($12/mo) | $3.61/hr |
| Plus ($39/mo) | $3.45/hr |
| Pro ($99/mo) | $3.18/hr |
When to choose GPT-4o Diarize:
- Podcasts, interviews, meetings, panel discussions
- Any content where speaker labels are essential
- Multi-speaker audio where you need to know who said what
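To make the value of segment-level speaker attribution concrete, here is a minimal sketch of turning diarized segments into a readable "who said what" transcript. The segment shape (`speaker`, `start`, `end`, `text`) is an assumption for illustration, not the exact API response format.

```python
# Sketch: collapse consecutive same-speaker segments into readable turns.
# The dict shape below is an assumed example of diarized output.

def merge_turns(segments):
    """Merge consecutive segments by the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(dict(seg))
    return turns

segments = [
    {"speaker": "A", "start": 0.0, "end": 3.2, "text": "Welcome back."},
    {"speaker": "A", "start": 3.2, "end": 6.0, "text": "Today's guest is here."},
    {"speaker": "B", "start": 6.1, "end": 8.4, "text": "Thanks for having me."},
]

for t in merge_turns(segments):
    print(f'[{t["start"]:.1f}-{t["end"]:.1f}] Speaker {t["speaker"]}: {t["text"]}')
```

With speaker labels already in the output, producing meeting minutes or podcast show notes is a simple post-processing step like this rather than a separate diarization job.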
Qwen3-ASR-Flash
Provider: Alibaba Qwen
Best for: Maximum accuracy, word-level timestamps, long-form audio, Chinese dialects
Qwen3-ASR-Flash is ranked #1 on the HuggingFace Open ASR Leaderboard with a 4.25% average Word Error Rate — nearly 2x better than Whisper-large-v3.
| Spec | Detail |
|---|---|
| Speaker diarization | No |
| Languages | 52 + 22 Chinese dialects |
| Timestamp type | Sentence + word-level (10 languages) |
| Max audio duration | 12 hours native (no chunking) |
| Word-level timestamps | Yes |
| Emotion detection | Yes |
Pricing on Transcribe.so:
| Tier | Rate |
|---|---|
| Free | $1.71/hr |
| Basic ($12/mo) | $1.59/hr |
| Plus ($39/mo) | $1.52/hr |
| Pro ($99/mo) | $1.40/hr |
For a detailed deep-dive on Qwen3-ASR-Flash, see the launch announcement.
When to choose Qwen3-ASR-Flash:
- Single-speaker content (lectures, audiobooks, webinars)
- Subtitle generation (word-level timestamps enable precise cue boundaries)
- Long-form audio (3+ hours) — 12-hour native support means no chunking artifacts
- Chinese dialect content (Cantonese, Sichuanese, Fujian, and 19 more)
- When you want the lowest WER available
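Why do word-level timestamps matter for subtitles? Because cue boundaries can fall exactly on word edges instead of being interpolated from sentence timing. Here is a simplified sketch of the idea; the `(text, start, end)` word tuples are an assumed shape, and the real export on Transcribe.so is handled by the built-in subtitle exporter.

```python
# Sketch: pack word-level timestamps into SRT cues with a max line length.
# Word tuples (text, start_sec, end_sec) are an assumed input shape.

def srt_time(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    ms = round((s - int(s)) * 1000)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

def words_to_srt(words, max_chars=42):
    """Greedily fill cues up to max_chars, breaking on word boundaries."""
    cues, line = [], []
    for w in words:
        candidate = " ".join(x[0] for x in line + [w])
        if line and len(candidate) > max_chars:
            cues.append(line)
            line = [w]
        else:
            line.append(w)
    if line:
        cues.append(line)
    out = []
    for i, cue in enumerate(cues, 1):
        start, end = cue[0][1], cue[-1][2]
        text = " ".join(w[0] for w in cue)
        out.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}")
    return "\n\n".join(out)

words = [("Every", 0.0, 0.3), ("word", 0.35, 0.6), ("gets", 0.65, 0.8),
         ("its", 0.85, 0.95), ("own", 1.0, 1.2), ("timing", 1.25, 1.7)]
print(words_to_srt(words, max_chars=15))
```

Each cue's start and end come from the first and last word it contains, so timing stays accurate no matter where the line breaks land. With only sentence-level timestamps, a mid-sentence break would force you to guess.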
Benchmark comparison: Open ASR Leaderboard
The HuggingFace Open ASR Leaderboard is the most widely used community benchmark for speech-to-text models. It evaluates models across 9 diverse test sets and reports average Word Error Rate (WER). Lower is better.
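For readers new to the metric: WER is the word-level edit distance between the model's output and a reference transcript, divided by the number of reference words. A minimal sketch:

```python
# Sketch: Word Error Rate via word-level edit distance.
# WER = (substitutions + deletions + insertions) / reference word count.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

# One substituted word in a four-word reference -> 25% WER
print(wer("the cat sat down", "the cat sat dawn"))  # 0.25
```

So a 4.25% average WER means roughly one error per 24 words of reference text, averaged across the benchmark's test sets.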
Qwen3-ASR-Flash vs other top models
| Dataset | Qwen3-ASR-Flash | NVIDIA Canary-1B | Whisper-large-v3 | Whisper-large-v3-turbo |
|---|---|---|---|---|
| LibriSpeech Clean | 1.61% | ~2.5% | ~2.7% | ~3.0% |
| LibriSpeech Other | 2.88% | ~5.0% | ~5.5% | ~6.0% |
| SPGISpeech | 2.06% | ~3.5% | ~4.0% | ~4.2% |
| Tedlium | 3.20% | ~5.5% | ~4.5% | ~5.0% |
| VoxPopuli | 6.39% | ~7.0% | ~8.5% | ~9.0% |
| Common Voice 9 | 7.42% | ~9.0% | ~10.0% | ~11.0% |
| GigaSpeech | 8.88% | ~10.0% | ~11.0% | ~11.5% |
| Earnings22 | 10.68% | ~12.0% | ~14.0% | ~15.0% |
| AMI | 11.29% | ~15.0% | ~16.0% | ~17.0% |
| Average WER | 4.25% | ~7.5% | ~8.0% | ~8.5% |
Qwen3-ASR-Flash leads on all nine benchmark datasets (competitor figures above are approximate leaderboard values).
Artificial Analysis rankings (AA-WER v2.0)
Artificial Analysis uses a different benchmark methodology (AA-AgentTalk 50%, VoxPopuli-Cleaned-AA 25%, Earnings22-Cleaned-AA 25%) and ranks models independently.
| Rank | Model | Provider | AA-WER |
|---|---|---|---|
| 1 | Scribe v2 | ElevenLabs | 2.3% |
| 2 | Gemini 3 Pro | Google | 2.9% |
| 3 | Voxtral Small | Mistral | 3.0% |
| 4 | Gemini 2.5 Pro | Google | 3.1% |
| 5 | Gemini 3 Flash | Google | 3.1% |
A note on benchmark methodology: Qwen3-ASR-Flash is not yet listed on Artificial Analysis, and the two leaderboards use different test sets and scoring. Direct WER numbers aren't comparable across leaderboards — a model scoring 4.25% on the Open ASR Leaderboard's 9-dataset average isn't necessarily "worse" than one scoring 2.3% on Artificial Analysis's 3-dataset composite. What matters is that both leaderboards identify the top-performing models, and we plan to support the best from each.
Voxtral Mini Transcribe
Provider: Mistral AI
Best for: Word-level timestamps, subtitle generation, budget-friendly transcription
Voxtral Mini Transcribe is Mistral AI's dedicated transcription model with word-level timestamps and speaker diarization across 40 languages. At $0.003/min for transcription, it's the most cost-effective option with word-level precision.
| Spec | Detail |
|---|---|
| Speaker diarization | Yes |
| Languages | 40 |
| Timestamp type | Sentence + word-level (all languages) |
| Context biasing | Yes — up to 100 custom terms |
| Word-level timestamps | Yes |
| AA-WER | 3.0% (Voxtral Small) |
When to choose Voxtral Mini Transcribe:
- Subtitle generation where every word needs precise timing
- Budget-conscious transcription — lowest transcription cost per minute
- Content with proper nouns or technical terms (context biasing helps accuracy)
- Multi-speaker content requiring both diarization and word timestamps
Coming soon
We're adding three more ASR pipelines. Each will be available as an additional option in the pipeline selector, with the same downstream AI analysis (topics, chapters, search, Q&A, subtitles).
ElevenLabs Scribe v2
Provider: ElevenLabs
AA-WER: 2.3% — #1 on Artificial Analysis
| Spec | Detail |
|---|---|
| Speaker diarization | Yes |
| Languages | 99 |
| Timestamps | Word-level |
| Latency | Low (optimized for real-time) |
| Notable | Highest accuracy on Artificial Analysis, supports audio events and sound detection |
Why we're adding it: Scribe v2 tops the Artificial Analysis leaderboard with the lowest WER of any model tested. Combined with 99-language support and speaker diarization, it could be the best all-around option for many use cases.
Google Gemini
Provider: Google DeepMind
AA-WER: 2.9% (Gemini 3 Pro) / 3.1% (Gemini 3 Flash)
| Spec | Detail |
|---|---|
| Speaker diarization | Varies by model |
| Languages | 100+ |
| Timestamps | Varies |
| Context window | Up to 1M tokens (audio native) |
| Notable | Multimodal — can process audio natively alongside text and video |
Why we're adding it: Gemini's multimodal architecture processes audio natively rather than converting to text through a separate ASR pipeline. The long context window means entire recordings can be processed in a single pass, and Google's models consistently rank in the top 5 on Artificial Analysis.
Amazon Transcribe
Provider: AWS
| Spec | Detail |
|---|---|
| Speaker diarization | Yes |
| Languages | 100+ |
| Custom vocabulary | Yes (domain-specific terms) |
| Custom language models | Yes |
| Notable | Enterprise-grade with HIPAA eligibility, PCI DSS compliance, custom vocabulary for domain-specific accuracy |
Why we're adding it: Amazon Transcribe is the enterprise choice. Custom vocabulary support means medical, legal, and technical content gets domain-specific accuracy improvements that general models can't match. AWS compliance certifications make it suitable for regulated industries.
Model selection guide
Current models
| Use case | Recommended | Why |
|---|---|---|
| Multiple speakers (podcast, meeting, interview) | GPT-4o Diarize | Built-in speaker labels — see the podcast transcription guide for show notes best practices |
| Single speaker, maximum accuracy | Qwen3-ASR-Flash | #1 WER on Open ASR Leaderboard |
| Subtitle generation | Qwen3-ASR-Flash | Word-level timestamps for precise cue boundaries — see the subtitle export comparison |
| Chinese dialects | Qwen3-ASR-Flash | 22 dialect support |
| Long-form audio (3+ hours) | Qwen3-ASR-Flash | 12-hour native, no chunking. Longer audio also benefits from automatic chapter generation |
| Budget-conscious | Qwen3-ASR-Flash | $1.71/hr vs $3.88/hr |
| Meeting transcription with speaker IDs | GPT-4o Diarize | Automatic speaker identification |
When upcoming models arrive
| Use case | Recommended | Why |
|---|---|---|
| Best overall accuracy + diarization | ElevenLabs Scribe v2 | 2.3% WER + speaker labels + 99 languages |
| Multimodal / video+audio analysis | Google Gemini | Native audio understanding in multimodal context |
| Open-source preference | Mistral Voxtral | Best open-weight ASR (3.0% WER) |
| Enterprise / regulated industry | Amazon Transcribe | HIPAA, custom vocabulary, compliance certifications |
| Maximum language coverage | ElevenLabs Scribe v2 or Google Gemini | 99-100+ languages |
How pricing works
Every model on Transcribe.so follows the same pricing structure: pay-per-minute with no lock-in. Optional subscription tiers reduce the per-minute rate.
The transcription cost varies by model, but the downstream AI pipeline (GPT-4.1 for analysis, text-embedding-3-large for semantic search, infrastructure) is shared across all models.
| Component | GPT-4o Pipeline | Qwen3 Pipeline |
|---|---|---|
| Transcription API | $1.80/hr | $0.13/hr |
| LLM analysis (GPT-4.1) | $0.48/hr | $0.48/hr |
| Embeddings | $0.06/hr | $0.06/hr |
| Infrastructure | $1.00/hr | $1.00/hr |
| Provider total | $3.34/hr | $1.67/hr |
Upcoming models will have their own transcription API rates, but the shared pipeline cost stays the same.
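The cost structure in the table above can be expressed as a simple sum of components, with only the transcription API line varying by model:

```python
# Sketch: hourly provider cost per pipeline, using the component
# figures from the table above (all in $/hr).

COMPONENTS = {
    "gpt4o": {"transcription_api": 1.80, "llm_analysis": 0.48,
              "embeddings": 0.06, "infrastructure": 1.00},
    "qwen3": {"transcription_api": 0.13, "llm_analysis": 0.48,
              "embeddings": 0.06, "infrastructure": 1.00},
}

def cost_for(pipeline, hours):
    """Total provider cost for a recording of the given length."""
    hourly = sum(COMPONENTS[pipeline].values())
    return round(hourly * hours, 2)

print(cost_for("gpt4o", 1))  # 3.34
print(cost_for("qwen3", 1))  # 1.67
```

Note that only the first row differs between the two pipelines; the LLM analysis, embedding, and infrastructure costs are identical because the downstream pipeline is shared.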
What every model gets
Regardless of which ASR model you choose, every transcription on Transcribe.so gets the same AI enrichment:
- Topic detection and keyword extraction
- Chapter generation with titles and summaries
- Semantic search across your transcript library (3072-dimensional embeddings)
- AI Q&A with citations — ask questions, get answers with exact timestamps
- AI summary with takeaways, key quotes, and speaker profiles
- Subtitle export — SRT, VTT, karaoke VTT, and JSON with full constraint controls
The ASR model is the first step. Everything after it is the same pipeline.
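For the curious, semantic search in this kind of pipeline works by embedding both the query and each transcript segment as vectors, then ranking segments by cosine similarity. The toy 3-dimensional vectors below stand in for the real 3072-dimensional text-embedding-3-large vectors; the segment names and values are invented for illustration.

```python
# Sketch: semantic search in principle. Embed the query, rank transcript
# segments by cosine similarity. Vectors here are toy stand-ins for real
# 3072-dimensional embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

segments = {
    "pricing discussion": [0.9, 0.1, 0.0],
    "product roadmap":    [0.1, 0.8, 0.2],
    "closing remarks":    [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # stand-in embedding of "how much does it cost?"

ranked = sorted(segments, key=lambda k: cosine(query, segments[k]), reverse=True)
print(ranked[0])  # "pricing discussion" ranks first
```

Because similarity is computed on meaning-bearing vectors rather than keywords, a query like "how much does it cost?" can surface a segment that never uses the word "cost" at all.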
Benchmarks and leaderboards
Two independent leaderboards track ASR model performance. We reference both when evaluating models:
- HuggingFace Open ASR Leaderboard — Community benchmark using 9 diverse test sets (LibriSpeech, AMI, Earnings22, GigaSpeech, etc.). Reports average WER. Qwen3-ASR-Flash is #1.
- Artificial Analysis — Speech-to-Text — Independent benchmark using AA-WER v2.0 methodology (AA-AgentTalk, VoxPopuli-Cleaned-AA, Earnings22-Cleaned-AA). Includes speed and pricing comparisons. ElevenLabs Scribe v2 is #1.
Different methodologies, different rankings — both valuable. We aim to support the top models from each.
Related
- Choose Your ASR Model: One Platform, Every Top Speech-to-Text Model — why model choice matters and how the single-workflow approach works
- AI Transcription for Content Creators — subtitle and chapter workflows for YouTube, TikTok, and podcasts
- How to Import AI Subtitles into CapCut, Premiere Pro, DaVinci Resolve & Final Cut Pro — step-by-step import guide for each editor
Try it
Choose your model at transcribe.so/transcribe. Upload a file or paste a YouTube URL, pick your pipeline, and get results in minutes. All plans include every AI feature — no per-feature upsells.