Every ASR Model on Transcribe.so: Benchmarks, Pricing, and When to Use Each

Transcribe.so
Tags: ASR, speech to text, GPT-4o transcription, Qwen3-ASR-Flash, ElevenLabs Scribe, Google Gemini, Mistral Voxtral, Amazon Transcribe, WER benchmark, transcription accuracy, AI transcription

Why we support multiple ASR models

There is no single best transcription model. The right model depends on your content — how many speakers, what language, how long the recording, and whether you need word-level timestamps or speaker labels.

That's why Transcribe.so lets you choose your ASR pipeline per transcription. Today we support three world-class models, with three more coming soon. Every model feeds into the same downstream AI pipeline: topics, chapters, summaries, semantic search, Q&A with citations, and subtitle export.

Here's the full breakdown.

Currently supported models

GPT-4o Transcribe Diarize

Provider: OpenAI
Best for: Multi-speaker content where "who said what" matters

GPT-4o Transcribe Diarize is OpenAI's premium transcription model with built-in speaker identification — a capability no other single-API model matches at this quality level. If your audio has multiple speakers, this is the model to use.

| Spec | Detail |
| --- | --- |
| Speaker diarization | Yes (automatic speaker labels) |
| Languages | 57 |
| Timestamp type | Segment-level with speaker attribution |
| Max audio duration | Unlimited (chunked processing) |
| Word-level timestamps | No (segment-level) |
| Emotion detection | No |

Pricing on Transcribe.so:

| Tier | Rate |
| --- | --- |
| Free | $3.88/hr |
| Basic ($12/mo) | $3.61/hr |
| Plus ($39/mo) | $3.45/hr |
| Pro ($99/mo) | $3.18/hr |

When to choose GPT-4o Diarize:

  • Podcasts, interviews, meetings, panel discussions
  • Any content where speaker labels are essential
  • Multi-speaker audio where you need to know who said what

Qwen3-ASR-Flash

Provider: Alibaba Qwen
Best for: Maximum accuracy, word-level timestamps, long-form audio, Chinese dialects

Qwen3-ASR-Flash is ranked #1 on the HuggingFace Open ASR Leaderboard with a 4.25% average Word Error Rate (WER), roughly half the ~8.0% average listed for Whisper-large-v3 in the benchmark comparison below.

| Spec | Detail |
| --- | --- |
| Speaker diarization | No |
| Languages | 52 + 22 Chinese dialects |
| Timestamp type | Sentence + word-level (10 languages) |
| Max audio duration | 12 hours native (no chunking) |
| Word-level timestamps | Yes |
| Emotion detection | Yes |

Pricing on Transcribe.so:

| Tier | Rate |
| --- | --- |
| Free | $1.71/hr |
| Basic ($12/mo) | $1.59/hr |
| Plus ($39/mo) | $1.52/hr |
| Pro ($99/mo) | $1.40/hr |

For a deep dive on Qwen3-ASR-Flash, see the launch announcement.

When to choose Qwen3-ASR-Flash:

  • Single-speaker content (lectures, audiobooks, webinars)
  • Subtitle generation (word-level timestamps enable precise cue boundaries)
  • Long-form audio (3+ hours) — 12-hour native support means no chunking artifacts
  • Chinese dialect content (Cantonese, Sichuanese, Fujian, and 19 more)
  • When you want the lowest WER available
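
The subtitle use case above depends on word-level timestamps. As a rough illustration, the sketch below groups word timings into SRT cues. The input shape (a list of dicts with `word`, `start`, and `end` in seconds) and the `max_chars` and `max_gap` limits are assumptions made for this example; they are not the actual output format of Qwen3-ASR-Flash or of Transcribe.so's subtitle exporter.

```python
def format_srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_chars=42, max_gap=0.8):
    """Group word timings into SRT cues.

    A new cue starts when adding the next word would exceed max_chars,
    or when the pause before it is longer than max_gap seconds.
    Both limits are illustrative defaults, not recommended values.
    """
    cues, current = [], []
    for w in words:
        if current:
            text_len = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            gap = w["start"] - current[-1]["end"]
            if text_len > max_chars or gap > max_gap:
                cues.append(current)
                current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue[0]["start"])
        end = format_srt_time(cue[-1]["end"])
        text = " ".join(x["word"] for x in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical word timings, for illustration only.
sample_words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "Again", "start": 2.5, "end": 3.0},
]
print(words_to_srt(sample_words))
```

With segment-level timestamps only, cue boundaries can fall no finer than the segment edges, which is why word timings matter for this use case.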

Benchmark comparison: Open ASR Leaderboard

The HuggingFace Open ASR Leaderboard is the most widely used community benchmark for speech-to-text models. It evaluates models across 9 diverse test sets and reports average Word Error Rate (WER). Lower is better.
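
For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard word-level edit distance. It normalizes only by lowercasing and splitting on whitespace, which is a simplifying assumption; leaderboards typically apply more extensive text normalization, so this function will not reproduce published scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.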

Qwen3-ASR-Flash vs other top models

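| Dataset | Qwen3-ASR-Flash | NVIDIA Canary-1B | Whisper-large-v3 | Whisper-large-v3-turbo |
| --- | --- | --- | --- | --- |
| LibriSpeech Clean | 1.61% | ~2.5% | ~2.7% | ~3.0% |
| LibriSpeech Other | 2.88% | ~5.0% | ~5.5% | ~6.0% |
| SPGISpeech | 2.06% | ~3.5% | ~4.0% | ~4.2% |
| Tedlium | 3.20% | ~5.5% | ~4.5% | ~5.0% |
| VoxPopuli | 6.39% | ~7.0% | ~8.5% | ~9.0% |
| Common Voice 9 | 7.42% | ~9.0% | ~10.0% | ~11.0% |
| GigaSpeech | 8.88% | ~10.0% | ~11.0% | ~11.5% |
| Earnings22 | 10.68% | ~12.0% | ~14.0% | ~15.0% |
| AMI | 11.29% | ~15.0% | ~16.0% | ~17.0% |
| Average WER | 4.25% | ~7.5% | ~8.0% | ~8.5% |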

Qwen3-ASR-Flash posts the lowest WER on all nine datasets in the table above.

Artificial Analysis rankings (AA-WER v2.0)

Artificial Analysis uses a different benchmark methodology (AA-AgentTalk 50%, VoxPopuli-Cleaned-AA 25%, Earnings22-Cleaned-AA 25%) and ranks models independently.
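
To make the weighting concrete, here is a minimal sketch that treats the three percentages above as weights in a weighted average of per-dataset WER. That interpretation is an assumption, and the per-dataset values in the example are invented placeholders rather than real scores for any model.

```python
# Weights taken from the percentages quoted above.
AA_WER_WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}


def weighted_wer(per_dataset_wer: dict, weights: dict = AA_WER_WEIGHTS) -> float:
    """Combine per-dataset WER values (in percent) into one weighted score."""
    if set(per_dataset_wer) != set(weights):
        raise ValueError("per_dataset_wer must have exactly the weighted datasets")
    return sum(per_dataset_wer[name] * weight for name, weight in weights.items())


# Placeholder numbers for illustration only; not measured results.
example = {
    "AA-AgentTalk": 2.0,
    "VoxPopuli-Cleaned-AA": 3.0,
    "Earnings22-Cleaned-AA": 5.0,
}
composite = weighted_wer(example)
simple_mean = sum(example.values()) / len(example)
print(f"weighted: {composite:.2f}%  unweighted mean: {simple_mean:.2f}%")
```

The same inputs give a different headline number depending on the weighting, which is one reason scores from two leaderboards should not be compared directly.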

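| Rank | Model | Provider | AA-WER |
| --- | --- | --- | --- |
| 1 | Scribe v2 | ElevenLabs | 2.3% |
| 2 | Gemini 3 Pro | Google | 2.9% |
| 3 | Voxtral Small | Mistral | 3.0% |
| 4 | Gemini 2.5 Pro | Google | 3.1% |
| 5 | Gemini 3 Flash | Google | 3.1% |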

A note on benchmark methodology: Qwen3-ASR-Flash is not yet listed on Artificial Analysis, and the two leaderboards use different test sets and scoring. Direct WER numbers aren't comparable across leaderboards — a model scoring 4.25% on the Open ASR Leaderboard's 9-dataset average isn't necessarily "worse" than one scoring 2.3% on Artificial Analysis's 3-dataset composite. What matters is that both leaderboards identify the top-performing models, and we plan to support the best from each.

Voxtral Mini Transcribe

Provider: Mistral AI
Best for: Word-level timestamps, subtitle generation, budget-friendly transcription

Voxtral Mini Transcribe is Mistral AI's dedicated transcription model with word-level timestamps and speaker diarization across 40 languages. At $0.003/min for transcription, it's the most cost-effective option with word-level precision.

| Spec | Detail |
| --- | --- |
| Speaker diarization | Yes |
| Languages | 40 |
| Timestamp type | Sentence + word-level (all languages) |
| Context biasing | Yes (up to 100 custom terms) |
| Word-level timestamps | Yes |
| AA-WER | 3.0% (Voxtral Small) |

When to choose Voxtral Mini Transcribe:

  • Subtitle generation where every word needs precise timing
  • Budget-conscious transcription — lowest transcription cost per minute
  • Content with proper nouns or technical terms (context biasing helps accuracy)
  • Multi-speaker content requiring both diarization and word timestamps

Coming soon

We're adding three more ASR pipelines. Each will be available as an additional option in the pipeline selector, with the same downstream AI analysis (topics, chapters, search, Q&A, subtitles).

ElevenLabs Scribe v2

Provider: ElevenLabs
AA-WER: 2.3% (#1 on Artificial Analysis)

| Spec | Detail |
| --- | --- |
| Speaker diarization | Yes |
| Languages | 99 |
| Timestamps | Word-level |
| Latency | Low (optimized for real-time) |
| Notable | Highest accuracy on Artificial Analysis, supports audio events and sound detection |

Why we're adding it: Scribe v2 tops the Artificial Analysis leaderboard with the lowest WER of any model tested. Combined with 99-language support and speaker diarization, it could be the best all-around option for many use cases.

Google Gemini

Provider: Google DeepMind
AA-WER: 2.9% (Gemini 3 Pro) / 3.1% (Gemini 3 Flash)

| Spec | Detail |
| --- | --- |
| Speaker diarization | Varies by model |
| Languages | 100+ |
| Timestamps | Varies |
| Context window | Up to 1M tokens (audio native) |
| Notable | Multimodal: can process audio natively alongside text and video |

Why we're adding it: Gemini's multimodal architecture processes audio natively rather than converting to text through a separate ASR pipeline. The long context window means entire recordings can be processed in a single pass, and Google's models consistently rank in the top 5 on Artificial Analysis.

Amazon Transcribe

Provider: AWS

| Spec | Detail |
| --- | --- |
| Speaker diarization | Yes |
| Languages | 100+ |
| Custom vocabulary | Yes (domain-specific terms) |
| Custom language models | Yes |
| Notable | Enterprise-grade with HIPAA eligibility, PCI DSS compliance, custom vocabulary for domain-specific accuracy |

Why we're adding it: Amazon Transcribe is the enterprise choice. Custom vocabulary support means medical, legal, and technical content gets domain-specific accuracy improvements that general models can't match. AWS compliance certifications make it suitable for regulated industries.

Model selection guide

Current models

| Use case | Recommended | Why |
| --- | --- | --- |
| Multiple speakers (podcast, meeting, interview) | GPT-4o Diarize | Built-in speaker labels; see the podcast transcription guide for show notes best practices |
| Single speaker, maximum accuracy | Qwen3-ASR-Flash | #1 WER on Open ASR Leaderboard |
| Subtitle generation | Qwen3-ASR-Flash | Word-level timestamps for precise cue boundaries; see the subtitle export comparison |
| Chinese dialects | Qwen3-ASR-Flash | 22 dialect support |
| Long-form audio (3+ hours) | Qwen3-ASR-Flash | 12-hour native, no chunking. Longer audio also benefits from automatic chapter generation |
| Budget-conscious | Qwen3-ASR-Flash | $1.71/hr vs. $3.88/hr for GPT-4o Diarize on the Free tier |
| Meeting transcription with speaker IDs | GPT-4o Diarize | Automatic speaker identification |
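
The table above can be read as a simple decision rule. The sketch below encodes it as a function. The `Job` fields and the order of the checks are assumptions for illustration: in particular, the table does not say which model to pick when a job needs both speaker labels and word-level timestamps, so this sketch arbitrarily gives speaker labels priority.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a transcription job (not a Transcribe.so type)."""
    speaker_count: int = 1
    needs_speaker_labels: bool = False
    needs_word_timestamps: bool = False
    chinese_dialect: bool = False
    duration_hours: float = 1.0
    budget_first: bool = False


def recommend_model(job: Job) -> tuple[str, str]:
    """Return (model, reason) following the 'Current models' table."""
    # Assumption: speaker labels outrank every other requirement.
    if job.needs_speaker_labels or job.speaker_count > 1:
        return "GPT-4o Diarize", "built-in speaker labels"
    if job.chinese_dialect:
        return "Qwen3-ASR-Flash", "22 dialect support"
    if job.duration_hours >= 3:
        return "Qwen3-ASR-Flash", "12-hour native, no chunking"
    if job.needs_word_timestamps:
        return "Qwen3-ASR-Flash", "word-level timestamps"
    if job.budget_first:
        return "Qwen3-ASR-Flash", "lower hourly rate"
    return "Qwen3-ASR-Flash", "#1 WER on Open ASR Leaderboard"


print(recommend_model(Job(speaker_count=3)))
print(recommend_model(Job(duration_hours=4.5)))
```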

When upcoming models arrive

| Use case | Recommended | Why |
| --- | --- | --- |
| Best overall accuracy + diarization | ElevenLabs Scribe v2 | 2.3% WER + speaker labels + 99 languages |
| Multimodal / video+audio analysis | Google Gemini | Native audio understanding in multimodal context |
| Open-source preference | Mistral Voxtral | Best open-weight ASR (3.0% WER) |
| Enterprise / regulated industry | Amazon Transcribe | HIPAA, custom vocabulary, compliance certifications |
| Maximum language coverage | ElevenLabs Scribe v2 or Google Gemini | 99 to 100+ languages |

How pricing works

Every model on Transcribe.so follows the same pricing structure: pay-per-minute with no subscription lock-in. Subscription tiers reduce the per-minute rate.

The transcription cost varies by model, but the downstream AI pipeline (GPT-4.1 for analysis, text-embedding-3-large for semantic search, infrastructure) is shared across all models.

| Component | GPT-4o Pipeline | Qwen3 Pipeline |
| --- | --- | --- |
| Transcription API | $1.80/hr | $0.13/hr |
| LLM analysis (GPT-4.1) | $0.48/hr | $0.48/hr |
| Embeddings | $0.06/hr | $0.06/hr |
| Infrastructure | $1.00/hr | $1.00/hr |
| Provider total | $3.34/hr | $1.67/hr |

Upcoming models will have their own transcription API rates, but the shared pipeline cost stays the same.
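
The arithmetic behind these figures can be checked with a short sketch. The per-hour numbers are copied from the tables in this article; the dictionary keys, the function names, and the assumption that a job is billed as a straight per-minute share of the hourly rate are illustrative, and actual billing or rounding rules may differ.

```python
# Per-hour figures (USD) copied from the tables in this article.
TRANSCRIPTION_API = {"gpt-4o": 1.80, "qwen3": 0.13}
SHARED_PIPELINE = {"llm_analysis": 0.48, "embeddings": 0.06, "infrastructure": 1.00}
TIER_RATES = {
    "gpt-4o": {"free": 3.88, "basic": 3.61, "plus": 3.45, "pro": 3.18},
    "qwen3": {"free": 1.71, "basic": 1.59, "plus": 1.52, "pro": 1.40},
}


def provider_total_per_hour(pipeline: str) -> float:
    """Transcription API cost plus the shared downstream pipeline cost."""
    return round(TRANSCRIPTION_API[pipeline] + sum(SHARED_PIPELINE.values()), 2)


def job_cost(pipeline: str, tier: str, minutes: float) -> float:
    """Estimated price of one job, assuming linear per-minute proration."""
    return round(TIER_RATES[pipeline][tier] / 60 * minutes, 2)


for name in TRANSCRIPTION_API:
    print(name, provider_total_per_hour(name), job_cost(name, "free", 120))
```

Under these assumptions, a two-hour recording on the Free tier comes to about $3.42 with the Qwen3 pipeline and $7.76 with the GPT-4o pipeline.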

What every model gets

Regardless of which ASR model you choose, every transcription on Transcribe.so gets the same AI enrichment:

  • Topic detection and keyword extraction
  • Chapter generation with titles and summaries
  • Semantic search across your transcript library (3072-dimensional embeddings)
  • AI Q&A with citations — ask questions, get answers with exact timestamps
  • AI summary with takeaways, key quotes, and speaker profiles
  • Subtitle export — SRT, VTT, karaoke VTT, and JSON with full constraint controls

The ASR model is the first step. Everything after it is the same pipeline.
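
As a rough picture of how embedding-based semantic search can work, the sketch below ranks transcript segments by cosine similarity to a query vector. The article only states that 3072-dimensional embeddings are used; the choice of cosine similarity, the function names, and the toy 3-dimensional vectors are assumptions for illustration, not a description of Transcribe.so's implementation.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, segments, k=3):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real 3072-dimensional embeddings.
segments = [
    ("segment about pricing", [1.0, 0.0, 0.0]),
    ("segment about benchmarks", [0.0, 1.0, 0.0]),
    ("segment about both", [1.0, 1.0, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], segments, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```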

Benchmarks and leaderboards

Two independent leaderboards track ASR model performance. We reference both when evaluating models:

  • HuggingFace Open ASR Leaderboard — Community benchmark using 9 diverse test sets (LibriSpeech, AMI, Earnings22, GigaSpeech, etc.). Reports average WER. Qwen3-ASR-Flash is #1.

  • Artificial Analysis — Speech-to-Text — Independent benchmark using AA-WER v2.0 methodology (AA-AgentTalk, VoxPopuli-Cleaned-AA, Earnings22-Cleaned-AA). Includes speed and pricing comparisons. ElevenLabs Scribe v2 is #1.

Different methodologies, different rankings — both valuable. We aim to support the top models from each.

Try it

Choose your model at transcribe.so/transcribe. Upload a file or paste a YouTube URL, pick your pipeline, and get results in minutes. All plans include every AI feature — no per-feature upsells.

Ready to transcribe your own content?

No credit card required. Pay only for what you use.