
Best open-source speech-to-text models in 2026

Published on Apr 14, 2026
By Ani Ghazaryan

TL;DR: The open-source ASR landscape has shifted dramatically in the last few years. DeepSpeech is discontinued, Kaldi is legacy, and a new generation of models — NVIDIA Canary-Qwen, Qwen3-ASR, Parakeet, and Moonshine — now compete with or surpass commercial APIs on standard accuracy benchmarks. But benchmark WER and production performance are not the same thing, especially for conversational audio. This guide covers the 8 best open-source speech-to-text models in 2026, with benchmarks, architecture details, and honest deployment considerations.

What are open-source speech-to-text models?

Open-source speech-to-text (STT) models — also called automatic speech recognition (ASR) models — convert spoken audio into written text. Modern ASR systems typically use encoder-decoder transformer architectures: the encoder extracts acoustic features from raw audio, and the decoder generates text sequences from those features.

Open-source models give developers full control over deployment, fine-tuning, and data privacy. But that flexibility comes with trade-offs: infrastructure costs, optimization work, and production features like speaker diarization, PII redaction, and real-time streaming that most open-source models don't include out of the box.

The Hugging Face Open ASR Leaderboard is the standard benchmark for comparing these models, ranking them by Word Error Rate (WER) and real-time factor (RTFx) across diverse datasets. Worth noting: most leaderboard datasets are read-speech or broadcast audio — not the messy, multi-speaker conversational audio that most production voice applications actually process. We'll flag this gap for each model where it's relevant.

Quick comparison: Top open-source STT models (2026)

| Model | Developer | Parameters | Avg WER (%) | Languages | Best For |
| --- | --- | --- | --- | --- | --- |
| NVIDIA Canary-Qwen 2.5B | NVIDIA | 2.5B | 5.63 | 25 | Highest benchmark accuracy |
| IBM Granite Speech 3.3 8B | IBM | ~9B | 5.85 | English + 7 translation | Accuracy + speech translation |
| Qwen3-ASR 1.7B | Alibaba / Qwen | 1.7B | Competitive with top commercial APIs | 52 | Multilingual breadth |
| Whisper Large V3 | OpenAI | 1.55B | 7.4 | 99+ | Multilingual coverage |
| Whisper Large V3 Turbo | OpenAI | 809M | 7.75 | 99+ | Speed vs. accuracy balance |
| NVIDIA Parakeet TDT 1.1B | NVIDIA | 1.1B | ~8.0 | English (+ multilingual variants) | Maximum throughput |
| Moonshine | Useful Sensors | 27M–331M | Comparable to Whisper Large V3 | English (expanding) | Edge / on-device |
| SpeechBrain | Open community | Varies (toolkit) | Varies | Multi | Research / custom pipelines |

1. OpenAI Whisper

The most widely adopted open-source ASR model: Whisper remains the default starting point for most developers. Trained on over 5 million hours of multilingual audio data (up from 680,000 hours in the original release), Whisper uses an end-to-end encoder-decoder transformer architecture that handles transcription, translation, language identification, and timestamp prediction in a single model.

What's new: Whisper Large V3 expanded training data by 635% compared to V2, achieving a 10–20% error reduction across languages according to OpenAI's published results. Whisper Large V3 Turbo prunes the decoder from 32 layers to 4, cutting parameters from 1.55B to 809M — the result is approximately 216x real-time processing speed with only a marginal WER increase (7.75% vs 7.4%). Distil-Whisper compresses Large V3 further to 756M parameters with WER within 1% of the full model and 5–6x faster inference. OpenAI also released GPT-4o-based transcription models in early 2025 that outperform all Whisper versions on benchmarks — but these are commercial API-only, not open source.
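
For context on how little code a first Whisper run takes, here is a minimal sketch using the openai-whisper package. The audio path is a placeholder, and the model name can be swapped for the Turbo or Distil variants depending on your latency budget.

```python
# pip install -U openai-whisper
import whisper

# Load the full Large V3 checkpoint (~10GB VRAM). Recent package versions
# also accept "turbo" for the pruned 809M-parameter Large V3 Turbo variant.
model = whisper.load_model("large-v3")

# "meeting.wav" is a placeholder path; omit language= to let Whisper auto-detect.
result = model.transcribe("meeting.wav", language="en", word_timestamps=True)

print(result["text"])
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```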

Strengths:

  • Broadest language support (99+ languages)
  • Strong accuracy across accents, noise, and technical vocabulary
  • Massive ecosystem: Hugging Face integrations, community fine-tunes, deployment tools
  • MIT license

Limitations:

  • Well-documented hallucination issues on silent or low-quality audio segments — a significant problem for long-form or noisy recordings
  • No built-in speaker diarization; requires bolting on a separate model such as pyannote.audio
  • On real conversational audio (crosstalk, overlapping speakers, accents), WER degrades meaningfully beyond what leaderboard numbers suggest
  • Requires significant GPU resources at full size (Large V3 needs ~10GB VRAM)
  • Not optimized for real-time streaming out of the box

Who should use it: Developers who need broad multilingual coverage and can tolerate batch processing latency. If you need English-only with maximum speed, Parakeet or Distil-Whisper are better choices. If conversational audio quality is the priority, test carefully against your actual data before committing.

2. NVIDIA Canary-Qwen 2.5B

Currently the #1 model on the Hugging Face Open ASR Leaderboard: Released in June 2025, Canary-Qwen 2.5B uses a Speech-Augmented Language Model (SALM) architecture that pairs a FastConformer encoder optimized for speech recognition with an unmodified Qwen3-1.7B LLM decoder. It tops the Open ASR Leaderboard with a 5.63% average WER.
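
Canary models are distributed through NVIDIA NeMo. The sketch below uses NeMo's generic from_pretrained entry point; the model class and Hugging Face ID for Canary-Qwen 2.5B are assumptions based on how other Canary checkpoints are published, so confirm them against the model card before relying on this.

```python
# pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# ASSUMPTION: the model ID and the generic ASRModel entry point mirror other
# NeMo/Canary releases; some Canary variants ship a dedicated multitask class,
# so check the Hugging Face model card for the exact loading snippet.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-qwen-2.5b")

# transcribe() takes a list of audio file paths; "call.wav" is a placeholder.
print(model.transcribe(["call.wav"])[0])
```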

Key specs:

  • Architecture: FastConformer encoder + Qwen3-1.7B LLM decoder
  • Training data: 234,000 hours of English audio
  • LibriSpeech WER: 1.6% (clean) / 3.1% (other)
  • Noise tolerance: 2.41% WER at 10 dB SNR
  • License: CC-BY-4.0
  • Supported languages: 25 (via the Canary-1b-v2 family)

Strengths:

  • Highest accuracy among open-source models on standard benchmarks
  • Strong noise robustness as measured on benchmark datasets
  • Inference up to 10x faster than similarly accurate models
  • Built on the NVIDIA NeMo framework, which is well-maintained and production-grade

Limitations:

  • Higher VRAM requirements than smaller models
  • Language coverage (25) is narrower than Whisper (99+) or Qwen3-ASR (52)
  • Training data is primarily English read-speech and broadcast audio; performance on multi-speaker conversational audio has less published evidence than benchmark numbers imply
  • CC-BY-4.0 license requires attribution

Who should use it: Teams prioritizing transcription accuracy above all else, especially for English and European languages. Strong choice when you have GPU infrastructure and primarily process clean or semi-clean audio. Test against your own conversational data before assuming leaderboard WER transfers.

3. Qwen3-ASR

The newest entrant, with the broadest language coverage of any recent non-Whisper open-source ASR model. Released in January 2026 by Alibaba's Qwen team, Qwen3-ASR supports 52 languages and dialects with language identification, speech recognition, and timestamp prediction. Built on the Qwen3-Omni foundation model, it comes in two sizes: 1.7B and 0.6B parameters. As of mid-2026, its ecosystem and community tooling are still maturing relative to Whisper or NVIDIA's NeMo stack.

Key specs

  • Variants: Qwen3-ASR-1.7B (accuracy-optimized) and Qwen3-ASR-0.6B (efficiency-optimized)
  • Languages: 52 languages and dialects
  • Throughput: the 0.6B variant achieves 2,000x throughput at a concurrency of 128
  • Additional models: Qwen3-ForcedAligner-0.6B for text-speech alignment in 11 languages
  • License: Apache 2.0

Strengths

  • Broadest language and dialect support among non-Whisper models
  • The 1.7B variant is competitive with top commercial APIs on standard benchmarks
  • Apache 2.0 license — fully permissive for commercial use
  • Includes forced alignment capabilities, rare in open-source ASR

Limitations

  • Still relatively new — fewer production case studies and community deployment guides than Whisper
  • Performance on conversational and noisy audio is still emerging; benchmark results are primarily on clean datasets
  • Requires validation against your specific audio type before production use

Who should use it: Teams building multilingual products who need broader language coverage than NVIDIA models offer, and who have the engineering capacity to validate the model against real data. The 0.6B variant is compelling for high-throughput batch processing.

4. NVIDIA Parakeet TDT

The speed champion — designed for maximum throughput: NVIDIA's Parakeet TDT models prioritize inference speed for real-time and high-volume transcription. The 1.1B parameter variant achieves an RTFx above 2,000 — among the fastest models on the Open ASR Leaderboard. The Parakeet-TDT-0.6B-v3 extends this to multilingual support across 25 European languages.
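
A minimal batch-transcription sketch through NeMo follows; the file paths are placeholders, and passing many files per call is what lets Parakeet approach its advertised throughput.

```python
# pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# The 1.1B English Parakeet TDT checkpoint; verify the ID on the model card.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

# Throughput comes from batching: hand transcribe() a list of files rather
# than looping one file at a time. Paths below are placeholders.
files = ["clip_001.wav", "clip_002.wav", "clip_003.wav"]
for path, hyp in zip(files, model.transcribe(files, batch_size=16)):
    # Depending on the NeMo version, transcribe() yields plain strings or
    # Hypothesis objects with a .text attribute.
    print(path, "->", getattr(hyp, "text", hyp))
```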

Key specs

  • Variants: 1.1B (English) and 0.6B-v3 (25 European languages)
  • RTFx: >2,000 (1.1B variant)
  • WER: ~8.0% average on Open ASR benchmarks
  • VRAM: ~4GB (1.1B variant)
  • License: CC-BY-4.0
  • Training data: Built on NVIDIA's Granary dataset (~1M hours, 25 European languages)

Strengths

  • Among the fastest open-source ASR models available
  • Low VRAM footprint for its accuracy class
  • Multilingual v3 variant covers major European languages
  • Part of the well-supported NVIDIA NeMo ecosystem

Limitations

  • English-only in the highest-speed variant
  • WER is higher than Canary-Qwen or Granite Speech
  • CC-BY-4.0 license requires attribution
  • Like most open-source models, lacks built-in diarization — critical for call center and meeting use cases

Who should use it: Teams processing large volumes of audio where speed and cost-per-hour matter more than squeezing the last percentage point of accuracy. Well-suited for subtitle generation and high-throughput batch pipelines on single-speaker or clean audio.

5. IBM Granite Speech 3.3

A speech-language model that combines ASR with translation capabilities: IBM's Granite Speech 3.3 is built by modality-aligning and LoRA fine-tuning the Granite 3.3 Instruct model for speech. The 8B variant ranks near the top of the Open ASR Leaderboard at 5.85% average WER, while also supporting speech-to-text translation across 8 languages.

Key specs

  • Variants: 8B (accuracy) and 2B (efficiency); Granite 4.0 1B Speech also available
  • WER: 5.85% average (8B variant)
  • Languages: English ASR + translation to French, Spanish, Italian, German, Portuguese, Japanese, Mandarin
  • License: Apache 2.0

Strengths

  • Top-tier accuracy, second only to Canary-Qwen on the Open ASR Leaderboard
  • Integrated speech translation — not just transcription
  • Apache 2.0 — fully open for commercial use
  • Granite 4.0 1B Speech variant for resource-constrained deployments

Limitations

  • Large model size (8B) demands significant GPU resources
  • Primary strength is English; multilingual ASR coverage is limited compared to Whisper or Qwen3
  • Speech-language model architecture adds complexity compared to purpose-built ASR models

Who should use it: Teams that need both high-accuracy English transcription and multilingual translation in a single model, and have the GPU budget for an 8B+ parameter model.

6. Wav2Vec 2.0 / XLS-R

The self-supervised learning pioneer, still relevant for low-resource languages. Meta's Wav2Vec 2.0 introduced self-supervised pre-training to ASR — learning speech representations from unlabeled audio, then fine-tuning with minimal labeled data. The XLS-R variant scales to 128 languages, pre-trained on 436,000 hours of unlabeled speech in models of up to 2 billion parameters.
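
As a reference point, here is a minimal inference sketch with an already fine-tuned Wav2Vec 2.0 checkpoint via Hugging Face transformers; for the low-resource scenario described above you would instead start from an XLS-R base model and fine-tune it with CTC on your own labeled data. The audio path is a placeholder.

```python
# pip install -U transformers torchaudio torch
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# A widely used English fine-tune; swap in your own XLS-R fine-tune for
# low-resource languages.
model_id = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Wav2Vec 2.0 expects 16 kHz mono input; "sample.wav" is a placeholder path.
waveform, sample_rate = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000).mean(dim=0)

inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token at each frame, then collapse.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```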

Key specs

  • Architecture: Self-supervised contrastive learning + transformer encoder
  • XLS-R coverage: 128 languages, 436,000 hours of pre-training data
  • Parameters: 95M (base) to 2B (XLS-R)
  • Key advantage: Requires approximately 100x less labeled training data than fully supervised methods
  • Downloads: 1.37M+ on Hugging Face

Strengths

  • Best option for low-resource and underrepresented languages where labeled data is scarce
  • Proven in specialized domains: medical ASR, accent-specific models, language preservation
  • Extensive research backing and well-understood architecture

Limitations

  • Raw WER on standard benchmarks is significantly higher than newer models (37% clean / 55% noisy on LibriSpeech without fine-tuning)
  • Requires fine-tuning for production use — not a plug-and-play solution
  • No built-in language model or decoder
  • Architecture is showing its age compared to 2025–2026 releases

Who should use it: Researchers and teams working with low-resource languages, specialized domains, or scenarios where labeled training data is extremely limited. Not recommended as a general-purpose ASR model in 2026 when newer alternatives exist.

7. Moonshine

Purpose-built for edge and on-device deployment: Moonshine by Useful Sensors is designed for real-time speech recognition on resource-constrained hardware — from Raspberry Pis to mobile devices. The smallest model is just 27MB, making it deployable where Whisper and NVIDIA models can't run. Moonshine v2 introduces an Ergodic Streaming Encoder specifically designed for latency-critical applications.
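
As a quick illustration, the sketch below runs a Moonshine checkpoint through the Hugging Face transformers pipeline on the host machine; the UsefulSensors/moonshine-tiny model ID is an assumption, and the official repository also ships its own lightweight runtimes for Raspberry Pi-class and mobile hardware.

```python
# pip install -U transformers torch
from transformers import pipeline

# ASSUMPTION: model ID based on how Moonshine checkpoints appear on
# Hugging Face; verify against the Useful Sensors organization page.
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")

# "command.wav" is a placeholder; Moonshine targets short, latency-critical clips.
print(asr("command.wav")["text"])
```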

Key specs

  • Variants: Tiny (27MB), Base, and larger variants
  • Training data: 200,000 hours of audio
  • Platforms: Python, iOS, Android, macOS, Linux, Windows, Raspberry Pi, IoT, wearables
  • Languages: English (community variants for Korean and Vietnamese are emerging)

Strengths

  • Runs on-device without cloud connectivity — full privacy
  • Tiny footprint: deployable on IoT devices and wearables
  • No API keys, accounts, or cloud dependencies required
  • Larger variants claim accuracy comparable to Whisper Large V3

Limitations

  • Primarily English — multilingual support is community-driven and limited
  • Smaller models trade accuracy for size
  • Smaller community and ecosystem compared to Whisper or NeMo
  • Less battle-tested in large-scale production

Who should use it: Developers building voice features for edge devices, IoT products, wearables, or any application where on-device processing and privacy are non-negotiable.

8. SpeechBrain

Not a model — a full speech AI toolkit: SpeechBrain is an open-source PyTorch-based toolkit for building speech processing pipelines. Rather than providing a single model, it offers training recipes, pre-trained models, and modular components for ASR, speaker diarization, speech enhancement, text-to-speech, and more.
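
To give a feel for the toolkit's pretrained-model interface, here is a minimal sketch that loads one of SpeechBrain's published LibriSpeech recipes and transcribes a file; the import path moved from speechbrain.pretrained to speechbrain.inference around v1.0, so adjust it to your installed version. The audio path is a placeholder.

```python
# pip install -U speechbrain
from speechbrain.inference.ASR import EncoderDecoderASR  # speechbrain.pretrained on <1.0

# A LibriSpeech-trained recipe published under SpeechBrain's Hugging Face org;
# savedir controls where the checkpoint is cached locally.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)

print(asr_model.transcribe_file("interview.wav"))  # placeholder path
```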

What's new in v1.1.0 (March 2026): Version 1.1.0 introduced refactored transformer support compatible with any Hugging Face model, including LLMs; new data augmentation techniques (CodecAugment, ChannelDrop, CutCat); native fp16/bf16 mixed precision and multi-GPU training via torchrun; and a lazy-loading module architecture for faster imports. The project now includes 200+ training recipes and 100+ pre-trained models on Hugging Face.

Key specs

  • Community: 30+ universities contributing
  • Supports fine-tuning: Whisper, Wav2Vec 2.0, and LLMs (GPT-2, Llama)
  • License: Apache 2.0

Strengths

  • Maximum flexibility for building custom ASR pipelines
  • Combines ASR with diarization, speech enhancement, and TTS in one framework
  • Active community with regular releases
  • Excellent for research and experimentation

Limitations

  • Not a drop-in solution — requires ML expertise to configure and train
  • Quality varies across community-contributed models and recipes
  • Steeper learning curve than using a pre-trained model directly

Who should use it: Research teams and ML engineers building custom speech processing pipelines who need fine-grained control over every component. The right choice for combining multiple speech tasks (ASR + diarization + enhancement) in a single framework.

How to choose the right open-source STT model

| If you're prioritizing… | Start here |
| --- | --- |
| Highest benchmark accuracy | NVIDIA Canary-Qwen 2.5B (5.63% WER) or IBM Granite Speech 3.3 8B (5.85% WER) |
| Multilingual breadth | Whisper Large V3 (99+ languages) or Qwen3-ASR (52 languages) |
| Maximum throughput / speed | NVIDIA Parakeet TDT (RTFx >2,000) or Whisper Large V3 Turbo |
| Edge / on-device deployment | Moonshine (27MB minimum, runs on Raspberry Pi) |
| Low-resource language fine-tuning | Wav2Vec 2.0 / XLS-R |
| Custom pipeline with full control | SpeechBrain |
| ASR + speech translation combined | IBM Granite Speech 3.3 |

One important caveat on all of the above: these recommendations are based on standard benchmark performance. If your use case involves conversational audio — multiple speakers, overlapping speech, background noise, code-switching between languages — you should run your own evaluation on real audio before making a final decision. Benchmark-to-production gaps are largest in exactly these scenarios.
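
A lightweight way to run that evaluation is to transcribe a sample of your own recordings, have the reference transcripts human-corrected, and score them with a WER library. The sketch below uses the jiwer package; the strings are placeholders for your own data.

```python
# pip install -U jiwer
import string
import jiwer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences don't inflate WER.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

# Human-corrected references vs. model output for the same clips (placeholders).
references = [normalize(t) for t in [
    "Thanks for calling, how can I help you today?",
    "We need the report by Friday, end of day.",
]]
hypotheses = [normalize(t) for t in [
    "thanks for calling how can help you today",
    "we need a report by friday and of day",
]]

print(f"Corpus WER: {jiwer.wer(references, hypotheses):.2%}")
```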

The hidden costs of self-hosting open-source ASR

Open-source models are free to download. Production deployment is not.

GPU infrastructure is the most visible cost. Large models like Canary-Qwen and Granite Speech 8B require significant VRAM. Whisper Large V3 alone needs ~10GB VRAM per instance. At $2–4/hour for a capable GPU instance, running a modest transcription service at scale adds up quickly — before you've written a line of application code.
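
To make that concrete, here is a rough back-of-the-envelope sketch; every figure is an illustrative assumption, not a quote, so substitute your own GPU pricing, throughput, and audio volume.

```python
# Compute-only cost sketch for self-hosted ASR. All numbers are assumptions.
gpu_cost_per_hour = 3.00         # USD, mid-range of the $2-4/hour figure above
audio_hours_per_month = 10_000   # monthly transcription volume
effective_rtfx = 20              # real-world speedup after batching/queueing overhead

gpu_hours_needed = audio_hours_per_month / effective_rtfx
print(f"GPU hours/month: {gpu_hours_needed:,.0f}")
print(f"Compute cost/month: ${gpu_hours_needed * gpu_cost_per_hour:,.0f}")
# This excludes always-on capacity for latency guarantees, redundancy,
# storage, and the engineering time discussed below.
```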

Missing production features are the hidden ones. Most open-source ASR models ship without speaker diarization, PII redaction, real-time streaming, punctuation restoration, or compliance infrastructure. Building diarization alone — integrating pyannote.audio, aligning speaker segments with transcript timestamps, handling speaker overlaps — typically takes weeks of engineering. For context, Gladia's guide to speaker diarization walks through exactly what that implementation involves. PII redaction and compliance infrastructure (GDPR, HIPAA, SOC 2) add further scope.
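
For a sense of what that bolting-on involves, here is a minimal sketch that runs pyannote.audio alongside Whisper and assigns each transcript segment the speaker whose diarized turn overlaps it most. Real pipelines need far more careful overlap and crosstalk handling; the Hugging Face token and audio path are placeholders.

```python
# pip install -U openai-whisper pyannote.audio
import whisper
from pyannote.audio import Pipeline

AUDIO = "call.wav"  # placeholder path

# Whisper provides timed segments; pyannote provides speaker turns.
transcript = whisper.load_model("large-v3").transcribe(AUDIO)
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder: the model is gated behind a HF token
)(AUDIO)

def speaker_for(start: float, end: float) -> str:
    # Pick the speaker whose diarized turn overlaps this segment the most.
    best, best_overlap = "unknown", 0.0
    for turn, _, label in diarization.itertracks(yield_label=True):
        overlap = min(end, turn.end) - max(start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best

for seg in transcript["segments"]:
    print(f"{speaker_for(seg['start'], seg['end'])}: {seg['text'].strip()}")
```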

Optimization engineering is the gap between "runs on my laptop" and "handles production load." Getting from a working local inference setup to a service that handles concurrent requests with stable latency requires load balancing, request batching, inference optimization (quantization, TensorRT export, ONNX), and autoscaling. For most teams, this is measured in engineer-months, not days.

Real-world accuracy gaps are often the final surprise. Leaderboard WER is measured on clean datasets — read speech, broadcast audio, controlled recording conditions. Real conversational audio with crosstalk, background noise, accents, and mid-sentence language switches performs significantly worse. For a detailed breakdown of how benchmark and real-world performance diverge, see Gladia's analysis of ASR benchmarks vs. production accuracy.

Maintenance overhead is the ongoing cost. Models get updated, dependencies break, GPU driver versions conflict. In a fast-moving field — where the #1 leaderboard model changed multiple times between 2024 and mid-2026 — keeping a self-hosted stack current is a real engineering commitment.

For teams whose core product is voice infrastructure, this overhead is worth it. For teams building products that use voice as a feature, the total cost of ownership of a self-hosted stack often exceeds the cost of a managed API — especially once compliance and diarization requirements enter the picture.

When a managed API makes more sense

Self-hosting is the right call when you need full pipeline control, have specific fine-tuning requirements, must keep audio on-premise, or are building at a scale where per-minute API costs exceed infrastructure costs.

For most product teams, those conditions aren't met in the early stages. A managed API removes infrastructure and compliance overhead so engineering time can go toward the actual product. The relevant question isn't "managed vs. self-hosted" in the abstract — it's whether your team's time is better spent on ASR infrastructure or on the product that uses ASR.

If you're evaluating managed options, the factors that matter most are performance on real conversational audio (not just WER on clean benchmarks), built-in diarization quality, code-switching support, and compliance infrastructure. Those are exactly the tradeoffs Gladia's Speech-to-Text API was built around.

For teams building meeting assistants, contact center platforms, or voice agents, the free tier includes 10 hours of processing with all features enabled — enough to run your actual production audio through the API and measure what matters before committing any engineering sprints.

Frequently asked questions

What is the most accurate open-source speech-to-text model in 2026?

NVIDIA Canary-Qwen 2.5B currently holds the #1 position on the Hugging Face Open ASR Leaderboard with a 5.63% average WER. IBM Granite Speech 3.3 8B is a close second at 5.85% WER. Both numbers are measured on standard benchmark datasets; performance on noisy conversational audio varies.

Is OpenAI Whisper still the best open-source ASR model?

Whisper remains the most widely used open-source ASR model and the best choice for multilingual transcription (99+ languages). It is no longer the most accurate on benchmarks — NVIDIA Canary-Qwen, IBM Granite Speech, and Qwen3-ASR all achieve lower WER on English. Whisper's continued dominance comes from its ecosystem, language coverage, MIT license, and the sheer volume of community tooling built around it.

What happened to Mozilla DeepSpeech?

Mozilla formally discontinued DeepSpeech in June 2025 and archived the GitHub repository. The last meaningful release was in 2020. Teams still running DeepSpeech in production should plan migration to Whisper, Moonshine, or NVIDIA Parakeet depending on their use case.

Is Kaldi still relevant in 2026?

Kaldi remains in production at some large organizations, and Kaldi 2.0 (k2) continues development combining neural end-to-end models with WFST decoding. For new projects, modern end-to-end models offer better accuracy with significantly less setup complexity. Kaldi is best suited for teams with existing infrastructure or very specific domain adaptation requirements.

Which open-source STT model is best for real-time transcription?

NVIDIA Parakeet TDT 1.1B offers the highest throughput at RTFx above 2,000. For edge and on-device real-time processing, Moonshine v2 with its Ergodic Streaming Encoder is designed specifically for latency-critical applications. Whisper Large V3 Turbo provides strong real-time performance at 216x real-time speed with broader language coverage.

Can open-source STT models do speaker diarization?

Most open-source ASR models — including Whisper, Canary-Qwen, and Parakeet — do not include built-in speaker diarization. Integrating diarization requires adding a separate model (such as pyannote.audio), aligning speaker segments with transcript output, and handling edge cases like speaker overlap. SpeechBrain provides a framework for building this pipeline. For production use cases where diarization quality matters (contact center, meeting intelligence), the integration complexity is worth factoring into your build-vs-buy calculation.

What is the best open-source model for non-English languages?

For sheer language count, Whisper Large V3 supports 99+ languages. Qwen3-ASR covers 52 languages and dialects with competitive accuracy. For fine-tuning on low-resource languages with limited labeled data, Wav2Vec 2.0 / XLS-R (128 languages) remains the strongest foundation model for custom fine-tuning.
