Best open-source speech-to-text models

Published on Apr 2024

Automatic speech recognition (ASR), also known as speech-to-text (STT), has been around for decades, but advances over the last two decades in both hardware and software, especially in artificial intelligence, have made the technology more robust and accessible than ever before.

The advent of open-source STT models has significantly democratized access to advanced ASR capabilities. Today, these models offer customizable and cost-effective solutions for integrating speech recognition into applications.

Developers can benefit greatly from this technology, tailoring it to specific use cases without the constraints of proprietary software licenses, and can even contribute to its evolution. Among the many alternatives available for building speech-powered apps, which is the right one for your needs?

In this article, we cover the most advanced open-source ASR models available, including Whisper ASR, DeepSpeech, Kaldi, Wav2vec, and SpeechBrain, highlighting their key strengths and technical requirements.

What are speech-to-text models?

Modern ASR can reliably transcribe spoken words into digital text, allowing for easier analysis, storage, and manipulation of audio data in a wide range of applications across industries such as telecommunications, healthcare, education, customer service, and entertainment.

Most leading ASR models today are built around an encoder-decoder architecture. The encoder extracts auditory features from the input, and the decoder reconstructs the features into a natural language sequence. To learn more, head to this deep-dive into how modern and legacy ASR models work.

Leveraging this architecture, these models enable near-human-level transcription of audio and video recordings such as interviews, meetings, and lectures, even in real time. They also facilitate the conversion of voice queries or commands into actionable data, enhancing user experience and efficiency in customer service applications.

More broadly, ASR systems are instrumental in developing voice-controlled applications, virtual assistants, and smart devices, enabling hands-free interaction via voice-driven commands. 

Best speech-to-text open source models

In selecting the best open-source speech-to-text models for enterprise use, we looked for models that are accurate yet performant (i.e. that function well in real-life scenarios), as well as development toolkits that are flexible, customizable, and easy to integrate. We also looked for good community support and active development, to avoid “dead” projects.

1. Whisper ASR

Whisper is an open-source speech recognition system from OpenAI, trained on a large and diverse dataset of 680,000 hours of multilingual and multitask supervised data collected from the web. Whisper can transcribe speech in English and in many other languages, and can also translate directly from non-English languages into English.

Whisper uses an end-to-end approach based on an encoder-decoder transformer architecture. Audio is split into 30-second chunks, each of which is converted into a log-Mel spectrogram and passed to an encoder; a decoder then predicts the corresponding text. This text is intermixed with special tokens that direct the model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and translation to English. We have covered how Whisper works in more detail here, and have replied to some FAQs about Whisper here.
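To make the 30-second chunking step concrete, here is a minimal pure-Python sketch. It is an illustration only, not Whisper's own preprocessing code: the helper name `split_into_chunks` is hypothetical, the 16 kHz sample rate is an assumption about the input audio, and the log-Mel spectrogram computation that follows chunking is left out.

```python
SAMPLE_RATE = 16_000   # Hz; assumed sample rate of the input audio
CHUNK_SECONDS = 30     # window length described in the text
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Split audio samples into fixed-size chunks, zero-padding the last one.

    Hypothetical helper for illustration; not part of the Whisper codebase.
    """
    chunks = []
    for start in range(0, len(samples), chunk_samples):
        chunk = list(samples[start:start + chunk_samples])
        # Pad the final chunk with silence so every chunk has the same length.
        chunk.extend([0.0] * (chunk_samples - len(chunk)))
        chunks.append(chunk)
    return chunks


# 70 seconds of silent placeholder audio is split into three 30-second chunks.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[-1]))
```

Padding the last chunk keeps every input to the encoder the same size, which is why a fixed window length is convenient for this kind of architecture.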

Whisper ASR architecture diagram. Credits: OpenAI

Widely considered the best open-source ASR available, Whisper has several strengths that make it a robust and useful speech recognition system.

First, its out-of-the-box accuracy is among the best. It can handle various accents, background noise, and technical language, thanks to its large and diverse training data. It can also perform multiple tasks with a single model, such as transcribing and translating speech. This reduces the need for the separate models and pipelines that most other systems require, for example if you want to transcribe speech in French and translate it into English in real time. Moreover, Whisper can achieve high accuracy and performance on different speech domains and languages, even without additional fine-tuning.
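One way to picture how a single model switches between tasks is to look at the special tokens placed at the start of the decoder's output. The sketch below builds that prefix as plain strings. The token spellings follow those published with Whisper, but the helper `build_task_prefix` is hypothetical, and a real pipeline would work with token IDs from the model's tokenizer rather than strings.

```python
def build_task_prefix(language, task="transcribe", timestamps=False):
    """Build the special-token prefix that tells the decoder what to do.

    Hypothetical helper for illustration; it only assembles strings.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to emit text only, without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


# Same French audio, two tasks: only the prefix changes, not the model.
print(build_task_prefix("fr", task="transcribe"))
print(build_task_prefix("fr", task="translate"))
```

Because the task is selected by the prefix, going from French transcription to French-to-English translation does not require a second model or a separate translation stage.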

On the downside, Whisper’s “vanilla” version as provided by OpenAI was intended as a research tool, and comes with some limitations that make it unsuitable for most projects requiring enterprise scale and versatility. The model comes with input limitations, doesn't include essential features like speaker diarization and word-level timestamps, and tends to hallucinate in a way that makes its output unsuitable for high-precision CRM enrichment and LLM-powered tasks. For more info on Whisper strengths and limitations, see this post on how we optimized Whisper for enterprises.

2. DeepSpeech

DeepSpeech is an open-source speech recognition system developed by Mozilla in 2017 and based on Baidu's algorithm of the same name.

DeepSpeech uses a deep neural network to convert audio into text, and an N-gram language model to improve the accuracy and fluency of the transcription. The two modules are trained on independent data and work together like a transcriber coupled to a spelling and grammar checker. DeepSpeech can be used for both training and inference, and it supports multiple languages and platforms. Beyond being multilingual, DeepSpeech has the advantage of being quite flexible and, in particular, retrainable.
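The division of labor between an acoustic model and an N-gram language model can be illustrated with a toy rescoring example. This is a sketch under stated assumptions, not DeepSpeech's actual decoder: the acoustic scores are made-up placeholder values, the tiny bigram model uses add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams in a small text corpus (toy language model)."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(sentence.split())
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)


def lm_log_prob(sentence, lm):
    """Log-probability of a sentence under the bigram model, add-one smoothed."""
    contexts, bigrams, vocab_size = lm
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )


def rescore(candidates, lm, lm_weight=0.5):
    """Pick the candidate with the best combined acoustic + language score."""
    return max(candidates, key=lambda c: c[1] + lm_weight * lm_log_prob(c[0], lm))[0]


lm = train_bigram_lm([
    "recognize speech",
    "speech recognition is fun",
    "we recognize speech daily",
])
# (text, acoustic log-probability); the scores are invented for illustration.
candidates = [("wreck a nice beach", -4.0), ("recognize speech", -4.2)]
print(rescore(candidates, lm, lm_weight=0.0))  # acoustic model alone
print(rescore(candidates, lm, lm_weight=0.5))  # acoustic model + language model
```

With the language model switched off, the slightly higher acoustic score wins; with it switched on, the candidate that looks like plausible text in the training corpus is preferred, which is the "grammar checker" role described above.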

That said, DeepSpeech comes with serious practical limitations compared to later state-of-the-art systems like Whisper. As discussed in Mozilla’s forums on DeepSpeech, its recordings are limited to just 10 seconds, which restricts its use to applications such as command processing and rules out long transcriptions.

This limit also affects the text corpus, which ends up quite small at around 14 words (~100 characters) per sentence. Developers have reported needing to split sentences and remove common words and sub-sentences to accelerate training. As of April 2024, there’s a move to extend the audio recording limit to 20 seconds, but even this still seems far from what the state of the art offers.

DeepSpeech. Credits: Klu

3. Wav2vec

Wav2vec, from Meta, is a speech recognition toolkit specialized in training with unlabeled data. Its aim is to cover as much of the language space as possible, including languages that are poorly represented in the annotated datasets usually employed for supervised training.

The motivation behind Wav2vec is that ASR technology is only available for a small fraction of the thousands of languages and dialects spoken around the world. Traditional systems need to be trained on large amounts of speech audio annotated with transcriptions, which is impossible to obtain in sufficient quantity for every possible form of speech.

To achieve its goal, Wav2vec is built around a self-supervised model trained to predict tiny (25 ms) units of masked audio as tokens. This is akin to how large language models are trained to predict short syllable-like tokens, except that here the targets are units that correspond to individual sounds. Since the set of possible individual sounds is much smaller than the set of syllables, the model can focus on the building blocks of languages and “understand” more of them with a single processing core.
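The masking idea can be sketched in a few lines. In the toy example below, each list element stands for one short unit of audio, and the function marks spans of consecutive units as hidden so that a model would have to predict them from the surrounding context. The default values for `start_prob` and `span_length` are borrowed from the wav2vec 2.0 paper listed in the references, but the sampling scheme is a simplification and the helper name is hypothetical.

```python
import random


def sample_span_mask(num_frames, start_prob=0.065, span_length=10, seed=0):
    """Return a list of booleans marking which frames are masked.

    Simplified illustration: each frame independently starts a masked span
    with probability `start_prob`; spans may overlap.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for i in range(start, min(start + span_length, num_frames)):
                mask[i] = True
    return mask


mask = sample_span_mask(num_frames=200)
print(sum(mask), "of", len(mask), "frames masked")
# Visualize the mask: '#' is a hidden frame, '.' is a visible one.
print("".join("#" if m else "." for m in mask))
```

During pretraining, only the masked positions contribute to the prediction objective, so no transcriptions are needed at this stage.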

According to Meta’s AI team, this unsupervised pretraining on audio transfers well across languages. Then, for the final step of linking audio processing to actual text, Wav2vec models need to undergo fine-tuning with labeled data. At this stage, however, they require roughly two orders of magnitude fewer audio-transcription pairs.

Wav2vec. Credits: Facebook AI

Reportedly, ASR systems trained in this way could outperform the best semi-supervised methods as of 2020, even with 100 times less labeled training data. While a comparison against more recent models would be desirable, this is still impressive and might find applications, especially as an open-source solution for processing audio from underrepresented languages.

In practice, you can either train Wav2vec models with custom labeled or unlabeled data, or simply use the pre-made models, which already cover around 40 languages.

4. Kaldi

Kaldi is a toolkit for speech recognition written in C++, born out of the idea of having modern and flexible code that is easy to modify and extend. Importantly, the Kaldi toolkit attempts to provide its algorithms in the most generic and modular form possible, to maximize flexibility and reusability (even to other AI-based code outside Kaldi's own scope).

Kaldi. Credits: Nvidia

Kaldi is not exactly an out-of-the-box ASR system. Rather, it helps developers build speech recognition systems that work from widely available databases such as those provided by the Linguistic Data Consortium (LDC). Kaldi-based ASR programs can be built to run on regular computers, on Android devices, and even inside web browsers via WebAssembly. The latter is probably somewhat limited, yet interesting because it could pave the way for fully cross-device compatible ASR systems built into web clients that don’t require any server-side processing at all.

5. SpeechBrain

SpeechBrain is an “all-in-one” speech toolkit. This means it’s not just doing ASR, but the whole set of tasks related to conversational AI: speech recognition, speech synthesis, large language models, and other elements required for natural speech-based interaction with a computer or a chatbot.

Python and PyTorch are common in the open-source ASR ecosystem (Whisper itself was trained with PyTorch, for example), but SpeechBrain was designed from the outset as an open-source PyTorch toolkit aimed at making conversational AI easier to develop.

SpeechBrain. Credits: SpeechBrain

Unlike most alternatives which, albeit open source, are mainly fostered by the private sector, SpeechBrain comes from a strong academic background, with contributors from over 30 universities worldwide, and has a large supporting community. This community has shared over 200 competitive training recipes on more than 40 datasets, supporting 20 speech and text processing tasks. Over 100 pre-trained models hosted on Hugging Face can be easily plugged in and used as-is or fine-tuned.

Importantly, SpeechBrain supports both training from scratch and fine-tuning pretrained models, such as OpenAI’s Whisper ASR model and GPT-2 large language model, or Meta’s Wav2vec ASR model and Llama 2 large language model.

A downside of community contribution without much control is that the quality of many models might be questionable; therefore, extensive testing might be needed to ensure safe and scalable usage in enterprise environments. 

Open source speech-to-text: practical considerations

While open-source ASR models offer unparalleled flexibility and accessibility, deploying them comes with practical considerations that developers and organizations must carefully evaluate.

One significant factor to consider is the deployment cost, which encompasses various aspects such as hardware requirements, the need for AI expertise, and scaling limitations. Unlike proprietary solutions that may come with dedicated support and optimization, open-source models often require substantial computational resources for training and inference. In addition, some level of AI expertise is usually required to adapt a one-size-fits-all open-source model to one's specific use case and needs.

Another important issue is that most, if not all, open-source models come with a limited feature set and require further optimization work on or around their core architecture to make them suitable for enterprise environments. For companies in search of a hassle-free experience, open source is hardly the route to take.

That's where specialized speech-to-text APIs come in handy. They come as batteries-included packages, with a range of pre-built features yet enough room for customization. They let you forget about overheads, computing infrastructure, certifications, and various hidden costs, while giving you direct access to expert advice.

In short, the total cost of ownership of open-source models shouldn't be underestimated, and hybrid architectures powering APIs may actually give you a better ROI when it comes to embedding ASR into your apps. You can learn more about the pros and cons of open-source vs. speech-to-text APIs here.

Final remarks

Here we have reviewed the best open-source STT models for enterprise use, selected for accuracy and performance, for flexibility, customizability, and ease of integration into software development pipelines, and for good community support.

Among the main solutions, we have highlighted systems that are ready to use out of the box, others that need tailored training, and toolkits that do not offer complete STT packages but rather smaller pieces of the audio processing and transcription engines, giving you low-level access but probably less of a straightforward solution. Hopefully, this overview has given you a better idea of which model makes the most sense for your specific needs.

Additional references

Deep Speech: Scaling up end-to-end speech recognition

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Article written in collaboration with Luciano Abriata, PhD.

About Gladia

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.

