Automatic speech recognition (ASR), also known as speech-to-text (STT), has been around for decades, but advances over the last two decades in both hardware and software, especially for artificial intelligence, have made the technology more robust and accessible than ever before.
The advent of open-source STT models has significantly democratized access to advanced ASR capabilities. Today, these models offer customizable and cost-effective solutions for integrating speech recognition into applications.
Developers can now benefit from this technology, tailoring it to specific use cases without the constraints of proprietary software licenses – and even contribute to its evolution. But among the many alternatives available for building speech-powered apps, which is the right one for your needs?
In this article, we cover the most advanced open-source ASR models available, including Whisper, DeepSpeech, Wav2vec, Kaldi, and SpeechBrain, highlighting their key strengths and technical requirements.
What are speech-to-text models?
Modern ASR can reliably transcribe spoken words into digital text, allowing for easier analysis, storage, and manipulation of audio data across a wide range of industries, including telecommunications, healthcare, education, customer service, and entertainment.
Most leading ASR models today are built around an encoder-decoder architecture. The encoder extracts auditory features from the input, and the decoder reconstructs the features into a natural language sequence. To learn more, head to this deep-dive into how modern and legacy ASR models work.
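As a quick illustration of this architecture in practice, here is a minimal sketch that loads an encoder-decoder ASR model through the Hugging Face transformers pipeline; the model checkpoint and audio file name are only examples.

```python
from transformers import pipeline

# Load an encoder-decoder ASR checkpoint (openai/whisper-small is one example)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# The encoder turns the waveform into acoustic features;
# the decoder turns those features into a text sequence.
result = asr("meeting_recording.wav")
print(result["text"])
```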
Leveraging this architecture, these models enable near-human-level transcription of audio and video recordings such as interviews, meetings, and lectures, even in real time. They also facilitate the conversion of voice queries or commands into actionable data, enhancing user experience and efficiency in customer service applications.
More broadly, ASR systems are instrumental in developing voice-controlled applications, virtual assistants, and smart devices, enabling hands-free interaction via voice-driven commands.
Best speech-to-text open source models
In selecting the best open-source speech-to-text models for enterprise use, we looked for models that are both accurate and performant (i.e. that function well in real-life scenarios), as well as development toolkits offering high flexibility, customizability, and ease of integration. It was also important to see good community support and active development, avoiding “dead” projects.
1. Whisper ASR
Whisper is an open-source speech recognition system from OpenAI, trained on a large and diverse dataset of 680,000 hours of multilingual and multitask supervised data collected from the web. Whisper can transcribe speech in English and in several other languages, and can also directly translate from several non-English languages into English.
Whisper uses an end-to-end approach based on an encoder-decoder transformer architecture, splitting audio into 30-second chunks that are converted into a log-Mel spectrogram and then passed into an encoder from which a decoder predicts the corresponding text. This text is actually intermixed with special tokens that direct the model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and translation to English. We have covered how Whisper works in more detail here, and have replied to some FAQs about Whisper here.
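For a sense of how these pieces fit together, here is a short sketch using the openai-whisper Python package, which exposes the log-Mel spectrogram and decoding steps directly; the audio file name is a placeholder.

```python
import whisper

# Load a pre-trained checkpoint (sizes range from "tiny" to "large")
model = whisper.load_model("base")

# Load the audio, pad or trim it to a 30-second window,
# and compute the log-Mel spectrogram the encoder expects
audio = whisper.load_audio("interview.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The special tokens also drive language identification
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Decode the spectrogram into text
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```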
Widely considered the best open-source ASR available, Whisper has several strengths that make it a robust and useful speech recognition system.
First, its out-of-the-box accuracy is among the best. It can handle various accents, background noise, and technical language, thanks to its large and diverse training data. It can also perform multiple tasks with a single model, such as transcribing and translating speech, which reduces the need for separate models and pipelines, as you'd need with most other models – say, if you want to transcribe French speech and translate it into English in real time. Moreover, Whisper can achieve high accuracy and performance across different speech domains and languages, even without additional fine-tuning.
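Using the same openai-whisper package as in the sketch above, switching between transcription and translation is just a matter of the task argument; the file name and model size here are assumptions.

```python
import whisper

model = whisper.load_model("medium")

# Transcribe French audio as French text ("french_interview.mp3" is a placeholder)
transcript = model.transcribe("french_interview.mp3", language="fr")

# Same model, same audio: translate the speech directly into English
translation = model.transcribe("french_interview.mp3", task="translate")

print(transcript["text"])
print(translation["text"])
```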
On the downside, Whisper’s “vanilla” version as provided by OpenAI was intended as a research tool, and comes with some limitations that make it unsuitable for most projects requiring enterprise scale and versatility. The model comes with input limitations, doesn't include essential features like speaker diarization and word-level timestamps, and tends to hallucinate in a way that makes its output unsuitable for high-precision CRM enrichment and LLM-powered tasks. For more info on Whisper strengths and limitations, see this post on how we optimized Whisper for enterprises.
2. DeepSpeech
DeepSpeech is an open-source speech recognition system developed by Mozilla in 2017 and based on Baidu's algorithm of the same name.
DeepSpeech uses a deep neural network to convert audio into text, and an N-gram language model to improve the accuracy and fluency of the transcription. The two modules are trained on independent data and work together like a transcriber coupled with a spelling and grammar checker. DeepSpeech can be used for both training and inference, and it supports multiple languages and platforms. Beyond being multilingual, DeepSpeech has the advantage of being quite flexible and, in particular, retrainable.
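Inference with the deepspeech Python package looks roughly like the following sketch; the model and scorer file names correspond to Mozilla's released checkpoints and are assumptions here, as is the 16 kHz mono WAV input.

```python
import wave
import numpy as np
import deepspeech

# Acoustic model plus external scorer (the N-gram language model)
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit mono PCM audio
with wave.open("voice_command.wav", "rb") as wav_file:
    frames = wav_file.readframes(wav_file.getnframes())
audio = np.frombuffer(frames, dtype=np.int16)

print(model.stt(audio))
```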
That said, DeepSpeech comes with serious practical limitations compared to state-of-the-art systems like Whisper that came later. As discussed in Mozilla's forums on DeepSpeech, its recordings are limited to just 10 seconds, restricting its use to applications such as command processing rather than long-form transcription.
This limit also affects the text corpus, which ends up quite small at around 14 words (~100 characters) per sentence. Developers have therefore reported a need to split sentences and remove common words and sub-sentences to accelerate training. As of April 2024, there is a move to extend the audio recording limit to 20 seconds, but even this falls short of what the state of the art offers.
3. Wav2vec
Wav2vec, from Meta, is a speech recognition toolkit specialized in training on unlabeled data, in an attempt to cover as much of the language space as possible, including languages that are poorly represented in the annotated datasets usually employed for supervised training.
The motivation behind Wav2vec is that ASR technology is only available for a small fraction of the thousands of languages and dialects spoken around the world, because traditional systems need to be trained on large amounts of speech audio annotated with transcriptions, which is impossible to obtain in sufficient amounts for every possible form of speech.
To achieve this goal, Wav2vec is built around a self-supervised model trained to predict tiny (25 ms) units of masked audio as tokens, akin to how large language models are trained to predict short syllable-like tokens, but with targets that correspond to individual sounds. Since the set of possible individual sounds is much smaller than that of syllable sounds, the model can focus on the building blocks of languages and “understand” more of them with a single processing core.
Purportedly, ASR systems trained in this way could outperform the best semi-supervised methods as of 2020, even with 100 times less labeled training data. While a more recent comparison against newer models would be desirable, this is still impressive and can find applications, especially as an open-source solution for processing audio from underrepresented languages.
Hands-on, you can either train Wav2vec models with custom labeled or unlabeled data, or simply use the pre-trained models, which already cover around 40 languages, as in the sketch below.
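Here is a minimal inference sketch using the Wav2vec 2.0 checkpoints distributed through Hugging Face transformers; the checkpoint name and audio file are examples, and the input is assumed to be mono audio resampled to 16 kHz.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# A pre-trained English checkpoint; multilingual XLSR checkpoints also exist
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Wav2vec 2.0 expects 16 kHz mono audio
waveform, sample_rate = torchaudio.load("speech_sample.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.squeeze(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per audio frame
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```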
4. Kaldi
Kaldi is a toolkit for speech recognition written in C++, born out of the idea of having modern and flexible code that is easy to modify and extend. Importantly, the Kaldi toolkit attempts to provide its algorithms in the most generic and modular form possible, to maximize flexibility and reusability (even to other AI-based code outside Kaldi's own scope).
Kaldi is not exactly an out-of-the-box ASR system, but rather helps developers build speech recognition systems from widely available databases such as those provided by the Linguistic Data Consortium (LDC). As such, Kaldi-based ASR programs can be built to run on regular computers, on Android devices, and even inside web browsers via WebAssembly. The latter is probably somewhat limited, yet interesting because it could pave the way for fully cross-device compatible ASR systems built into web clients that require no server-side processing at all.
5. SpeechBrain
SpeechBrain is an “all-in-one” speech toolkit. This means it doesn't just do ASR, but covers the whole set of tasks related to conversational AI: speech recognition, speech synthesis, large language models, and the other elements required for natural speech-based interaction with a computer or a chatbot.
While Python and PyTorch are common in the open-source ASR ecosystem – Whisper itself was trained with PyTorch, for example – SpeechBrain was designed from the outset as an open-source PyTorch toolkit aimed at making the development of conversational AI easier.
As opposed to most alternatives, which, albeit open source, are mainly driven by the private sector, SpeechBrain originates from a strong academic background spanning over 30 universities worldwide and counts on a large support community. This community has shared over 200 competitive training recipes on more than 40 datasets, supporting 20 speech and text processing tasks. Over 100 pre-trained models hosted on Hugging Face can easily be plugged in and used as-is or fine-tuned.
Importantly, SpeechBrain supports both training from scratch and fine-tuning pre-trained models such as OpenAI's Whisper (ASR) and GPT-2 (language model), or Meta's Wav2vec (ASR) and Llama 2 (language model).
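Loading one of those pre-trained models takes only a few lines; the sketch below uses a CRDNN + RNN-LM recipe hosted on Hugging Face, with the audio file name as a placeholder (newer SpeechBrain releases expose the same classes under speechbrain.inference).

```python
from speechbrain.pretrained import EncoderDecoderASR

# Download a pre-trained ASR recipe from Hugging Face and cache it locally
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)

print(asr_model.transcribe_file("lecture_clip.wav"))
```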
A downside of largely uncontrolled community contribution is that the quality of many models can be uneven; extensive testing may therefore be needed to ensure safe and scalable usage in enterprise environments.
Open source speech-to-text: practical considerations
While open-source ASR models offer unparalleled flexibility and accessibility, deploying them comes with practical considerations that developers and organizations must carefully evaluate.
One significant factor to consider is deployment cost, which encompasses aspects such as hardware requirements, the need for AI expertise, and scaling limitations. Unlike proprietary solutions that may come with dedicated support and optimization, open-source models often require substantial computational resources for training and inference. In addition, some level of AI expertise is usually required to adapt a one-size-fits-all open-source model to one's specific use case and needs.
Another important consideration is that most, if not all, open-source models come with a limited feature set and require further optimization work on or around their core architecture to make them suitable for enterprise environments. For companies in search of a hassle-free experience, open source is rarely the route to take.
That's where specialized speech-to-text APIs come in handy: they come as batteries-included packages, with a range of pre-built features and enough room for customization, in a form that frees you from the overhead of computing infrastructure, certifications, and various hidden costs, while giving you direct access to expert advice.
In this article, we reviewed the best open-source STT models for enterprise use, selected for accuracy and performance as well as for flexibility, customizability, and ease of integration into software development pipelines, all backed by good community support.
Among these solutions, we highlighted systems that are ready to use out of the box, others that need tailored training, and some that do not offer complete STT packages but rather smaller pieces of the audio processing and transcription engine, giving you low-level access but a less straightforward solution. Hopefully, this overview has given you a better idea of which model makes the most sense for your specific needs.
Article written in collaboration with Luciano Abriata, PhD.
About Gladia
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.