A review of the best ASR engines and the models powering them in 2024
Published on
Apr 2024
Automatic Speech Recognition (ASR), also known as speech-to-text or audio transcription, is a technology that converts spoken language stored in an audio or video file into written text.
ASR aims to smooth communication between computers and human users in two ways: by allowing computers to understand spoken commands and by transcribing text from speech-based sources – for example, from dictation or meeting recordings – with the goal of storing the transcripts in ways amenable to processing and display.
Although ASR has been around for decades, the real breakthrough that made transcription widely accessible took place in the last ten years, fuelled by the increasing availability of training data, falling hardware costs, and the rise of deep learning models.
These factors enabled the development of powerful ASR systems that power a variety of business applications, such as voice assistants, semantic search, automated call bots, long-form dictation, automatic captions in social media apps, note-taking tools in virtual meetings, and more.
Here, we will review the modern ASR engines and models that power them, focusing on what we put forward as the most advanced systems in speech recognition today, leading the way to practically useful enterprise-grade speech recognition.
Brief history of ASR technology
ASR research began in the mid-20th century, marked by early attempts to use computers for language processing. Initial acoustic models struggled with accents, dialects, homophones (words that sound the same but have different meanings and are often spelled differently) and speech nuances such as topic-specific jargon, local expressions, etc.
As the field saw advancements in statistical models and symbolic natural language processing (NLP), some software vendors began experimenting with voice-based functionalities in their products. However, it wasn’t until the breakthrough in the 2010s with the rise of machine learning and the introduction of transformers – introduced in 2017 by a research group from Google – that high-quality speech recognition was set on a path to true commoditization.
By leveraging ‘attention mechanisms’, transformers can capture long-range dependencies when processing input. In speech recognition, this means that the recognition of a word is assisted by the recognition of the preceding and following words of a sentence or command, which in practice results in far better ‘contextualized’ – as opposed to purely acoustic-based – recognition of speech as a whole.
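To make the idea concrete, here is a minimal, self-contained NumPy sketch of scaled dot-product attention – the core operation inside transformers – showing how every position in a sequence can directly weigh information from every other position. The shapes and values are purely illustrative and do not correspond to any particular ASR model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of all value rows,
    so distant positions can directly influence each other."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the sequence
    return weights @ V                                   # context-aware representations

# Toy example: 5 "frames" (or tokens), each an 8-dimensional embedding
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(x, x, x)              # self-attention
print(out.shape)  # (5, 8)
```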
The integration of transformers into ASR architectures entailed a shift from mere speech recognition to broader language understanding, which developed hand in hand with AI models specialized in language processing. This language understanding includes not only transcribing audio into text or commands but also adapting the output on the fly as context is detected, identifying different speakers (i.e., “speaker diarization”, which only the most advanced ASR systems can do today), adding timestamps with word-level resolution (“word-level timestamps”), filtering profanity and filler words, translating on the fly, handling punctuation, and more.
How ASR systems work
Speech-to-text AI involves a complex process with multiple stages and AI models working in tandem. Before we delve into the main subject of this post, here’s a super-brief overview of the key stages in speech recognition. For a deep dive, check out our introduction to speech-to-text.
The first step is pre-processing the audio input through noise reduction and other techniques to improve its quality and suitability for downstream processing. The cleaned-up audio then undergoes feature extraction, during which the audio signal is converted into a representation that the model performing the actual conversion to text can work with.
Next, a specialized module extracts phonemes, the smallest units of sound. These are then processed by some kind of language model that makes decisions about word sequences and decodes them into the sequence of words or tokens that makes up the raw transcript. Finally, at least one additional step takes place to improve accuracy and coherence, address errors, and format the final output.
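As a concrete illustration of the feature-extraction step, here is a minimal Python sketch – assuming the librosa library and a hypothetical local audio file – that converts a raw waveform into the log-Mel spectrogram representation most modern acoustic models consume:

```python
import librosa
import numpy as np

# Load the audio and resample to 16 kHz, the rate most ASR models expect
waveform, sr = librosa.load("meeting_recording.wav", sr=16000)  # placeholder file

# Convert the raw signal into a log-Mel spectrogram: a time x frequency
# representation that the acoustic model ingests instead of raw samples
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (80 Mel bands, number of ~10 ms frames)
```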
In modern ASR, most, if not all, of these steps are coupled and consist of AI modules built around transformers that preserve long-range relationships in the information they process – in other words, relationships between distant words – in order to shape the transcribed text more accurately and coherently.
But of course, each exact ASR model will handle these steps differently, resulting in different accuracies, speeds, and tolerance to problems in the input audio. In addition, each ASR system will use different kinds and amounts of data for training, which also impacts their performance and properties, such as balance across different languages.
Review of the best ASR engines in 2024 and the models powering them
Over the last decade, ASR systems have evolved to achieve unprecedented accuracy, to process hours of speech in minutes across multiple languages, and to offer audio intelligence features that derive valuable insights from transcripts.
Here are the leading open-source and commercial ASR providers in 2024, based on their overall performance in enterprise use cases. Note that we focus primarily on commercial speech-to-text providers in this article and plan to cover open-source alternatives in a separate post soon.
Whisper ASR by OpenAI
OpenAI’s open-source model, Whisper, is the star of the moment, having set a new standard in ASR for accuracy and flexibility.
Trained on an impressive 680,000 hours of audio, this ASR model excels in both accuracy and speed, transcribing hours of audio in a few minutes.
When released in September 2022, Whisper was considered a breakthrough for multilingual transcription across 99 languages and for its ability to translate speech from any of those languages into English. Its latest version, Whisper v3, was released in November 2023 and was heralded as an improvement over its predecessor in terms of accuracy in under-represented languages.
How it works
Whisper’s architecture is based on an end-to-end approach implemented as an encoder-decoder transformer. The cleaned-up input audio is split into 30-second chunks, converted into a spectrogram, and fed into an encoder.
Then, a decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and translation if required.
Put differently, the pre-trained transformer architecture enables the model to grasp the broader context of sentences transcribed and “fill in” the gaps in the transcript based on this understanding. In that sense, Whisper ASR can be said to leverage generative AI techniques to convert spoken language into written text.
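For readers who want to see this pipeline in action, here is a minimal sketch using the open-source whisper Python package (the audio file name is a placeholder); it mirrors the chunking, log-Mel spectrogram, and decoding flow described above:

```python
import whisper  # pip install openai-whisper

# Model sizes range from "tiny" (~39M parameters) to "large" (~1.5B);
# smaller models are faster, larger ones more accurate
model = whisper.load_model("base")

# Low-level flow mirroring the description above
audio = whisper.load_audio("interview.mp3")         # placeholder file
audio = whisper.pad_or_trim(audio)                  # fit a 30-second window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)               # language identification
print("Detected language:", max(probs, key=probs.get))

options = whisper.DecodingOptions()                 # add task="translate" for X -> English
result = whisper.decode(model, mel, options)
print(result.text)

# Or, more simply, for a full file of arbitrary length:
print(model.transcribe("interview.mp3")["text"])
```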
Power and limitations
The hundreds of thousands of hours of audio used to train Whisper consist of roughly two-thirds English-language content from various online sources. While this makes the model first and foremost optimized for English, it is by default more robust to varied accents, local expressions, and other language nuances than most alternatives.
Moreover, it’s one of the few ASR systems that can automatically detect language on the go, whereas most other ASR systems need the language to be defined upfront.
Among Whisper’s biggest shortcomings are hallucinations. Several users have reported that the most recent model, Whisper v3 – announced as an improvement over its predecessor, v2, in terms of accuracy in under-represented languages – hallucinates more than the previous version.
The good news is that Whisper models are open-source, so they can be tweaked, adapted, and improved at will for specific needs, for instance by fine-tuning them for specific languages and jargon or extending their feature set. Available in five sizes ranging from “just” 39 million to over 1.5 billion parameters, Whisper allows developers to balance computational cost, speed, and accuracy as required by the intended use.
That said, when deploying the Whisper model(s) in-house for enterprise projects, one should be ready to assume significant costs resulting from high computational requirements and advanced engineering resources required to boost the core model’s capabilities at scale.
In contrast, Gladia’s enterprise-grade API, based on an enhanced proprietary version of Whisper, is a plug-and-play alternative, enabling any enterprise to overcome the model’s limitations while minimizing your time-to-market.
Our latest ASR model, Whisper-Zero, is more accurate, faster, more versatile, and more affordable than the original. It removes virtually all hallucinations, has enhanced support for accents, and delivers exceptional results in complex, noisy environments like call centers. To learn more about our approach and advantages, check out this dedicated post or sign up for your free trial directly here.
Google Speech-to-Text (Google Cloud)
Google Speech-to-Text is Google’s suite of cloud computing systems that provide modular services for computation, data storage, data analytics, management, and AI. Cloud AI services include text-to-speech and speech-to-text (ASR) tools. These power Google’s Assistant, voice-based search systems, voice-assisted translation, voice control in programs like Google Maps, automated transcription on YouTube, and more.
How it works
Google’s ASR services leverage a variety of models that tap into the company’s advanced AI capabilities. Although their exact nature is not disclosed, they naturally build on the giant’s own research in the field.
Older blog entries disclose how their early ASR systems worked. Those, however, predate the era of transformers, and more recent blog posts from Google Research include descriptions of Google Brain’s Conformer, a convolution-augmented transformer for speech recognition.
The latest information from Google Research’s blog explains that its latest ASR model is the Universal Speech Model (USM). USM is actually a family of speech models with 2 billion parameters, trained on 12 million hours of speech and 28 billion sentences of text spanning over 300 languages. As made clear in the blog entry for USM, the underlying model is still the Conformer, which applies attention, feed-forward, and convolutional modules: the input spectrogram of the speech signal is first processed by convolutional sub-sampling, after which a series of Conformer blocks and a projection layer produce the final output.
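USM itself is not a model you download; in practice, Google’s ASR is accessed through the Cloud Speech-to-Text client libraries. Here is a minimal Python sketch, assuming the google-cloud-speech package, a configured service account, and a short, hypothetical local audio file:

```python
from google.cloud import speech  # pip install google-cloud-speech

client = speech.SpeechClient()   # uses GOOGLE_APPLICATION_CREDENTIALS from the environment

with open("command.wav", "rb") as f:          # placeholder file, under one minute of audio
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```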
Power and limitations
Trained with data from over 300 languages and dialects, Google’s ASR system could, in theory, become the most multilingual system to date – especially as they aim to reach coverage for around 1,000 languages in the near future! However, this claim should be taken with a pinch of salt: given the current status of speech-to-text technology, achieving a sufficient level of accuracy for practical use in all these languages is highly challenging.
More realistically, today the main advantage of using Google’s ASR system is its renowned track record. As a major player in the cloud computing industry, Google offers a versatile integrated solution that caters to various AI and machine learning needs.
In principle, Google’s ASR systems should be largely scalable thanks to its googolplex resources. However, in practice, many of our clients have come to us after repeatedly experiencing poor quality and very long waiting times.
On top of this, users of Google’s ASR systems may encounter higher costs compared to smaller, highly specialized ASR providers. Its billing is also inconvenient in the way it rounds up ASR time: for example, 15.14 seconds of speech-to-text conversion is billed as 30 seconds. Additionally, customization options are more limited than on platforms focused on audio intelligence functionalities.
Probably the biggest plus of Google’s speech-to-text systems is their native incorporation into Google Meet and Google Chrome. Developers can use the Web Speech API in JavaScript to add speech recognition and speech synthesis capabilities to their apps with simple code that even non-experts can write, at zero cost and without requiring any API keys (see this or this example).
However, note that this free service is, in our experience, far from the state of the art, with rather poor accuracy compared to other models and substantial downtime – probably because it does not apply Google’s latest models but some older ones.
Besides, the ASR system available in Chrome is not customizable (for example, grammar extensions specified through the Web Speech API have been known not to work for years). And, of course, this ASR system only works if your users access your web page with the Chrome browser.
Azure (Microsoft’s Cloud Computing Platform)
Microsoft is another tech giant offering its own ASR technology with Azure Speech-to-Text. In line with the expected state of the art, Azure offers speaker diarization, word-level timestamps, and other features, supporting both live and pre-recorded audio. Its big plus is probably its customizability, as detailed below.
How it works
If Google revealed little about how its USM system for ASR works, Microsoft is even more tight-lipped about its proprietary speech recognition technology.
Power and limitations
According to the company’s website, Azure transcribes audio to text in more than 100 languages and variants, performs speaker diarization to determine who said what and when, accepts live or recorded audio, cleans up punctuation, and applies relevant formatting to the outputs.
Developers can integrate Azure’s power in several programming languages, and like Google’s solution, there is not only extensive documentation but also a big user base with whom to consult.
Azure’s most interesting feature, which sets it apart from the other big providers, is that the model can be customized to enhance accuracy for domain-specific terminology. In particular, you can upload audio data and transcripts to obtain automatically fine-tuned models. Moreover, using your own files created in Office 365, you can optimize speech recognition accuracy for their content, resulting in a model tailored to your specific needs or organization.
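As a rough illustration, here is how a custom-trained model is typically selected with the Azure Speech SDK for Python (the subscription key, region, endpoint ID, and file name are placeholders):

```python
import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
# Point the recognizer at a Custom Speech model fine-tuned on your own audio and transcripts
speech_config.endpoint_id = "YOUR_CUSTOM_MODEL_ENDPOINT_ID"
speech_config.speech_recognition_language = "en-US"

audio_config = speechsdk.audio.AudioConfig(filename="support_call.wav")  # placeholder file
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()  # recognizes the first utterance in the file
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```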
Amazon Transcribe
First released in November 2017, Amazon’s transcription tool, Amazon Transcribe, has been growing steadily more robust over the years to support a variety of languages and address various business verticals with custom vocabularies and industry-specific tools, like healthcare and call centers.
Most recently, the transcription engine was upgraded in late 2023. As announced on AWS’s blog, the service now supports more than 100 languages (vs. 39 before), thanks to a new foundation ASR model trained on millions of hours of unlabelled multilingual audio and aimed primarily at increasing the system’s accuracy in historically under-represented languages.
How it works
As with Microsoft, little has been disclosed about the inner workings of Amazon’s proprietary engine. We have not yet been able to test the model ourselves, but here is what we know about the latest release. First, the model aims to improve performance evenly across the 100 languages, achieved thanks to training recipes optimized, as reported, “through smart data sampling to balance the training data between languages.” According to the blog post, this has helped pay-as-you-go Amazon Transcribe improve overall accuracy by 20-50% in most languages.
Moreover, the latest release expands several key features across all 100+ languages, including automatic punctuation, custom vocabulary, automatic language identification, speaker diarization, word-level confidence scores, and custom vocabulary filters.
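In practice, these features are exposed through the AWS SDK. Here is a minimal boto3 sketch – the bucket, file, and job names are placeholders – that starts an asynchronous transcription job with automatic language identification and speaker diarization:

```python
import time
import boto3  # pip install boto3; assumes AWS credentials are configured

transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="demo-job-001",                         # placeholder name
    Media={"MediaFileUri": "s3://my-bucket/meeting.mp3"},        # placeholder S3 object
    MediaFormat="mp3",
    IdentifyLanguage=True,                                       # let the service detect the language
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 4}, # speaker diarization
)

# Poll until the asynchronous job finishes, then fetch the transcript location
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="demo-job-001")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```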
Power and limitations
With the new model, Amazon consolidates its track record of delivering a one-stop-shop transcription experience, combining speech-to-text with a suite of additional features related to ease of use, customization, user safety, and privacy.
The company’s clear edge lies in its direct access to large volumes of proprietary data and overall cloud infrastructure enabling scale. The company’s targeting of specific verticals is likewise promising, with the call center analytics branch being allegedly powered by generative AI models that summarize interactions between an agent and a customer.
Now, the downsides: as with all big tech providers, long processing times are a widely reported inconvenience. Alongside Google, AWS Transcribe is among the most expensive commercial alternatives, charging over $1 per hour of transcription. While it remains to be seen whether the new model truly delivers better accuracy, the previous price-quality ratio was mostly disadvantageous to users.
We also look forward to receiving additional feedback on the latest model’s real performance across languages, since even best-in-class multilingual models like Whisper struggle to achieve consistent accuracy across all 99 of their supported languages.
Whisper-Zero by Gladia
A newcomer in the space, Gladia launched its audio transcription API in late June 2023 with the mission to provide zero-hassle, fast, accurate, and multilingual transcription and audio intelligence AI to companies, and to offer the best production-grade solution based on Whisper ASR. Its key differentiation lies in the many optimisations made to the open-source model, aimed at overcoming its historical limitations and enhancing its capabilities to fit enterprise scale and needs.
Gladia's latest model, Whisper-Zero, released in December 2023 and built using over 1.5 million hours of real-life audio, eliminates hallucinations and drastically improves accuracy.
How it works
As mentioned previously, Whisper comes in five different sizes, with larger models delivering better accuracy at the expense of longer processing times and higher computational costs.
Gladia's main objective was to find the perfect balance on the Whisper spectrum and turn the model into a top-quality, fast, and economically feasible transcription tool for enterprise clients. Today, Whisper-Zero's transcription engine is based on a proprietary hybrid architecture, where optimization occurs at all key stages of the end-to-end transcription process described previously.
The resulting system operates as an ML ensemble, where each step is powered by the enhanced Whisper architecture combined with several additional AI models. As a result of this multi-tier optimization, the model achieves accuracy and speed superior to both the open-source and API versions of Whisper, especially in real-life use cases.
Power and limitations
Gladia API takes the best of Whisper ASR and overcomes its limitations. More specifically, any company wishing to find Whisper-level accuracy at scale, with more features included, would find the right fit here.
Among the core achievements of the Gladia team is removing 99.9% of hallucinations - an infamous shortcoming of the original Whisper known to one too many users. Moreover, the company paid special attention to integrating high-value proprietary features like speaker diarization, code-switching, and live transcription with timestamps.
Their product also stands out for multilingual capabilities, with Whisper-Zero integrating a new language model that resolves the historically underserved pain point of thick accents. However, the API is currently more limited in audio intelligence features compared to other alternatives.
Conformer-2 by Assembly AI
AssemblyAI offers a secure and scalable API for ASR-related tasks, from basic speech recognition to automatic transcription and speech summarization, aiming to stand out for ease of use and specialization in call centers and media applications.
How it works
The main model powering AssemblyAI’s ASR systems is Conformer-2, recently released as an evolution of their Conformer-1. These systems rely on Google Brain’s Conformer which, as introduced above for Google’s ASR systems, combines a transformer architecture with convolutional layers – a prominent type of deep neural network used in ASR.
As AssemblyAI explains on its website, the regular Conformer architecture is suboptimal in terms of computational and memory efficiency. The attention mechanisms essential for capturing and retaining long-term information in an input sequence are in fact a well-known bottleneck of these processing units. AssemblyAI’s Conformer-2 presumably addresses this limitation, achieving a more efficient and scalable system.
AssemblyAI’s current ASR model, Conformer-2, is trained on 1.1 million hours of English audio data, improving robustness on problematic words like proper nouns and alphanumerics, in addition to being more robust to noise and having lower latency than its predecessor, Conformer-1.
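As an illustration of how this is typically consumed, here is a minimal sketch with AssemblyAI’s Python SDK (the API key and audio URL are placeholders):

```python
import assemblyai as aai  # pip install assemblyai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# Enable speaker labelling; word-level timestamps come back by default
config = aai.TranscriptionConfig(speaker_labels=True)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.com/podcast.mp3", config=config)  # placeholder URL

print(transcript.text)
for utterance in transcript.utterances or []:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```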
Power and limitations
AssemblyAI’s API includes features such as speaker counting and labelling, word-level timestamps and confidence scores, profanity filtering, custom vocabulary (a now-standard feature used to incorporate subject-specific jargon), and automated language detection. The system is generally appreciated by users for its consistent accuracy in English.
On the downside, some users have reported inconsistent performance in languages other than English. Language detection and code-switching, in particular, may present challenges. Users should consider these factors, especially in applications requiring robust multilingual handling.
Update: As of April 2024, Assembly AI's latest core model is Universal-1, boasting a series of enhancements relative to Conformer-2.
Nova by Deepgram
Deepgram offers speech-to-text conversion and audio intelligence products, including automatic summarization systems powered by language models.
Using Deepgram, developers can process live streams or recorded audio and transcribe it quickly to power use cases in media transcription, conversational AI, media analytics, automated contact centers, and more.
How it works
Deepgram’s ASR system relies on Nova (at version 2 since September 2023), the company’s proprietary model based on two transformer sub-networks. One transformer encodes audio into a sequence of audio embeddings, and a second acts as a language transformer that decodes the audio embeddings into text given some initial context from an input prompt. Information flows between these two sub-networks through an attention mechanism.
The transformers used by Nova rely on proprietary technology and have been modified from archetypal transformers to correct weak points that led to suboptimal accuracy and speed in audio transcription.
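Nova itself is proprietary, but the hosted model can be reached over a simple REST call. Here is a hedged sketch using Python’s requests against Deepgram’s pre-recorded audio endpoint (the API key and audio URL are placeholders; check Deepgram’s documentation for the current parameters):

```python
import requests  # pip install requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder

response = requests.post(
    "https://api.deepgram.com/v1/listen",
    params={"model": "nova-2", "smart_format": "true", "diarize": "true"},
    headers={
        "Authorization": f"Token {DEEPGRAM_API_KEY}",
        "Content-Type": "application/json",
    },
    json={"url": "https://example.com/interview.wav"},  # placeholder remote audio file
    timeout=300,
)
response.raise_for_status()

data = response.json()
# The transcript sits under results -> channels -> alternatives
print(data["results"]["channels"][0]["alternatives"][0]["transcript"])
```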
Power and limitations
Deepgram stands out for its remarkable processing speed, making it one of the fastest API providers in the market.
On the flip side, Deepgram’s users may encounter potential trade-offs with accuracy, especially in scenarios where the need for rapid processing may impact the precision of transcription results, such as in live transcription or batch processing of a large number of audio files. Users should then carefully assess their specific requirements and the trade-offs associated with speed and accuracy.
Another limitation is that Deepgram’s focus seems to be primarily on English, and while it does support other languages, it might not be as accurate for languages with less extensive training data, just as observed with Whisper.
NB: Beyond Nova, the company offers the possibility of training customized models for unique use cases. Yet one should remember that fine-tuning a model is an investment-heavy solution to a problem that can often be addressed with less costly yet effective techniques, such as injecting context through prompts or custom vocabulary.
Ursa by Speechmatics
Speechmatics is the first in our series of specialized contenders. The company developed proprietary ASR and NLP models combined in a single API that powers systems for transcription with language recognition, translation, summarization, and more.
Speechmatics aims to distinguish its products with robust support for over 45 languages and dialects. One of its updates, in 2018, made Speechmatics the first ASR provider to develop a comprehensive language pack that incorporates all dialects and accents of English into one single model.
Speechmatics’ website showcases example applications in the creation of automated support center solutions, closed captioning from files or live feeds, monitoring mentions and content, automated notetaking and analytics in virtual meetings, and more.
How it works
Speechmatics’ use of AI systems for ASR technology dates back to the 1980s when its founder pioneered the approach as part of his studies at Cambridge University.
Their latest model, called Ursa, is powered by three main modules. First, a self-supervised model trained on over 1 million hours of unlabeled audio across 49 languages learns acoustic representations of speech.
Second, these representations of speech are processed through a network trained on paired audio-transcript data to produce phoneme probabilities.
Third, these phoneme probabilities are mapped into the output transcript using a large language model that identifies the most likely sequence of words given the input phonemes.
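To make that third stage more tangible, here is a purely conceptual Python sketch – not Speechmatics’ actual code – of the idea: given per-frame phoneme probabilities from the acoustic stages, choose the word whose acoustic match, weighted by a toy language-model prior, scores highest. The lexicon, priors, and one-phoneme-per-frame alignment are deliberate simplifications.

```python
import numpy as np

# Toy lexicon: each word spelled as a phoneme sequence (illustrative only)
LEXICON = {
    "eye": ["AY"],
    "ice": ["AY", "S"],
    "see": ["S", "IY"],
    "sea": ["S", "IY"],
}
# Toy unigram "language model": prior probability of each word
LM = {"eye": 0.2, "ice": 0.3, "see": 0.35, "sea": 0.15}

PHONEMES = ["AY", "S", "IY"]

def word_acoustic_score(word, frame_probs):
    """Product of per-frame probabilities of the word's phonemes,
    assuming frames align one-to-one with phonemes (a toy simplification)."""
    phones = LEXICON[word]
    if len(phones) != len(frame_probs):
        return 0.0
    idx = [PHONEMES.index(p) for p in phones]
    return float(np.prod([frame_probs[t][i] for t, i in enumerate(idx)]))

def decode(frame_probs):
    """Pick the word whose acoustic score times LM prior is highest."""
    scored = {w: word_acoustic_score(w, frame_probs) * LM[w] for w in LEXICON}
    return max(scored, key=scored.get)

# Two frames of phoneme posteriors that are acoustically ambiguous between
# "see" and "sea"; the language-model prior breaks the tie toward "see".
frames = np.array([[0.05, 0.90, 0.05],   # frame 1: almost surely "S"
                   [0.10, 0.05, 0.85]])  # frame 2: almost surely "IY"
print(decode(frames))  # -> "see"
```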
Speechmatics reports its ASR system was optimized for GPUs in order to support operation at scale. Although this is a standard prerequisite for all enterprise-grade APIs we discuss here, this optimization further allows Ursa to process a large number of audio streams in parallel and, in particular, manage multiple voice inputs when performing speaker diarization.
Power and limitations
Speechmatics presents its ASR system as the world’s most accurate, claiming substantial performance accuracy gains compared to Microsoft’s Azure-based ASR and OpenAI’s Whisper. However, the latest comparison available on their website is from March 2023, when Whisper v3 was not out yet.
What you can check yourself right away is how Speechmatics’ ASR system performs with real-time captioning and translation services. Their website shows an example of a live feed from the BBC World Service transcribed nearly perfectly in real-time with minimal delay, accompanied by a live translation.
On the downside, some users have reported difficulties scaling up the model to handle large volumes of transcription requests, which can be a limitation for enterprise-level applications. Besides, as we have covered in this previous blog post, Speechmatics’ pricing structure can be complex and may lead to unexpected costs.
Conclusions
In this exploration of leading ASR systems in 2024, we’ve highlighted key features and considerations for the options offered by OpenAI (Whisper), Google, Microsoft Azure, Amazon, Gladia, AssemblyAI, Deepgram, and Speechmatics. The evaluation encompassed factors such as speed, accuracy, language support, features, and pricing.
Of course, the choice of an ASR system should align with specific organizational needs and use cases:
Bear in mind that accuracy and speed in ASR tend to trade off against each other, meaning that you usually need to sacrifice one – at least to some extent – to maximize the other. That said, it is the engineering mastery of a specific provider that determines which API manages to strike the right balance between the two at the most affordable cost.
Pay special attention not only to costs, speed, and accuracy but also to details such as coverage and accuracy for languages of relevance to your application beyond English. Bear in mind that commercial claims as to the extent of language support don’t always correspond to reality.
Decide how important it is that the model correctly identifies different speakers.
Decide whether your application requires built-in audio intelligence features like summarization, or whether you can run those separately.
Assess to what extent your use case requires additional guidance with custom vocabularies or even fine-tuning, or perhaps you might need an automated check of profanity and filler words, handling of punctuation, etc.
Probably the best approach is to try each ASR system, run independent benchmarks on your own datasets, and make an informed decision based on those tests.
Our study indicates that for enterprises prioritizing speed and customizable AI models, Deepgram emerges as a strong option for highly specialized use cases, with English as the main language.
For users valuing ease of use and specialization in call centers and media applications, Assembly AI stands out with its user-friendly APIs and a proprietary model built on Google Brain’s Conformer architecture. Or you could go right away with Google’s own solutions, with the caveats mentioned.
Speechmatics appears most suitable for those requiring multi-language support and real-time translation, although users should be mindful of its complex pricing structure and scalability challenges.
Whisper, with its enhanced accuracy and live transcription capabilities, is recommended for applications prioritizing transcription precision, even though it may have some limitations in audio intelligence features and in handling non-English audio – limitations that Gladia’s customized solution addresses while striking a balance between accuracy, speed, flexibility, and cost.
Article written in collaboration with Luciano Abriata, PhD.
About Gladia
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.