Best speech-to-text APIs in 2023

Published on
Apr 2024

Speech-to-text (STT), also known as automatic speech or voice recognition, is a type of AI technology that recognizes human speech in audio or video and transcribes it into written output. In the form of an API, it can power a variety of applications, ranging from call bots to voice assistants to AI-powered virtual meeting platforms.

The commercial landscape for speech-to-text APIs today consists of the big cloud providers Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, the famous outlier OpenAI, and specialized contenders like Gladia, Assembly AI, and others featured below.

In fact, after years of a Big Tech quasi-monopoly, the market is slowly but surely being taken over by this handful of specialized API providers, who are making AI transcription more accessible to the public while substantially improving its core performance and capabilities.

Our Top 5 list will focus on this market segment, being the most suitable one for enterprise-grade clients who seek production-ready, affordable APIs that can handle large volumes without compromising quality. At the end, we outline the reasons for excluding cloud providers from the final selection. 

How to evaluate speech-to-text providers

Independent STT providers design their APIs on top of existing open-source alternatives and/or build their own models to power those. Whatever the underlying architecture, a common challenge to us all is striking a perfect balance between speed, accuracy, and cost.

In fact, speed and accuracy in ASR pull in opposite directions: improving one usually means compromising the other, at least to some extent. Cost, in turn, follows from how the two are balanced. Ultimately, it is a provider's engineering mastery in addressing this tradeoff that determines the overall quality of its API transcription.

When deciding on a provider, it is essential to first thoroughly consider your needs, intended use case, and, of course, the budget. We recommend focusing on the following key parameters:

  • Speed and accuracy. While some use cases require speed over accuracy (e.g., call centers, financial sector), others are less time-sensitive but are less tolerant to errors, be it due to the sensitive nature of information (e.g., medical prescriptions) or the high-precision nature of the activity (e.g., content editing). The types of models used by the provider can be interesting to look into when evaluating overall quality. 
  • Language support. Includes the ability to transcribe with equal accuracy from a wide range of languages, automatically detect the language, perform code-switching, and translate if needed. Many providers support x number of languages in theory but fail to deliver in practice.
  • Features. Ranging from live transcription to diarization and word-level timestamps. You may wish to find a one-stop-shop provider for all your needs, or pick the one with the best core transcription capabilities and build custom features like summarization around it with the help of complementary tools like ChatGPT.
  • Pricing. The cheapest solutions will come with tradeoffs, no matter what the providers may say on their websites. The realities of hardware and software costs in ASR make it impossible to deliver top-quality transcription with an overly low-cost approach, even with the economies of scale that some of the more established providers may enjoy.
  • Privacy. Given the highly confidential nature of enterprise audio data, it’s becoming increasingly important to verify how an API provider approaches data privacy.

Our Top 5 speech-to-text APIs in 2023

Here comes our list of the best five speech-to-text API providers today, based on market research, user interviews, and surveys conducted by Gladia.

Assembly AI 

Assembly AI is many users' default choice. Founded in San Francisco, it's a prominent player with over six years in the speech-to-text space, distinguished by a variety of features that cater to different user needs.

Best known for:

  • Ease of use. Assembly AI provides user-friendly APIs, SDKs, and a Playground, and is the first player in the space to have built an LLM-based framework, LeMUR, enabling companies to build chat-based apps on spoken data.
  • Specialization. The company has a proven track record in both call center and media segments and is especially known in the podcast world.
  • Models. Assembly’s proprietary version of Google’s Conformer leverages transformer and convolutional layers for speech recognition at scale. 

Room for improvement

While Assembly AI generally provides high accuracy, some users have reported inconsistent performance when it comes to language detection, code-switching, and other language support issues. Word-level timestamps in live transcription are generally reported as inaccurate, too.

Update: As of April 2024, Assembly AI's latest model is Universal-1, boasting a series of enhancements relative to its previous core model Conformer-2.


Deepgram

Deepgram is another California-based pioneer of the speech-to-text API market that stands out for its customizable AI models and speed.

Best known for:

  • Speed. Hands down, it is among the absolute fastest API providers on the market, taking about 20 seconds to transcribe an hour of audio in batch mode.
  • Models. Deepgram’s proprietary model Nova is based on a Transformer architecture and has recently undergone a second upgrade. It is available in English.
  • Customization. Deepgram's à la carte offer allows users with sufficient in-house expertise and resources to fine-tune custom models to adapt them to specific industry or use-case requirements.
  • Pricing. The company offers highly competitive pricing with custom discounts.
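To put the speed figure above in perspective, batch speed is often expressed as a real-time factor (processing time divided by audio duration). The sketch below is only illustrative arithmetic: the function names are made up for this example, the 20-second figure comes from the bullet above, and the 25-minute figure is the one cited for Big Tech providers later in this article. Actual numbers will vary with audio, load, and plan.

```python
# Illustrative arithmetic only; function names are hypothetical helpers,
# not part of any provider's SDK.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def backlog_processing_hours(audio_hours: float, rtf: float) -> float:
    """Hours of sequential processing needed for a backlog at a given RTF."""
    return audio_hours * rtf


ONE_HOUR = 3600.0

# ~20 seconds per hour of audio (figure quoted in this section)
fast_rtf = real_time_factor(20.0, ONE_HOUR)
# ~25 minutes per hour of audio (figure quoted later for Big Tech providers)
slow_rtf = real_time_factor(25 * 60.0, ONE_HOUR)

if __name__ == "__main__":
    for label, rtf in (("20 s/h", fast_rtf), ("25 min/h", slow_rtf)):
        hours = backlog_processing_hours(1000, rtf)
        print(f"{label}: RTF={rtf:.4f}, {1 / rtf:.1f}x real time, "
              f"1000 h backlog in {hours:.1f} h")
```

The second helper assumes files are processed one after another; with parallel requests the wall-clock time would be lower.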

Room for improvement

Due to the speed-accuracy tradeoff explained earlier, Deepgram’s combination of fast inference and low pricing may come at the cost of accuracy for both batch and live transcription. Moreover, its primary focus is on English, and while it does support other languages, it is not as versatile for languages with less extensive training data.


Speechmatics

Speechmatics is a UK-based speech recognition company with a strong global presence.

Best known for:

  • Multi-language support. Speechmatics boasts support for 45+ languages and dialects, allowing a broad range of international applications.
  • Real-time translation. The company offers a still rare and technically challenging live translation service, making it suitable for use cases like media broadcasting.

Room for improvement

Speechmatics' pricing structure, including different pricing for ‘basic’ vs. ‘enhanced’ quality of transcription, can be complex and may lead to unexpected costs. Some users also report difficulties in scaling up to handle a large volume of live transcription requests, which can be a limitation for enterprise-level applications.

The fourth provider on our list is a well-known speech-to-text service from the US that stands out for its use of AI- and human-generated transcripts.

Best known for:

  • Human reviews. The platform offers a hybrid approach, combining automated transcription with human reviewers, to guarantee enhanced accuracy and quality of transcripts.

Room for improvement

The human-reviewed transcription service, while accurate, can introduce delays in transcription: it may take as long as 20 minutes for languages other than English. The hybrid model also makes it relatively expensive compared to fully automated options. Finally, its language support is more limited than that of the alternatives.


Gladia

Gladia is a French startup founded in 2022. You may not have seen our name on many similar lists just yet, which gives us a great excuse to include ourselves in this one. Jokes aside, we may be a newcomer, but the quality of our speech-to-text API is among the best according to clients (including at TechCrunch) who had struggled with the alternatives in the past.

Best known for:

  • Models. Gladia enhanced and optimized OpenAI’s Whisper ASR for improved accuracy, variety of functions, and ability to handle larger volumes.
  • Live transcription. State-of-the-art live transcription is distinguished by exceptional accuracy and adjacent features, including live word-level timestamps. 
  • Language support. We support translation to and from 99 languages and extend this support to all other features in the portfolio like speaker diarization.
  • Code-switching. Occurs when speakers switch between several languages during the same call or event. Gladia is the only API on the market that provides a reliable and accurate code-switching feature. 
  • Privacy policy. As the only EU-based company on the list, Gladia stands out for its GDPR-compliant data policy.

Room for improvement

The API is currently more limited in audio intelligence features compared to other alternatives. 

Why exclude Microsoft Azure, Google Cloud Speech-to-Text, etc., from this selection?

All in all, the market is varied. There may not be one right API or open-source alternative for all: everything depends on your use case, your ideal tradeoff between speed, quality, and price, whether you require extra audio intelligence features, and many other factors.

However, we can conclude that across the key parameters proposed above (speed, accuracy, supported languages, price, and extra features), in most cases, Big Tech providers aren’t offering the best value for your money.

Take accuracy, for instance. Whereas Big Tech providers have a word error rate (WER) of 10%-18%, most startups and specialized providers are within the 1%-10% WER range. Big Tech providers are also slower: processing an hour-long piece of audio can take them up to 25 minutes. In addition, their pricing tends to be significantly higher than that of the alternatives listed above, which makes them less suitable at scale.
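For readers who want to check such WER figures on their own audio, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation, not any provider's benchmarking method: the function name is hypothetical, and the normalization here (lowercasing and whitespace splitting) is a simplifying assumption, since real evaluations usually also normalize punctuation, numbers, and spelling.

```python
# Minimal WER sketch: word-level edit distance divided by reference length.
# Normalization is deliberately simplistic (assumption for illustration).

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the first i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over lazy dog"
    # One substitution and one deletion over nine reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because WER depends heavily on the test audio and on the normalization rules, figures from different sources are only comparable when both are the same.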

While the reasons for this vary, one explanation is that ASR isn’t Big Tech’s core business; providers such as Amazon, Microsoft, and Google provide ASR services as part of a broader package in their suite. If we had to pick one, however, it appears that Microsoft Azure can provide a generally fulfilling performance for some use cases, as confirmed by our users.

As for the reasons for excluding OpenAI's open-source Whisper, we cover them in a separate, detailed breakdown of the model's practical limitations for enterprise users.

Closing remarks

Taking stock, Gladia, Assembly AI, Deepgram, Speechmatics, and the hybrid human-reviewed service covered above all offer valuable speech-to-text solutions. Understanding their advantages and limitations is key to making an informed choice based on your specific needs and priorities.

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life use cases and distinguished by consistent accuracy, extended language support, and state-of-the-art features, including speaker diarization and word-level timestamps. If you'd like to speak with us, feel free to book a demo or sign up directly below.

