OpenAI Whisper vs Google Speech-to-Text vs Amazon Transcribe: The ASR rundown
Published on Sep 2024
Speech recognition models and APIs are crucial in building apps for various industries, including healthcare, customer service, online meetings, and entertainment.
Among the many options available today to power such apps are Big Tech providers, open-source models, and specialized API providers. Each of these offers unique features and capabilities, catering to the diverse needs of businesses and developers.
In this article, we will comprehensively compare some of the most popular platforms in the space: OpenAI Whisper, Google Speech-To-Text, and Amazon Transcribe. They will be evaluated on several criteria, such as accuracy, speed, features, language support, pricing structure, ecosystem compatibility, privacy, and security, to help you make an informed decision based on your project requirements.
What is Whisper ASR?
Whisper ASR is a neural net released by OpenAI. Trained on 680,000 hours of multilingual audio, the model became highly popular among open-source communities and businesses for its accuracy and multilingual capabilities.
In addition to AI transcription, or speech-to-text, the model can translate from 99 languages to English. Available in five sizes – ranging from 39 million to 1.55 billion parameters – the Whisper family allows developers to strike the appropriate balance between accuracy and processing time. One can add custom vocabulary to Whisper or fine-tune the model for additional languages, specialized jargon, and more. You can learn more about Whisper in this deep dive.
Currently, Whisper is available as both an open-source (OSS) model and an API. In this article, we will reference both to make the comparison as comprehensive as possible.
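To give a sense of how lightweight the open-source route can be, here is a minimal sketch of local transcription with the openai-whisper Python package (assumes the package and ffmpeg are installed; the audio file name is hypothetical):

```python
import whisper

# Pick one of the available sizes: tiny, base, small, medium, large
model = whisper.load_model("base")

# Transcribe a local file; the spoken language is detected automatically
result = model.transcribe("meeting.mp3")
print(result["text"])
```

Swapping "base" for a larger checkpoint trades processing time for accuracy, which is exactly the size/accuracy balance mentioned above.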
What is Google Speech-to-Text?
Google Speech-to-Text is the speech recognition (ASR) service within Google Cloud, Google's suite of cloud computing services for computation, data storage, data analytics, management, and AI. The same technology powers Google's Assistant, voice-based search systems, voice-assisted translation, voice control in programs like Google Maps, automated transcription on YouTube, and more.
Google’s ASR services leverage various models that tap into the company’s advanced AI capabilities. The latest information from Google Research’s blog explains that its latest ASR model is the Universal Speech Model (USM). This model is a family of speech models with 2 billion parameters, trained on 12 million hours of speech and 28 billion sentences of text spanning over 300 languages.
As made clear in their blog entry for USM, the underlying architecture is a Conformer, which combines attention, feed-forward, and convolutional modules. The input speech spectrogram is first reduced via convolutional sub-sampling, after which a series of Conformer blocks and a projection layer produce the final output.
What is Amazon Transcribe?
First released in November 2017, Amazon Transcribe has grown steadily more robust over the years, adding support for a variety of languages and addressing specific business verticals, such as healthcare and call centers, with custom vocabularies and industry-specific tools.
Their transcription engine was recently upgraded. As announced on AWS's blog, the service now supports more than 100 languages (vs. 31 in 2019), thanks to a new foundation ASR model trained on millions of hours of unlabelled multilingual audio and aimed primarily at increasing the system's accuracy in historically underrepresented languages. Beyond this, little has been disclosed about the inner workings of Amazon's proprietary engine on official channels.
‘The final tally’: which API is best for what?
At first glance, these models and APIs all seem remarkably robust: fast, accurate, and able to support multiple languages and functionalities. In this comparison, we want to look beyond the headlines and explore what OpenAI, Google, and Amazon have to offer developers, product owners, and business executives.
Accuracy and speed
Accuracy is paramount when it comes to speech recognition. In professional settings, precise transcription is a prerequisite for building great customer-facing apps, as transcripts are increasingly fed into LLM-based features and synchronized with other business applications, like CRMs.
The same goes for speed: while not all use cases require an instant output response from the API, given the recent advances in speech recognition tech, it is expected that transcription shouldn’t take longer than a few minutes for an hour-long audio file.
The most common metric used to assess transcription quality in this industry is Word Error Rate (WER).
There are essentially two ways to calculate WER. One is to run benchmarks on publicly available datasets like Common Voice or LibriSpeech. The other is to use proprietary datasets, which may include real-world professional audio from complex environments, such as noisy call center recordings.
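For intuition, WER is the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR output, divided by the number of reference words. Here is a minimal, self-contained sketch in Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167 (1 error / 6 words)
```

In practice, transcripts are usually normalized first (lowercasing, punctuation removal), and the choice of normalization alone can shift reported WER noticeably.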
The publicly available numbers vary because of the different methods and datasets used to calculate WER. Here's what we found, based on secondary research and primary data, when it comes to WER in English:
OpenAI's Whisper-v2, the most accurate Whisper model, has a median WER of 8.06% and takes 10-30 minutes on average to transcribe one hour of audio. As noted previously, a big advantage of Whisper is that the model comes in various sizes, enabling developers to strike the right balance between speed and accuracy.
Google Speech-to-Text appears less accurate, with reported WER ranging from 16.51% to 20.63% and an average speed of 20-30 mins for an hour of transcription.
Amazon Transcribe is similar, with 18.42%-22% WER at roughly the same speed as Google.
Verdict: OpenAI’s Whisper is hands-down the most accurate of the three and will, in most cases, generate output much faster than both Google and Amazon, based on our tests and those conducted by our users and customers. That said, the open-source model is prone to hallucinations, which must be mitigated to meet enterprise needs.
Additional features: live transcription and audio intelligence
Real-time transcription (also known as 'live streaming') is among the most crucial features for use cases requiring immediate response, such as live captioning at events. However, given the need for high accuracy and low latency, it's notoriously tricky to master.
Speech-to-text alone – batch or live – may not be enough for businesses looking to build advanced voice-based apps. This is where the so-called audio intelligence features come in.
Besides live transcription, these audio intelligence features are becoming increasingly crucial, as they provide more valuable insights into the transcripts, such as who has spoken when, and help secure sensitive data.
Verdict: The Big Tech players have a clear advantage vis-a-vis open-source Whisper regarding additional audio intelligence features. Amazon, in particular, provides the most complete offer, including custom models and specialized modules for medical and call center transcriptions.
Language support
Given the ever-growing importance of multilingual APIs to serve a global customer base, another crucial aspect to consider when comparing our Big 3 is language support.
A quick heads-up before you dive into this: not all languages are supported equally, and it helps to distinguish between multilingual transcription (i.e. the ability to transcribe audio in several languages) and audio intelligence (i.e. language support extending to post-transcription analysis).
Whisper ASR is considered state-of-the-art for multilingual transcription. It can transcribe in 98 languages, including English, Spanish, French, and German, and translate from any of the supported languages to English. That said, its English WER is noticeably lower than for other languages, since English makes up roughly two-thirds of its training data. Automatic language detection comes as part of the package, too.
Contrary to other providers, OpenAI has been transparent about the real WER for each of its languages. As stated in the Whisper API docs, any language with a word error rate above 50% is not listed as 'supported'. The model will still return results for unlisted languages, but the quality will be substantially lower.
For users working in underserved languages from the Whisper list, fine-tuning the model for a specific language and/or accent is always an option. To learn more about this technique in ASR generally and for Whisper specifically, read our recent guide on fine-tuning ASR models.
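To illustrate the automatic language detection mentioned above, here is a sketch following the pattern from the open-source whisper package's README (the file name is hypothetical):

```python
import whisper

model = whisper.load_model("base")

# Load the audio and trim/pad it to the 30-second window the model expects
audio = whisper.load_audio("interview.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and ask the model which language it hears
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))
```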
Google Speech-to-Text boasts support for over 125 languages and dialects, including the many variations of the English language. The model powering the service was trained on noisy data, varying accents, and different languages, which helps it handle noisy audio inputs and challenging speech patterns, and transcribe multilingual content effectively. Google also supports customization of the model through a feature known as model adaptation, which lets users tune it to recognize specific words, phrases, and classes.
The latest version of Amazon Transcribe, released at the end of last year, is positioned as its most multilingual to date, with more than 100 languages supported (up from 31 previously). It also offers automatic language identification, custom vocabularies, and noise reduction for audio inputs.
Verdict: Considering language support and customization, Google Speech-To-Text surpasses its competitors by offering the most languages on paper. Given the company's English WER discussed above, however, the official lineup should be taken with a grain of salt. Whisper, meanwhile, is widely recognized as having the highest accuracy across the largest number of languages in real-life use cases.
Cost and pricing structure
As explained in one of our earlier blog posts on navigating the commercial speech-to-text landscape, you want to find the sweet spot between performance, scalability, and price when searching for an optimal STT provider.
Price is a significant factor for businesses evaluating speech recognition solutions, with the market offering everything from 'low-cost' transcription-only packages to premium 'all-features-included' solutions. Given the large volumes of audio that audio-first companies need to transcribe, the overall monthly budget can climb pretty quickly, even with the more standardized offers.
Here’s where our contenders stand on the pricing grid:
Verdict: While all three services follow the same pricing structure with a usage-based model, Big Tech companies are renowned for charging a premium for their products; the premium doesn’t necessarily come with a corresponding quality gain. In that sense, OpenAI Whisper offers the best price-quality ratio here with a $0.006/min charge.
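As a quick back-of-the-envelope check, using the $0.006/min Whisper API figure above (the monthly volume below is hypothetical; Google's and Amazon's rates depend on tier and features, so we leave them out):

```python
# Hypothetical volume: 1,000 hours of audio transcribed per month
hours_per_month = 1_000
price_per_minute = 0.006  # USD, Whisper API rate cited above

monthly_cost = hours_per_month * 60 * price_per_minute
print(f"${monthly_cost:,.2f} per month")  # $360.00 per month
```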
Integration and ecosystem compatibility
Seamless integration with existing workflows and compatibility with other software platforms are other essential considerations for an API user. Here, you'd be looking for an optimal onboarding and developer experience, as well as thorough and easy-to-navigate documentation.
OpenAI Whisper offers broad integration capabilities: the API comes with official client libraries for popular programming languages such as Python and JavaScript (Node.js), while the open-source model is built on PyTorch.
Its cloud-based architecture ensures easy deployment and scalability, enabling smooth integration with existing systems and applications. Performing a simple transcription using the OpenAI Whisper API requires less than six lines of code and involves minimal complexity, and the language of the audio file is detected automatically. The documentation is clear and easy to navigate, and getting started is straightforward.
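Here is a minimal sketch of such a call using the official openai Python library (assumes an OPENAI_API_KEY environment variable; the file name is hypothetical):

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Send a local audio file to the hosted Whisper model
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```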
Google Speech-to-Text is integrated with Google Cloud Platform services, which helps to facilitate data sharing and synchronization across multiple Google products and services. Its RESTful API and client libraries also support various programming languages, simplifying the development process for users familiar with Google's ecosystem.
The code needed for a simple transcription with the Google Speech-to-Text model is about as simple as Whisper's; the only added complexities are setting up a configuration to connect to Google Cloud and specifying the language codes. Despite the $300 worth of free credits offered on sign-up, the onboarding process via their playground can feel counterintuitive at first. On the flip side, you get a range of customization options, a comprehensive overview of your transcription history, and detailed info on each file. On our first attempt, we also ran into difficulties testing live transcription in Python, with Google's dense, arborescent docs providing no easy-to-find solutions.
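For comparison, here is a sketch of a simple synchronous transcription with the google-cloud-speech client library (assumes a GCP project with the Speech-to-Text API enabled and application credentials configured; the file name is hypothetical):

```python
from google.cloud import speech

client = speech.SpeechClient()  # uses GOOGLE_APPLICATION_CREDENTIALS

with open("meeting.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",  # the expected language must be specified up front
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```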
Amazon Transcribe comes with an SDK that supports various programming languages, such as .NET, Go, Java, JavaScript, PHP, Python, and Ruby. The documentation is exhaustive and contains code snippets showing how to use the various features of the service, whether you're using the CLI, an SDK, or the console. Be ready for a very long and tedious first registration process, though, requiring exhaustive personal and credit card details, followed by a developer experience similar to Google's in terms of documentation density.
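And here is a sketch of a batch job with boto3 (assumes AWS credentials are configured and the audio file already sits in an S3 bucket; the bucket and job names are hypothetical):

```python
import time

import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Kick off an asynchronous batch transcription job
transcribe.start_transcription_job(
    TranscriptionJobName="demo-meeting-job",
    Media={"MediaFileUri": "s3://my-bucket/meeting.mp3"},
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Poll until the job finishes
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="demo-meeting-job")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

# On success, the transcript JSON can be downloaded from this URI
if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```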
Verdict: Whisper is the winner for us here, being the only API with a truly intuitive onboarding that doesn't require a credit card upfront.
Privacy and security
Commercial enterprises that collect and process customer data must have measures in place to ensure the privacy and security of the data they collect, use, or store.
Amazon Transcribe employs comprehensive security measures to safeguard your data in transit over the internet, providing authentication via the Transport Layer Security (TLS) cryptographic protocol, with encryption facilitated through AWS certificates. For data at rest, Amazon Transcribe uses AWS KMS keys for enhanced encryption: users can encrypt their input media during the transcription process, and integration with AWS KMS allows the output to be encrypted when making requests.
Using the Whisper OSS model does not come with data security or privacy risks, as all audio data and the resulting transcripts are stored locally on your device. If you use the OpenAI Whisper API, the audio uploads and the transcriptions are stored for up to 30 days for service improvement and abuse detection. However, users can opt for Zero Data Retention, ensuring all data is deleted after processing. At this time, additional information on OpenAI's data encryption practices is unavailable.
Verdict: This is a tie between Amazon and Google, as both services follow best practices to safeguard your data: Amazon Transcribe ensures proper encryption of your data in transit using TLS and at rest using KMS, while Google doesn't store your data, opting instead to process it in memory. For OpenAI, while being able to opt out of data retention is a good thing, there is little to no information available on the practices it follows to protect stored data.
Final remarks
Let's do a quick rundown. OpenAI Whisper API is ideal for users who prioritize speed, accuracy, and affordability. For users more concerned about language support, we can vouch for Whisper, having optimized the model firsthand. However, it is essential to note that there are hidden costs associated with Whisper OSS, and the speed & quality will come down to your hardware specifications.
For users in search of an all-features-included experience, Big Tech may be a better alternative, albeit at a potential cost in quality and speed. Amazon Transcribe in particular is worth trying for specific use cases such as call center analytics and medical transcription, provided you have the budget for it.
About Gladia
At Gladia, we built an optimized version of Whisper in the form of an API, which takes the best of the core model, removes its shortcomings, and extends its feature set at scale for enterprise use. Our latest hallucination-free model, Whisper-Zero, is distinguished by exceptional accuracy in noisy and multilingual environments.