Here’s how we optimized Whisper ASR for enterprise scale

Published on Mar 2024

In this article, we give you a breakdown of features and parameters that distinguish Gladia API from both open-source and API versions of OpenAI’s Whisper ASR model. 


  • Gladia API is powered by an enhanced and optimized version of OpenAI’s Whisper ASR that is more accurate, faster, multi-functional, and affordable than the original.
  • Owing to its generic training data, limited features, and high in-house deployment costs, open-source Whisper shows serious limitations when applied in real-life professional use cases that require accurate ASR at scale.
  • With our API, your company can enjoy the best of Whisper without limitations and with a number of value-adding features like speaker diarization and word-level timestamps. 

Released in 2022, Whisper ASR took the speech-to-text community by storm. Trained on 680,000 hours of multitask supervised data, OpenAI’s groundbreaking neural net set a new standard for speech recognition in terms of accuracy, robustness and multilingual capabilities. 

Having identified it as the best state-of-the-art tool to address the infamous shortcomings of speech recognition systems, we set out to optimize Whisper ASR, starting with its core performance parameters: speed, accuracy, and hallucinations.

Then, we moved on to add high-value functionalities, including real-time transcription, speaker diarization, word-level timestamps, and code-switching. Our API also supports a wider variety of audio input in terms of both format and size.

Today, Gladia’s enterprise-grade API lets you use Whisper ASR as a service: in the cloud and without technical limitations. Our latest model, Whisper-Zero, removes 99.9% of hallucinations and offers enhanced accuracy and language detection. You can try it for free or continue reading.

Why optimize Whisper to begin with?

Whisper ASR was a truly remarkable achievement in the field of automatic speech recognition. Its pre-trained transformer architecture enables the model to grasp the broader context of sentences transcribed and “fill in” the gaps in the transcript based on this understanding.

In that sense, Whisper ASR can be said to leverage generative AI techniques to convert spoken language into written text. More specifically, it transcribes speech in a two-step encoder-decoder process: first, it generates a mathematical representation of the audio, which it then decodes using a language model. This involves processing the audio through the model's layers to predict the most likely sequence of text tokens — basic units of text used for processing.
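To make the two-step process more concrete, here is a deliberately tiny sketch of an encode-then-decode loop with greedy token selection. Everything in it is an illustrative assumption: the three-entry vocabulary, the `encode` and `score_next` stand-ins, and the sample values are hypothetical and bear no relation to Whisper's actual weights, tokenizer, or API. Only the overall shape (encode the audio once, then pick the most likely next token until an end marker appears) mirrors the description above.

```python
from typing import List, Sequence

# Hypothetical toy vocabulary; real Whisper uses a much larger subword vocabulary.
VOCAB = ["<eot>", "hello", "world"]


def encode(audio: Sequence[float]) -> List[float]:
    """Step 1 (toy stand-in for the encoder): summarize the audio as two numbers,
    the mean of the first half and the mean of the second half of the samples."""
    half = len(audio) // 2
    first, second = audio[:half], audio[half:]
    return [sum(first) / max(len(first), 1), sum(second) / max(len(second), 1)]


def score_next(features: List[float], tokens: List[str]) -> List[float]:
    """Toy stand-in for the decoder: one score per vocabulary entry,
    given the audio features and the tokens produced so far."""
    step = len(tokens)
    if step < len(features):
        # Arbitrary rule for illustration: positive feature favors "hello".
        return [0.0, 1.0, 0.5] if features[step] > 0 else [0.0, 0.5, 1.0]
    return [1.0, 0.0, 0.0]  # nothing left to describe: favor end-of-transcript


def greedy_decode(features: List[float], max_len: int = 10) -> List[str]:
    """Step 2: repeatedly append the highest-scoring token until <eot>."""
    tokens: List[str] = []
    for _ in range(max_len):
        scores = score_next(features, tokens)
        best = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[best] == "<eot>":
            break
        tokens.append(VOCAB[best])
    return tokens


audio = [0.2, 0.4, -0.3, -0.1]           # made-up samples
transcript = greedy_decode(encode(audio))
print(transcript)
```

Because each token is chosen based on the tokens before it, a decoder of this kind can produce fluent text even where the audio gives it little to go on, which is relevant to the hallucination issue discussed later.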

This approach has enabled it to achieve great accuracy and made it possible to process and translate speech from multiple languages. 

That said, the base Whisper model has serious limitations. While it does represent an exceptional achievement in the field, it was never designed by OpenAI as a production-ready enterprise tool. Case in point, most users we’ve spoken to who had attempted to deploy Whisper themselves experienced a number of constraints related to performance, features, and volume.

The key advantage of a speech-to-text API based on Whisper is in the work done behind the scenes to overcome these shortcomings. Commercial vendors leverage AI engineering know-how to optimize and fine-tune the model for you, making it more computationally efficient to run, more accurate in real-life business use cases, and more adept at handling diverse business applications. APIs are also quick to integrate, easy to use, and require no previous AI expertise.

That doesn’t mean hosting Whisper by yourself is never an option. If you simply want to create a product demo, launch an indie project, or conduct research on AI speech recognition, then Whisper’s open-source model might be perfectly adequate for you. Alternatively, Gladia's free plan allows you to experience an enhanced version of Whisper for 10 hours/month. 

According to an internal user survey, any enterprise-grade project that needs over 100 hours of transcription per month risks running into issues with self-hosted Whisper, given the human, hardware, and maintenance costs involved.

Feel free to check out our past blog posts on the factors to consider when choosing between building in-house and buying a commercial speech-to-text API, and on estimating the total cost of ownership of in-house Whisper.

Whisper vs. Gladia: Ultimate Comparison 

Inference and accuracy

Whisper comes in five different sizes – ranging from 39M to 1.55B parameters – with larger models delivering better accuracy at the expense of longer processing times and higher computational costs. 

Our main objective was to find the perfect balance on the Whisper spectrum and turn the model into a top-quality, fast and economically feasible transcription tool for enterprise clients.

The initial focus was on speed: the alpha version of our product, powered by an optimized Whisper Large-v2, achieved an unprecedented 10 seconds of processing per hour of audio. To put this in perspective, vanilla Whisper could take as long as 25-40 minutes with the same model!

We then learned that the biggest pain point for our users when using speech-to-text was related to errors and hallucinations in transcription. Following this feedback, we have prioritized improving the accuracy of Whisper while remaining one of the fastest speech-to-text APIs on the market at 60 sec per hour of audio on average. 
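As a quick sanity check on these figures, the sketch below converts each processing time into a speed-up over real time (seconds of audio handled per second of compute). The function name is ours; the numbers are the ones quoted above.

```python
def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return audio_seconds / processing_seconds


HOUR = 3600  # seconds of audio in one hour

alpha_version = speedup_over_real_time(HOUR, 10)        # 10 s per hour of audio
current_average = speedup_over_real_time(HOUR, 60)      # 60 s per hour of audio
vanilla_best = speedup_over_real_time(HOUR, 25 * 60)    # 25 min per hour of audio
vanilla_worst = speedup_over_real_time(HOUR, 40 * 60)   # 40 min per hour of audio

print(f"alpha version:   {alpha_version:.0f}x real time")
print(f"current average: {current_average:.0f}x real time")
print(f"vanilla Whisper: {vanilla_worst:.1f}x to {vanilla_best:.1f}x real time")
```

In other words, 10 seconds per hour corresponds to 360x real time and 60 seconds per hour to 60x, while 25-40 minutes per hour is only about 1.5x to 2.4x.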

Today, the quality of Gladia’s proprietary Whisper transcription engine stems from a hybrid architecture, where optimization occurs at all key stages of the end-to-end transcription process.

The resulting system operates as an ML ensemble, where each step is powered by one or several additional AI models. As a result of this multi-tier optimization, we’ve been able to achieve accuracy and speed superior to both open-source and API Whisper.

Training data

Whisper was trained on academic and web data, including closed captions of the kind found on sites like YouTube. This means that the model is statistically biased towards phrases that have little to do with professional audio data.

To address this, we use a combination of ensemble techniques, mixing NLP/NLU and audio models, to make Whisper more sensitive to a wider range of business applications. Our API has a proven track record of delivering top results in real-life use cases, including challenging audio environments (characterized by factors like background noise, overlapping speech, accents, etc.) and specialized domains like medical and legal transcription.


Hallucinations

Hallucinations refer to the phenomenon where the ASR system produces transcriptions or outputs that include words or phrases that were not present in the original audio input.

Because its decoder works like a GPT-style language model, predicting text one token at a time, vanilla Whisper is infamously prone to hallucinations, which occur due to both internal factors (e.g. training data, model architecture, overfitting) and external ones (e.g. complex audio). As a result, Whisper’s output transcripts can be filled with errors, repetitions, and spurious text over silent segments, which makes them impractical for further use and analysis.
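One of these symptoms, back-to-back repetition, is easy to spot with a simple heuristic. The sketch below is an illustration only: it is not how Gladia or Whisper handles hallucinations, and the function name, thresholds, and sample sentences are all assumptions made for the example.

```python
from typing import List, Tuple


def find_repeated_phrases(
    words: List[str], n: int = 3, min_repeats: int = 3
) -> List[Tuple[int, str]]:
    """Return (start_index, phrase) for every n-word phrase that occurs
    at least `min_repeats` times in a row."""
    hits: List[Tuple[int, str]] = []
    i = 0
    while i + n <= len(words):
        phrase = words[i:i + n]
        repeats = 1
        j = i + n
        # Count how many times the same phrase follows itself immediately.
        while words[j:j + n] == phrase:
            repeats += 1
            j += n
        if repeats >= min_repeats:
            hits.append((i, " ".join(phrase)))
            i = j  # skip past the whole repeated run
        else:
            i += 1
    return hits


# Made-up transcripts for illustration.
looping = "thanks for watching thanks for watching thanks for watching see you".split()
clean = "the meeting starts at nine".split()

print(find_repeated_phrases(looping))
print(find_repeated_phrases(clean))
```

A check like this only catches looping output. Plausible-sounding words that were never spoken cannot be detected from the text alone, which is why the problem has to be addressed in the model itself.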

Thanks to ongoing model refinement and updates based on user feedback and real-world data, Gladia has been able to eliminate practically all hallucinations. Our current transcripts are essentially error-free thanks to the latest proprietary model, Whisper-Zero. You can learn more about it on the dedicated landing page.

Audio intelligence features

The core function of Whisper ASR is speech-to-text transcription. Our API gives access to additional must-have features to make the model more suited for building advanced ASR-powered voice applications for your business. 

  • Real-time transcription, enabling businesses to transcribe audio and video as people speak during a call, conference, or live event. Unlike asynchronous batch transcription, it is not available with any of the original Whisper models. Our real-time transcription achieves exceptionally low latency compared to the market average.
  • Speaker diarization, which makes it possible to identify and separate speakers in a transcript. In addition to making transcripts more accessible, it unlocks user insights and is indispensable in multispeaker environments. Gladia’s proprietary diarization mechanism is powered by multiple models for state-of-the-art performance.
  • Word-level timestamps, which assign a timestamp to every word identified in the transcript. The feature is especially valuable in the context of video editing and captions, as well as recording playback. The core Whisper model includes a phrase-level timestamp mechanism that is notoriously rife with missing words and hallucinations. Our optimization enables highly precise word-time alignment, expressed in seconds.
  • Translation, which lets you set a desired output language as you transcribe audio or video files. Whisper supports any-to-English translation and covers a total of 99 languages (62 with the API). Gladia API enables translation from any to any of the Whisper languages and has proven to have lower word error rates in most of them.
  • Code-switching, which makes it possible to produce an accurate final transcript in complicated audio where speakers switch between multiple languages. Base Whisper models were never designed to do that, but we made it possible.

Usage limitations

Due to the design limitations of Whisper described above, there are a number of input, output, and processing requirements that make it impractical to use at scale.

On the input side, Whisper API limits uploads to 25MB per file, and the model itself processes audio in 30-second windows. For anything larger or longer, developers have to split their audio files into smaller chunks themselves. With Whisper API, there’s also a limit on the number of requests per minute.
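To give a sense of what this splitting involves, here is a minimal sketch that computes fixed-length chunk boundaries for a recording. The function name and the naive fixed-size strategy are assumptions for illustration; the 30-second default simply mirrors the figure mentioned above.

```python
from typing import List, Tuple


def chunk_boundaries(duration_s: float, chunk_s: float = 30.0) -> List[Tuple[float, float]]:
    """Split a recording of `duration_s` seconds into consecutive
    (start, end) windows of at most `chunk_s` seconds each."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


# A 75-second recording becomes two full chunks and one shorter one.
print(chunk_boundaries(75.0))
# A one-hour recording needs 120 separate chunks.
print(len(chunk_boundaries(3600.0)))
```

Cutting at fixed offsets like this can split a word or sentence in two, so each chunk is transcribed without the context on the other side of the cut. Avoiding that requires smarter boundaries (for example, cutting on silence) plus logic to stitch the partial transcripts back together.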

Knowing that this can negatively affect the output quality and user experience, we have implemented a number of adjustments to the base model. Result: with Gladia, you can upload files as large as 500MB and as long as 135 minutes, with custom extensions possible upon request. More on how to do this in a dedicated tutorial.

Unlike Whisper, our API can take audio URLs as input and deliver results through callbacks. We provide webhooks and support SRT and VTT output formats optimized for media captions. Gladia also automatically adds punctuation and casing. In short, you don’t have to worry about formats, sizes, and other input parameters – we took care of everything.
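For readers unfamiliar with caption formats, the sketch below shows how word-level timestamps map onto SRT cues. The input field names (`word`, `start`, `end`) and the grouping rule are hypothetical and chosen for the example; they are not Gladia's actual response schema, and the API returns SRT directly, so you would not normally need to write this yourself.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group timestamped words into numbered cues of up to `max_words` words."""
    cues: List[str] = []
    for idx in range(0, len(words), max_words):
        group = words[idx:idx + max_words]
        start = srt_timestamp(float(group[0]["start"]))
        end = srt_timestamp(float(group[-1]["end"]))
        text = " ".join(str(w["word"]) for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues


# Made-up word timings (in seconds) for illustration.
sample_words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(sample_words))
```

The accuracy of the captions depends directly on the accuracy of the word timings: if the timestamps drift, the cues appear on screen too early or too late.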

Pricing and privacy 

Whisper API is available at $0.006 per minute. As mentioned above, the low price tag comes with serious usage limitations and the absence of other critical features. As for the open-source model, the total cost of ownership can range from $300k to $2 million per year. Here’s how we calculated that.

Our pricing starts at $0.612 per hour, with a generous free tier of 10 hours/month, a flexible pay-as-you-go formula, and a volume discount available for enterprise users. While the vast majority of STT vendors bill in 15-second increments, which means rounding up for those who consume less (especially with chatbots), we’re committed to charging precisely per second. As an EU-based company, we also fully protect all client data, in compliance with GDPR. More on our Pricing page.
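The effect of rounding is easiest to see with numbers. The sketch below compares per-second billing with 15-second block billing for a batch of short utterances. To isolate the rounding effect, it assumes the same hourly rate ($0.612) in both cases; the workload of 1,000 four-second chatbot utterances is a made-up example, and the function names are ours.

```python
import math
from typing import Iterable

PRICE_PER_HOUR = 0.612  # USD, starting price quoted above


def cost_per_second_billing(durations_s: Iterable[float], price_per_hour: float) -> float:
    """Bill exactly the number of seconds consumed."""
    return sum(durations_s) * price_per_hour / 3600


def cost_block_billing(
    durations_s: Iterable[float], price_per_hour: float, block_s: int = 15
) -> float:
    """Round every request up to the next multiple of `block_s` seconds."""
    billed_seconds = sum(math.ceil(d / block_s) * block_s for d in durations_s)
    return billed_seconds * price_per_hour / 3600


utterances = [4.0] * 1000  # hypothetical workload: 1,000 utterances of 4 s each

exact = cost_per_second_billing(utterances, PRICE_PER_HOUR)
rounded = cost_block_billing(utterances, PRICE_PER_HOUR)

print(f"per-second billing: ${exact:.2f}")
print(f"15-second blocks:   ${rounded:.2f}")
print(f"overhead factor:    {rounded / exact:.2f}x")
```

For this workload, every 4-second request is billed as 15 seconds under block billing, so the same audio costs 3.75 times more ($2.55 instead of $0.68). The gap shrinks as requests get longer.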

Taking stock, Gladia is an all-batteries-included API designed to respond to the ever-growing needs of a modern enterprise, including the most stringent data protection and privacy standards. We took the best of Whisper and overcame its shortcomings to deliver a fast and responsive transcription and audio intelligence experience tailored to use cases as varied as virtual meeting platforms, podcast hosting and media captions, and call center operations.

Below, you’ll find a complete comparison between vanilla Whisper ASR and Gladia speech-to-text API. To test our product in action, you can sign up for free below or request a demo.

