Thinking of using open-source Whisper ASR? Here are the main factors to consider
Perhaps you’re a developer looking for an Automatic Speech Recognition (ASR) solution for the first time. Or an executive looking for a more affordable, faster, and more accurate alternative to mainstream speech-to-text solutions for your business. Where do you turn?
Generally, you’ve got two options — build a solution in-house using an open-source model like OpenAI’s Whisper, or pick a specialized speech-to-text API provider.
In this blog, we’ll compare the pros and cons of each approach, and provide you with a hands-on guide on how to make the best decision for your project and use case.
Bonus: a handy open source vs. API cheat sheet at the end!
Benefits of using OpenAI Whisper to develop your own ASR solution
The open source revolution in AI
The availability of open source code has been a major catalyst for the adoption of AI. Only a few years ago, it required a team of in-house specialists and computing resources to train AI models for relatively basic tasks. These days, however, engineers can simply turn to open source hubs such as Hugging Face or GitHub and find the code they need to start building.
This is also true for ASR applications. Take Mozilla DeepSpeech, for instance, an open source ASR system available on GitHub that has been used for transcription services and voice assistants; or Kaldi, a widely used open source toolkit for speech recognition that provides a flexible and modular framework for building ASR systems. Kaldi has been used in many research projects and has also been adopted by several commercial speech recognition systems. Other popular open source tools include CMU Sphinx, Wav2Letter++, and Julius.
Of all of the above, open source Whisper was among the biggest breakthroughs in the field. Released by OpenAI in 2022, it has gained significant attention for its accuracy and versatility in speech recognition.
Its deep learning-based approach – a transformer encoder-decoder trained with sequence-to-sequence learning on 680,000 hours of multilingual audio – has opened up a world of possibilities for commercial applications and use cases that rely on ASR technology, including in the previously uncharted multilingual domain.
Whisper empowers developers to create a diverse array of voice-enabled applications, ranging from transcription services and virtual assistants to hands-free controls and speech analytics.
Its open nature encourages collaboration within the developer community, promoting rapid innovation and customization to suit specific project requirements. Moreover, self-hosting is the only approach enabling full security and control over one’s data and infrastructure.
Based on an internal survey of 225 product owners, CTOs, and CPOs conducted earlier this year, we found that about 40% opt for open source models, predominantly Whisper, for their STT solutions – which testifies to the clear value of this route for a range of business applications.
Yet, finding source code is only the start of the journey — adapting it to your specific use case is a whole different story, as it often requires additional fine-tuning and optimization. For companies that have neither the time nor the resources to achieve this, relying on open source alone may not be the ideal move.
This is where the notion of the total cost of ownership comes in handy when deciding whether developing an ASR solution in-house using open source code is the right decision for your business and use case.
Limitations of Whisper ASR
Depending on their needs and use case, companies need to determine whether they have sufficient in-house AI and ML expertise to set up and maintain a model like Whisper in the long run. Otherwise, they risk starting something they cannot build, scale or fine-tune well enough to match their needs.
In a nutshell, there are three main limitations when it comes to building an in-house ASR solution using open source models such as Whisper.
- Open source models are limited. Open source models, however groundbreaking, can be quite inflexible. Adapting them to a specific use case requires additional fine-tuning and optimization. For instance, Whisper’s multilingual abilities do not extend equally to all languages and features, and its translation only works from other languages into English. Overcoming these shortcomings requires proprietary algorithms and/or additional open source models.
- Open source gets problematic at scale. Setting up and maintaining a neural network like Whisper at scale requires significant hardware resources. Whisper’s highest-quality model, Large-v2, is highly intensive in both GPU and memory usage – not to mention the degree of data science and engineering expertise required to make it production-grade, which goes far beyond what is needed to train simpler machine learning models.
- Open source can be very costly. While the cost of running CPUs and GPUs is relatively affordable (from $0.20 per hour), there’s much more that goes into building your own ASR solution in-house using open source software. Once you add the cost of human capital (hiring at least two senior programmers, a data scientist, and a project manager) to the hardware and maintenance costs of self-hosting, your total cost of ownership (TCO) can easily add up to $300k to $2 million per year. Here’s how we estimated that.
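The TCO range above can be sketched as back-of-the-envelope arithmetic. The figures below are illustrative assumptions for a self-hosted deployment, not vendor quotes:

```python
# Back-of-the-envelope TCO estimate for self-hosting an ASR model.
# All rates and salaries below are illustrative assumptions, not quotes.

def yearly_tco(gpu_hourly_rate, gpus, salaries, overhead_rate=0.25):
    """Estimate yearly total cost of ownership in USD.

    gpu_hourly_rate: cost of one GPU instance per hour (cloud pricing)
    gpus: number of GPU instances running 24/7
    salaries: list of yearly salaries for the in-house team
    overhead_rate: fraction added for maintenance, storage, networking
    """
    hardware = gpu_hourly_rate * gpus * 24 * 365
    people = sum(salaries)
    return (hardware + people) * (1 + overhead_rate)

# Modest setup: 2 senior engineers, a data scientist, a project manager,
# and 4 entry-level GPU instances.
low = yearly_tco(0.20, 4, [50_000, 50_000, 70_000, 60_000])

# Larger deployment: higher salaries and 32 high-end GPU instances.
high = yearly_tco(2.50, 32, [150_000, 150_000, 160_000, 130_000])

# low is roughly $296k per year, high roughly $1.6M per year,
# consistent with the $300k to $2 million range above.
```

Even this simplified model shows how headcount, not hardware, dominates the low end of the range, while GPU costs take over at scale.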
So is Whisper worth it?
Yes and no. It all depends on your specific use case and business needs.
Do you simply want to quickly create a product demo or conduct research on AI speech recognition? Then Whisper might be perfectly adequate for you.
However, for more complex use cases – such as call centers offering customer support, or global media companies transcribing large volumes of content – hosting Whisper may not be the best option. You’ll need to divert significant engineering resources, product staff, and overall focus away from your primary product to build the extra functionalities needed.
For instance, Whisper can only process pre-recorded audio, so if you need real-time speech processing, you’ll have to devote substantial resources to optimizing it. The model also requires developers to split audio files into smaller chunks whenever a file exceeds 25 MB, which is both a hassle and a potential source of quality loss. Beyond the most popular languages, its performance is limited and requires custom fine-tuning – the same goes for industry-specific jargon.
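To make the chunking burden concrete, here is a minimal sketch of the planning arithmetic a developer ends up writing around the 25 MB cap. It only computes how many chunks are needed from file size and duration; actually cutting the audio would require a library such as pydub or ffmpeg, and naive cuts mid-word are one way quality degrades:

```python
# Sketch: plan how to split a long recording into chunks that each stay
# under a 25 MB per-request limit. Pure arithmetic; the actual splitting
# would be done with an audio tool such as pydub or ffmpeg.
import math

MAX_BYTES = 25 * 1024 * 1024  # 25 MB cap per uploaded file

def plan_chunks(file_bytes: int, duration_s: float, margin: float = 0.95):
    """Return (num_chunks, seconds_per_chunk) keeping chunks under the cap.

    margin leaves headroom for per-chunk container overhead.
    """
    if file_bytes <= MAX_BYTES:
        return 1, duration_s
    bytes_per_second = file_bytes / duration_s
    max_chunk_s = (MAX_BYTES * margin) / bytes_per_second
    num_chunks = math.ceil(duration_s / max_chunk_s)
    return num_chunks, duration_s / num_chunks

# Example: a 2-hour recording that weighs in at ~55 MB.
n, secs = plan_chunks(file_bytes=55 * 1024 * 1024, duration_s=7200)
# n == 3 chunks of 2400 seconds each, every chunk safely under 25 MB.
```

And this is just the size logic; a production pipeline also has to stitch the per-chunk transcripts back together and handle words cut at chunk boundaries.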
So while they often seem cost-effective to acquire at the beginning, hosting an open source model like Whisper can easily end up being costly at scale when you take into account the TCO, as well as its inherent ‘design limitations’.
Any company with 100+ hours of audio to transcribe per month would quickly begin to suffer under the financial burden of footing the bill for an in-house team of experts, plus increased GPU usage.
Alternative route: Getting your ASR solution through an API
There is an alternative to going the open source route, namely: picking an API provider.
What are speech-to-text APIs?
APIs are cloud-based services that provide developers with pre-built tools and interfaces to convert spoken language (audio or video) into written text. These APIs offer a convenient way to integrate AI speech recognition capabilities into your apps and platforms without the need to develop and maintain an ASR system from scratch. In a nutshell, it’s a batteries-included deal.
Speech-to-text APIs work by leveraging machine learning algorithms and large-scale training data to recognize and transcribe spoken words.
They typically employ a combination of traditional and deep learning models – such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer-based models – to process audio input and generate text output, as well as to perform more advanced functions like summarization or sentiment analysis, which often require generative AI models.
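From the developer’s side, all of that machinery is hidden behind a simple request-and-poll exchange. The sketch below shows the typical asynchronous workflow of a speech-to-text API: submit an audio URL, receive a job id, poll until the transcript is ready. The endpoint paths and field names are hypothetical, and the transport is injected so the flow can be demonstrated offline; real providers differ in details but share the same shape:

```python
# Sketch of a typical async speech-to-text API workflow.
# Endpoint paths and JSON field names are hypothetical examples.
import time

def transcribe(audio_url: str, post, get, poll_interval: float = 0.0):
    """Submit a transcription job and poll until it completes.

    post/get are injected transport callables (e.g. thin wrappers around
    requests.post / requests.get), which also makes the flow testable
    without a network.
    """
    job = post("/v1/transcription", json={"audio_url": audio_url})
    job_id = job["id"]
    while True:
        status = get(f"/v1/transcription/{job_id}")
        if status["status"] == "done":
            return status["transcript"]
        if status["status"] == "error":
            raise RuntimeError(status.get("message", "transcription failed"))
        time.sleep(poll_interval)

# Offline demo with stubbed transport: first poll pending, second done.
responses = iter([{"status": "processing"},
                  {"status": "done", "transcript": "hello world"}])
text = transcribe(
    "https://example.com/call.mp3",
    post=lambda path, json: {"id": "job-1"},
    get=lambda path: next(responses),
)
# text == "hello world"
```

A handful of lines like these replace the entire model-hosting stack described in the previous section.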
Benefits of APIs in ASR
Superior performance thanks to proprietary optimization
When discussing the benefits of APIs, we need to address the speed-quality tradeoff in ASR. While a large neural network may be extremely accurate, it also takes longer to compute, since more computations mean more processing time (and GPU power). Conversely, a simple algorithm may deliver quick results but suffer from low accuracy.
By using a hybrid architecture that combines the best of both worlds, APIs strike a balance between speed and quality while offering customizable options. That way, API providers can offer cost-effective solutions that cater to a wide range of user requirements.
Thanks to a proprietary approach to model optimization, Gladia is among the commercial vendors that have significantly improved Whisper’s base performance, and made it accessible to companies with high volumes of audio at a fraction of the cost.
More specifically, at Gladia we were able to achieve superior performance with Whisper in terms of both latency and accuracy, increase the volume and variety of input and output files/sizes, and expand the model’s scope from base transcription to audio intelligence add-ons, as well as translation to languages other than English. You can try it for free here.
No AI expertise or infrastructure required
Unlike building an ASR solution in-house using open source code, APIs are easy to use: developers without AI expertise can access ready-to-use services with simple API calls, eliminating the need to delve into the intricacies of speech recognition algorithms and infrastructure setup.
Moreover, they scale more easily: since APIs are hosted in the cloud and can handle a high volume of requests, applications built on them can grow effortlessly.
Note that it will take you roughly a year to be production-ready if you choose to build a holistic Audio AI solution in-house. That’s one year during which your competitors are launching offers and getting customers, putting them at a competitive advantage. With an API, you can derive value from AI-powered features from day one of implementation.
You also need to factor the cost of future updates into your overall budget. Given the current pace of AI, a new model will become obsolete in less than three years, requiring additional capital in terms of both software and hardware.
With APIs, however, when it comes to maintaining and updating the ASR solution — including model improvements, bug fixes, and feature enhancements — this is all taken care of by the provider, freeing up valuable time and resources among your in-house developers.
Moreover, thanks to their extensive training data, ASR APIs often support multiple languages and dialects out of the box, eliminating the need for additional language-specific training.
Hosting any type of advanced speech-to-text solution yourself can be a lot more costly than opting for a pre-packaged API. The key reason: the cost of human capital.
After all, proper hosting requires at least two senior software developers, with salaries ranging from $50,000 to $88,000 per year. More realistically, it will take a 'two-pizza team', including a data scientist and a project manager, to sustain a full-scale operation. On top of that, self-hosting comes with a range of hardware and maintenance costs — full breakdown here.
In contrast, pay-as-you-go formulas offered by API providers can be significantly cheaper. Based on our research, commercial pricing starts at $0.26 per hour of transcription and goes all the way up to $1.44 for Big Tech providers.
While the degree of quality varies greatly with each provider, APIs are generally more effective when you’re looking to easily scale your transcription volume and reduce your time to market.
All in all, APIs offer several benefits for companies that lack hardware and/or AI expertise, but still want to embed audio AI features into their product. Having an external vendor doing the pre-integration for you will save you time and money, allowing you to focus on delivering value from day one.
Open Source vs API: Ultimate Comparison
Whether to go the open source route or pick an API provider ultimately boils down to four key factors: available budget, level of in-house expertise, security requirements, and the volume of audio/video data you need to transcribe.
If you’re currently deciding between building your own in-house ASR solution or purchasing an API, here’s a cheat sheet we made with the main pros and cons of each approach.
Want to learn more?
Here are some complimentary resources:
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features, including speaker diarization and word-level timestamps.
How to integrate live transcription API with Twilio to transcribe calls in real time.
Twilio, used by hundreds of thousands of businesses and more than ten million developers worldwide, can now integrate with our live transcription API. The integration makes it easier for users to natively transcribe any phone call in real time while using Twilio. With transcribed text at your disposal, you'll then be able to analyze, archive, and act upon voice data more effectively.
Best speech-to-text APIs in 2023
Speech-to-text (STT), also known as automatic speech or voice recognition, is a type of AI technology that recognizes human speech in audio or video and transcribes it into written output. In the form of an API, it can power a variety of applications, ranging from call bots to voice assistants to AI-powered virtual meeting platforms.
How to build a voice-to-text Discord bot with Gladia's real-time transcription API
Discord, the leading communication platform for gamers and communities, is designed for seamless communication with other users, be it through text channels, DMs, 1-1 calls or even collective voice channels.