Real-time transcription powered by enterprise-grade Whisper ASR

Published on Mar 2024

We’re happy to announce the general availability of Gladia’s live transcription, powered by a proprietary version of Whisper large-v2.

The feature utilizes WebSocket technology to transcribe audio and video in real time, with latency as low as 400 milliseconds, and includes automatic language detection, code-switching, and word-level timestamps.

Highly versatile in its applications, real-time transcription is an especially valuable feature for call bots, live streaming and broadcasting, and virtual meetings, among others. We’re thrilled to deliver this state-of-the-art functionality to clients worldwide as part of our core pricing bundle.

In this blog, we’ll dive into the hidden mechanisms behind live transcription, explore its key challenges and use cases, and explain how to get started with live transcription using Gladia’s speech-to-text API, powered by enhanced Whisper ASR for companies.

Understanding live transcription

In a nutshell, live transcription operates by capturing audio input from sources like microphones or streaming services, processing the audio using Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) technology, and providing a near-instant, continuous stream of transcribed text as the speaker talks. 
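The loop described above — capture a chunk of audio, run recognition on it, and emit text — can be sketched as a simple generator. `transcribe_chunk` is a hypothetical stand-in for a real ASR call, not an actual Gladia function:

```python
from typing import Callable, Iterable, Iterator

def live_transcribe(audio_chunks: Iterable[bytes],
                    transcribe_chunk: Callable[[bytes], str]) -> Iterator[str]:
    """Yield transcribed text for each audio chunk as it is captured."""
    for chunk in audio_chunks:
        text = transcribe_chunk(chunk)   # ASR + NLP on one chunk of audio
        if text:                         # skip silent / empty chunks
            yield text                   # continuous stream of transcript
```

In a real pipeline the chunks would arrive from a microphone or stream rather than a list, but the shape of the loop is the same.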

Owing to the need to transcribe speech in real time, the feature is rife with technical challenges and requires a hybrid ASR/NLP model architecture to yield accurate results consistently.

Gladia API is based on OpenAI’s Whisper ASR. Because the original version of the model doesn’t support live transcription and WebSockets, our approach consisted of reengineering Whisper to add top-tier real-time transcription while keeping its core functionality and quality intact. Today, the quality of Gladia’s proprietary Whisper AI transcription engine is attributed to a hybrid architecture, where optimization occurs at all key stages of its end-to-end transcription process.

Speech recognition & NLP

First, we implement filtering or other pre-processing techniques to optimize the input audio for real-time processing.

Then, we need the system to accurately transcribe and understand speech. Our enhanced language detection, supporting 99+ languages, comes in handy here, automatically determining the language or dialect relevant to your application. We use various NLP techniques to enhance the accuracy of transcription by considering context, grammar, and semantics, as well as adding word-level timestamp metadata if needed.

Some use cases require custom guidance. Owing in part to Whisper’s attention mechanism, our API allows you to add contextual hints to help the system capture, identify, and extract specific information, such as names, dates, or technical terms. To make the model even more sensitive to context, we have developed a context reinjection technique, whereby the last sentence transcribed is used to anticipate the following ones.

Real-time processing

In a live transcription scenario, audio data is continuously generated as a user speaks. The ability to display the transcript as it’s being said with minimal perceptible delay is a key technical requirement for a satisfying end-user experience.

In ASR, the delay between the time a speaker utters a word or phrase and the time the ASR system produces the corresponding transcription result is known as latency.

The acceptable range for low latency is highly dependent on the specific needs of each application and end-user expectations. Our average latency is around 800 milliseconds, making it optimal for most voice assistants, communication platforms, industrial and media applications that require real-time control and response.

Latency range for Gladia real-time transcription API

To ensure a consistent, real-time flow of information, we rely on advanced streaming capabilities and use a combination of WebSocket and VAD technologies.

WebSocket is a protocol that facilitates bidirectional, real-time communication between a client (e.g., a web browser or application) and a server (where our API is hosted), ensuring consistent low-latency audio transmission and updates. Result: immediate access to live transcriptions for end users, with reduced network overhead and resource utilization on both the client and server sides. To learn more about setting up a WebSocket and using it with Gladia, check this Golang tutorial on the topic.
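As a sketch, the client side of such a stream reduces to slicing raw PCM audio into short frames and wrapping each one in a message before sending it over the socket. The chunk size, JSON schema, and field names below are illustrative assumptions, not Gladia’s actual protocol:

```python
import base64
import json

SAMPLE_RATE = 16_000          # 16 kHz mono PCM (assumed input format)
BYTES_PER_SAMPLE = 2          # 16-bit samples
CHUNK_MS = 100                # send roughly 100 ms of audio per frame

def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> list:
    """Slice raw PCM into fixed-duration frames for streaming."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]

def frame_message(chunk: bytes) -> str:
    """Wrap one audio frame in a JSON message (hypothetical schema)."""
    return json.dumps({"type": "audio_chunk",
                       "data": base64.b64encode(chunk).decode("ascii")})

# With a WebSocket client library, each frame would then be sent with
# something like `await ws.send(frame_message(chunk))`, and partial and
# final transcripts read back from `await ws.recv()`.
```

The connection stays open for the whole session, which is what makes the low-latency, bidirectional exchange possible.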

Voice Activity Detection (VAD) is a technology used to determine whether there is significant audio activity (speech) in an audio signal. It analyzes incoming audio data and identifies periods of speech and silence. End-pointing is an especially critical step in VAD, where the system identifies the moment when speech ends or transitions into silence or non-speech sounds, producing more accurate end results. By default, 300 milliseconds of silence (“blank”) in the audio triggers transcription of the preceding speech, and customers can adjust this duration to fit their use case.
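A toy version of energy-based end-pointing, using the 300-millisecond default mentioned above, might look like the following. This is a sketch for intuition only; production VADs use trained models rather than a fixed energy threshold:

```python
SILENCE_MS = 300    # silence required to end-point, mirroring the default above
FRAME_MS = 20       # analysis frame length

def endpoint(frame_energies, threshold=0.01,
             frame_ms=FRAME_MS, silence_ms=SILENCE_MS):
    """Return the index of the frame where speech ends, or None.

    `frame_energies` holds one RMS energy value per `frame_ms` frame.
    Speech is considered finished once `silence_ms` of consecutive
    low-energy frames follow at least one speech frame.
    """
    needed = silence_ms // frame_ms          # e.g. 300 / 20 = 15 frames
    silent_run, seen_speech = 0, False
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            seen_speech, silent_run = True, 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= needed:
                return i - needed + 1        # first frame of the silent run
    return None                              # speaker hasn't paused long enough
```

Once `endpoint` fires, the audio up to that frame can be handed off for final transcription.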

Combining WebSockets with VAD enabled us to build an efficient and responsive live transcription machine, delivering great results in real-life professional use cases in terms of both accuracy and latency.

Important to know 💡

What is the difference between partials and finals?

Partial recognition, or ‘partials’, involves transcribing portions of spoken words or phrases as they are received, even before the speaker has finished speaking the entire word or sentence. Transcribing speech “as you go” in this way makes it possible to achieve a lower-than-average latency of around 400–500 milliseconds, as shown above – but at the expense of accuracy.

In contrast, final recognition, or ‘finals’, occurs when the ASR system has enough information to transcribe a complete word or phrase. It waits for a clear endpoint before providing a transcription and is powered by a bigger model that “rewrites” the script retrospectively. The delay may be slightly longer, but still provides a near-instant experience for the user.

When to use each?

Gladia API uses a hybrid approach that combines both partial and final recognition. Our system transcribes partial segments for real-time feedback and switches to final recognition when it has enough context to transcribe with high accuracy.

As a rule of thumb, we generally recommend prioritizing finals owing to greater accuracy. That said, partials can be incredibly useful for use cases where a real-time UI display is a must.
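On the client side, the hybrid approach typically means keeping a committed transcript built from finals plus a provisional “tail” that each new partial overwrites. The message shape below (`type`/`text` fields) is an assumption for illustration, not Gladia’s exact response schema:

```python
def apply_message(committed, msg):
    """Update transcript state with one ASR message.

    Finals are appended to the committed transcript; a partial only
    affects the live tail shown to the user and is replaced by the
    next partial or final for the same utterance.
    """
    if msg["type"] == "final":
        committed.append(msg["text"])
        return committed, ""                  # tail cleared
    return committed, msg["text"]             # partial: provisional tail

committed, tail = [], ""
for m in [{"type": "partial", "text": "hel"},
          {"type": "partial", "text": "hello wor"},
          {"type": "final",   "text": "Hello, world."}]:
    committed, tail = apply_message(committed, m)

display = " ".join(committed + ([tail] if tail else []))
```

The user sees the tail update in real time, then watches it “snap” to the corrected final once the bigger model rewrites it.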

Scalability and load balancing

Because the bidirectional WebSocket flow is constant, the underlying infrastructure needs to be running 100% of the time, which makes it more expensive.

To draw an analogy, audio processed via batch, or asynchronous, transcription can be compared to a ZIP file – since it’s compressed, its storage value for an API provider is significantly lower. With this kind of file, the so-called ‘real-time factor’ of execution is very small (e.g., 1/60 factor in the case of standard hour-long audio without diarization) compared to audio sourced from live streaming scenarios (where it becomes more like 1/1).
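The real-time factor arithmetic above can be made concrete. RTF is simply processing time divided by audio duration:

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: how much compute time one second of audio costs."""
    return processing_seconds / audio_seconds

# Batch: an hour of audio transcribed in one minute
batch = rtf(60, 3600)        # the ~1/60 factor mentioned above
# Live: audio must be processed as fast as it arrives
live = rtf(3600, 3600)       # the ~1/1 factor of streaming scenarios
```

The gap between those two numbers is exactly why always-on streaming infrastructure is so much costlier to operate than batch processing.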

As such, the final key challenge of providing a live transcription API consists of finding ways to ease the load on the underlying infrastructure without imposing high costs on the client. To address this, a speech-to-text provider must design an internal infrastructure capable of scaling horizontally.

At Gladia, we implement special load-balancing strategies to distribute transcription requests across multiple servers and instances to handle high volumes of audio input – without making our clients bear an unreasonable cost. 
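As one illustration of the idea (a simplified sketch, not Gladia’s actual infrastructure), a least-connections strategy suits long-lived WebSocket sessions better than plain round-robin, because it accounts for how many open streams each server is already holding:

```python
class LeastConnections:
    """Route each new session to the server with the fewest open sessions."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}   # open sessions per server

    def acquire(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1                # session opens
        return server

    def release(self, server):
        self.active[server] -= 1                # session (WebSocket) closes
```

Because transcription sessions can last minutes or hours, counting live connections spreads load far more evenly than rotating through servers blindly.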

Use cases for live transcription

Complex as it may be on the technical side, live transcription is an incredibly valuable feature that helps to gain immediate access to speaker insights and enables a delightful user experience.

Real-time transcription is especially useful in scenarios where you need to react directly to what's being said and where very low latency is required. Conversational bots are another common application, as well as real-time captions for video conferences.

Here are some specific use cases we’ve worked with at Gladia so far:

  • Virtual and on-site meetings. Documenting time-sensitive meetings without having to wait for the transcript or generating real-time captions in international meetings. 
  • Customer support and call centers. Transcribing customer inquiries and agent responses in real-time to assist customer service representatives in providing more accurate and efficient support and conducting quality assurance.
  • Healthcare. Transcription during both in-person and remote medical consultations, as well as for emergency call services, to make better use of medical personnel’s time. Can be used for medical conferences, too.
  • Finance. Providing the stakeholders with immediate access to up-to-date financial information in an industry where speed is key.
  • Media. Making use of the feature during live broadcasting and events for real-time subtitling and dubbing.

Getting started with Gladia live transcription API

To get started with live transcription, you can create a free account and consult our developer documentation for more detailed guidance on implementation.

About Gladia

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities and state-of-the-art features, including speaker diarization and word-level timestamps.

