Transcribing long audios with Whisper using Python and Gladia API

Published on Dec 8, 2023

The Whisper ASR model released by OpenAI is great for transcribing audio files, but it doesn’t come without challenges. In addition to high computational requirements and costs, Whisper comes with a limit of 25 MB and 30 seconds in duration on input audio files, so larger audio files usually have to be split into chunks before they can be transcribed.

Chunking is not only impractical and time-consuming but also reduces the quality of the resulting transcription, which is a serious inconvenience for enterprise-grade projects. In this article, we explore the Gladia speech-to-text API, powered by an optimized, hallucination-free version of Whisper, as a production-grade alternative to the original model.

Whisper ASR limitations: Long audio

Released in open access in 2022, Whisper ASR was a truly remarkable achievement in the field of automatic speech recognition, which set a new standard for accuracy and multilingual capabilities. While it remains perfectly suitable for indie projects and academic research, the open-source model comes with a number of limitations that make it challenging to use at scale for ever-growing enterprise needs and applications.

Take Whisper’s input requirements. Going through OpenAI’s FAQ, we see the community raising issues with the audio upload size limit. One user complained about receiving a size limit exceeded error despite uploading an audio file smaller than 25 MB, while another stated that “Currently, the Whisper model only supports video files that are up to 30 seconds long [..].” The Whisper API also limits the number of requests per minute.
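To make the chunking overhead concrete, here is a back-of-the-envelope sketch in Python. It assumes the 25 MB and 30-second figures quoted above; the function names are illustrative and are not part of any Whisper or Gladia library.

```python
import math

WHISPER_MAX_MB = 25          # upload size limit quoted above
WHISPER_MAX_SECONDS = 30     # duration limit quoted above


def chunks_needed_by_size(file_size_mb: float, limit_mb: float = WHISPER_MAX_MB) -> int:
    """Minimum number of pieces so that every piece fits under the size limit."""
    return max(1, math.ceil(file_size_mb / limit_mb))


def chunks_needed_by_duration(duration_seconds: float,
                              limit_seconds: float = WHISPER_MAX_SECONDS) -> int:
    """Minimum number of pieces so that every piece fits under the duration limit."""
    return max(1, math.ceil(duration_seconds / limit_seconds))


if __name__ == "__main__":
    # A one-hour, 60 MB recording
    print(chunks_needed_by_size(60))        # 3
    print(chunks_needed_by_duration(3600))  # 120
```

Every chunk boundary is a place where a word or sentence can be cut in half, which is one reason chunked transcripts tend to lose quality.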

How does Gladia address Whisper’s shortcomings?

Gladia provides an optimized version of OpenAI's Whisper that solves the key limitations of the original model. Gladia’s hybrid architecture uses an ensemble of machine learning models to optimize each step of the transcription process, which helps to eliminate Whisper’s hallucinations and results in a more accurate and reliable transcription service. Additionally, Gladia offers several useful features not available with the original model, such as real-time transcription, speaker diarization, and code-switching.

We have, of course, also addressed the Whisper model’s file size limitation. With our API, enterprise users can upload audio files up to 500 MB in size and up to 135 minutes long, extendable on demand. This removes the need to pre-process input files manually, so a company can transcribe multiple large audio or video files in any format without extra work.
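A quick pre-flight check against these limits could look like the following sketch. The 500 MB and 135-minute values come from the paragraph above; the helper names are hypothetical, the duration has to be supplied by the caller, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
import os

GLADIA_MAX_BYTES = 500 * 1024 * 1024  # 500 MB (assumes 1 MB = 1024 * 1024 bytes)
GLADIA_MAX_MINUTES = 135              # maximum duration stated above


def within_limits(size_bytes: int, duration_minutes: float) -> bool:
    """Return True when both the size and the duration are inside the stated limits."""
    return size_bytes <= GLADIA_MAX_BYTES and duration_minutes <= GLADIA_MAX_MINUTES


def file_within_limits(path: str, duration_minutes: float) -> bool:
    """Same check, reading the size of a file on disk."""
    return within_limits(os.path.getsize(path), duration_minutes)
```

For example, within_limits(60 * 1024 * 1024, 60) describes a one-hour, 60 MB file and passes the check.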

Unlike Whisper, our API can also process URLs and callbacks. We provide webhooks and support SRT and VTT output formats optimized for media captions, too. In short, you don’t have to worry about formats, sizes, and other input parameters – we take care of everything.

Overview and prerequisites

This tutorial is intended for developers who want to transcribe audio or video files of any size using Gladia's API. To follow along, you will need:

1. A strong understanding of the Python programming language

2. An IDE that supports Python, e.g., VS Code or PyCharm

3. Python 3.11 installed on your computer

4. A Gladia API key

Please note that this tutorial keeps API handling as simple as possible. In a production environment, follow best practices and store your API key as an environment variable, which helps keep the key secure and prevents unauthorized access. Gladia also supports JavaScript and PHP.
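One possible way to read the key from the environment is sketched below. The variable name GLADIA_API_KEY and the helper load_gladia_key are assumptions made for this example, not names required by Gladia.

```python
import os


def load_gladia_key(env_var: str = "GLADIA_API_KEY") -> str:
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Gladia API key")
    return key
```

After setting the variable in your shell, the returned value can be assigned to gladia_key in main.py instead of hard-coding the key.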

The full code used in this tutorial is located in this GitHub repository.

Setting up Gladia API

Features of Gladia API

The features Gladia API provides are as follows:

1. Real-time transcription: This feature utilizes webhooks to receive audio streams in real-time and then automatically returns an audio transcription. This helps businesses to easily take notes of what is being said during meetings and beyond. 

2. Speaker diarization: Audio recordings can feature one or more speakers, and this necessitates identifying and separating the speakers during transcription. Gladia’s API achieves speaker diarization easily through our proprietary diarization mechanism, which delivers state-of-the-art performance.

3. Word-level timestamps: Our API can also give each word in the resulting transcript an accurate timestamp, which is useful when editing videos and adding subtitles.

4. Translation: With our API, you can easily receive your transcripts in any language you desire by simply setting a desired output language. Gladia API supports translation from any-to-any of the 99 supported languages with exceptional accuracy and lower word error rates in most of them.

5. Code-switching: Our API can handle difficult situations where speakers in an audio recording switch between two or more languages, and still provide accurate transcripts.
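To illustrate why timestamps (item 3) matter for subtitles, here is a small sketch that turns a start time, an end time, and some text into an SRT caption cue. The input shape is hypothetical and is not Gladia's response schema; only the SRT timestamp layout (HH:MM:SS,mmm) is standard.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, e.g. 01:01:01,500."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT cue from (hypothetical) start/end times in seconds."""
    return f"{index}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n"


if __name__ == "__main__":
    print(to_srt_cue(1, 0, 1.25, "Hello"))
```

In practice the API can return SRT or VTT directly, so a conversion like this is only needed when building captions from raw timestamps yourself.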

Registration and obtaining API credentials

The first step in this tutorial is to get your own Gladia API key, and you can do this by following the steps below.

1. Create an account at https://app.gladia.io

2. Once inside the Playground, go to Account

3. Select the API Keys header, add a short description, and generate an API key

Screenshot of Gladia's playground: generating an API key

Transcription using Gladia

In this tutorial, we will guide you through transcribing a one-hour audio file with two speakers and a file size of 60 MB.

To begin, create a Python file. The name of the file can be anything you want, but in this tutorial, we will name it "main.py." Import the packages that we will be using, which are the os package for interacting with the operating system and the requests library for making requests to the Gladia API.


import requests
import os
# Set your API key here. Note that you can use environment variables
# for better security and to avoid exposing your API key to the public
gladia_key = ''

Next, we define a Python function named audio_transcription with a parameter filepath of type string, which expects the path to the audio file to be transcribed. Inside this function, we also define a headers dictionary that holds the API key you set above.


def audio_transcription(filepath: str):
    # Define API key as a header
    headers = {'x-gladia-key': gladia_key}

In the following line of code, we use the splitext function from os.path to split the input filepath into a filename and a file extension. We do this because the API request needs the filename, the audio file itself, and the content type, and the content type is derived from the file extension.


    # Split the filename and extension
    filename, file_ext = os.path.splitext(filepath)
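As an aside, the sketch below shows what os.path.splitext returns for the path used in this tutorial, along with one alternative way to derive the content type using Python's standard mimetypes module. The helper describe_audio_path is hypothetical and is not part of main.py.

```python
import mimetypes
import os


def describe_audio_path(filepath: str) -> tuple[str, str, str]:
    """Return (path without extension, extension, content type) for a media file."""
    stem, ext = os.path.splitext(filepath)
    # Prefer the registered MIME type; fall back to the tutorial's 'audio/<ext>' form
    guessed, _ = mimetypes.guess_type(filepath)
    content_type = guessed or f"audio/{ext[1:]}"
    return stem, ext, content_type


if __name__ == "__main__":
    stem, ext, content_type = describe_audio_path("./podcast.mp3")
    print(stem)  # ./podcast
    print(ext)   # .mp3
    print(content_type)
```

Note that the extension returned by splitext includes the leading dot, which is why the tutorial code slices it off with file_ext[1:].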

To prepare the data for the API request, we define a dictionary with several keys. The audio key specifies the audio file and its metadata: the filename, the open file object, and the content type.

Because our audio file contains two speakers, we set the toggle_diarization key to True and diarization_max_speakers to 2, which forces the model to recognize no more than two speakers in the audio file.

We also set the output format to 'txt' so that the API response contains the full transcription as a single block of text.


    with open(filepath, 'rb') as audio:
        # Prepare data for API request
        files = {
            'audio': (filename, audio, f'audio/{file_ext[1:]}'),  # Specify audio file type
            'toggle_diarization': (None, True),  # Toggle diarization option
            'diarization_max_speakers': (None, 2),  # Set the maximum number of speakers for diarization
            'output_format': (None, 'txt')  # Specify output format as text
        }

        print('Sending request to Gladia API')

        # Make a POST request to Gladia API
        response = requests.post('https://api.gladia.io/audio/text/audio-transcription/', headers=headers, files=files)

        if response.status_code == 200:
            # If the request is successful, parse the JSON response
            response = response.json()

            # Extract the transcription from the response
            prediction = response['prediction']

            # Write the transcription to a text file
            with open('transcription.txt', 'w') as f:
                f.write(prediction)

            return response

        else:
            # If the request fails, print an error message and return the JSON response
            print('Request failed')
            return response.json()
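If you want the saving step to be a little more defensive, one option is to move it into a small helper like the sketch below. The helper save_prediction is hypothetical; it only assumes, as in the code above, that a successful response carries the transcript as a string under the 'prediction' key.

```python
from pathlib import Path


def save_prediction(payload: dict, out_path: str = "transcription.txt") -> bool:
    """Write payload['prediction'] to out_path; return False if it is missing or not text."""
    prediction = payload.get("prediction")
    if not isinstance(prediction, str):
        return False
    # An explicit encoding avoids platform-dependent defaults for non-ASCII transcripts
    Path(out_path).write_text(prediction, encoding="utf-8")
    return True
```

Returning a boolean instead of raising keeps the calling code short, at the cost of hiding why the save was skipped.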

Finally, we call the audio_transcription function with the path to the audio file we want to transcribe. The audio file can be downloaded here.

As the code above shows, the function also saves the full transcription to a text file named transcription.txt.


audio_transcription('./podcast.mp3')

After running the code that saves the transcription in a text file, we can observe that Gladia can accurately identify the speakers in the audio file as well as provide accurate transcriptions without the OpenAI Whisper hallucination problem. You can view the full transcription here.

Transcript of Lex Fridman's podcast generated by Gladia

Although an MP3 audio file was used in this tutorial, Gladia accepts a variety of other media formats, as well as URLs pointing to an audio or video file.
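A request that points at a URL instead of uploading a file might be prepared along the lines of the sketch below. The 'audio_url' field name is an assumption that should be checked against Gladia's API reference; the other fields are the ones used earlier in this tutorial, and build_url_fields is a hypothetical helper.

```python
def build_url_fields(audio_url: str, max_speakers: int = 2) -> dict:
    """Build multipart form fields for a URL-based request.

    NOTE: 'audio_url' is an ASSUMED field name; verify it in Gladia's API reference.
    """
    return {
        'audio_url': (None, audio_url),
        'toggle_diarization': (None, True),
        'diarization_max_speakers': (None, max_speakers),
        'output_format': (None, 'txt'),
    }
```

The returned dictionary would be passed as the files argument of requests.post, in place of the dictionary that contains the open file object.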

Conclusion

The original Whisper model from OpenAI requires splitting audio larger than 25 MB into chunks, which often results in lower-quality transcriptions. At Gladia, we have optimized the Whisper model with newer features while increasing the audio file limit to 500 MB for a more seamless experience. Our latest model, Whisper-Zero, addresses usage limitations, improves accuracy across languages, and more.

This tutorial has demonstrated how to transcribe long audio files using the Gladia API. We generated an API key, defined the features we wanted the model to use, made a request to the API, and saved the transcript to a text file. If the generated transcript is too long to read, please refer to this tutorial, which teaches you how to summarize audio files using Whisper ASR and GPT-3.5.

Feel free to experiment with the other features available in our API and customize the main.py file to suit your needs.
