How to summarize audio using Whisper ASR and GPT 3.5
Published on
Mar 2024

From online meetings to voice memos and media content, the amount of audio data generated by companies daily is as vast as it is valuable.

However, listening to hours of audio content to extract key information is impractical. To enhance productivity and optimize workflows, the capability to quickly generate brief summaries from audio data is crucial.

Luckily, recent advancements in the field of automatic speech recognition (ASR) and natural language processing (NLP) have become robust and accessible enough to make custom audio summarization possible for a range of projects.

In this step-by-step guide, we explain how to summarize audio using OpenAI's Whisper ASR for transcription and GPT 3.5 to generate prompt-based summaries. We will explain how the different components work, and, most importantly, show you how to build your summarization API. We will also present Gladia as an enterprise-grade alternative to the Whisper API for transcription and demonstrate how to use it.

What is speech recognition?

Speech recognition is the process of training an AI system to process spoken language in any language and convert it into readable text. Today, speech recognition systems such as Whisper ASR by OpenAI are trained using a sequence-to-sequence learning approach consisting of an encoder and a decoder block.

More specifically, a standard automatic speech recognition (ASR) system works by receiving audio input from a recorded audio file. The digital audio signal is then processed to extract the acoustic features (spectrograms) relevant to speech recognition. The resulting spectrograms are passed to an acoustic model (the encoder), which maps them to a sequence of words corresponding to the content of the input audio.

The encoder in an ASR system is usually implemented using deep learning methods such as Long Short-Term Memory (LSTM) networks or Convolutional Neural Networks (CNNs). The decoder is responsible for predicting the most probable word sequence from the encoder's output. To learn more, have a look at our recent article on how speech-to-text systems work.
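To make the pipeline concrete, here is a minimal, purely illustrative sketch of those stages in Python. The function bodies are stand-ins (real systems such as Whisper implement each stage with trained neural networks), and the shapes and names are hypothetical.

import numpy as np

def extract_spectrogram(waveform, frame_size=400):
    # Slice the waveform into frames and take the magnitude of their FFT
    frames = [waveform[i:i + frame_size]
              for i in range(0, len(waveform) - frame_size, frame_size)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def encode(spectrogram):
    # Stand-in for the acoustic encoder (LSTM/CNN/Transformer in practice)
    return spectrogram.mean(axis=1)

def decode(features):
    # Stand-in for the decoder that maps acoustic features to a word sequence
    return " ".join("<word>" for _ in features)

waveform = np.random.randn(16000)  # one second of dummy 16 kHz audio
print(decode(encode(extract_spectrogram(waveform))))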

Whisper ASR, a state-of-the-art model

Whisper ASR was introduced as an open-source model by OpenAI in 2022 and was made available as an API in 2023. The model was trained on 680,000 hours of labeled audio (audio paired with its corresponding transcriptions) and can be used for tasks such as identifying the language spoken in an audio file, transcribing an audio file into text, and translating audio into another language.

With its transformer architecture based on sequence-to-sequence learning, it is considered among the best open-source ASR engines ever built.

Gladia is powered by Whisper ASR and was designed as a more optimized and feature-rich version of the model, adapted to enterprise scale and needs.

Our API was developed to overcome the challenges and limitations of vanilla Whisper, such as hallucinations, long inference time, and usage limitations. 

In this tutorial, we will demonstrate how to use both the original Whisper API and Gladia’s enterprise version of it to transcribe audio and then summarize the transcript using GPT-3.5.

GPT 3.5 and text summarization 

Text summarization is the process of extracting and piecing together useful information from a long text to form a shorter text that retains the main discussion points or action items, while being easier to read and understand.

Types of summarization

When summarizing a text, there are two major methods involved [1]:

  1. Extractive summarization: In natural language processing, this involves identifying the most important sentences in the text, i.e. those that contribute most to the main idea, and piecing them together verbatim to form a shorter text.
  2. Abstractive summarization: This technique uses natural language understanding to capture the general idea of a text, which is then rewritten in a different, shorter way while preserving the meaning of the full text. The toy example below illustrates the difference between the two approaches.
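To make the distinction concrete, here is a toy, hand-written illustration in Python (the summaries are hard-coded for illustration, not produced by a model).

meeting_notes = (
    "The team reviewed Q3 results. Revenue grew 12 percent. "
    "Marketing spend will be reallocated to paid search. "
    "A follow-up meeting is scheduled for Friday."
)

# Extractive: reuse the most informative sentences verbatim
extractive_summary = ("Revenue grew 12 percent. "
                      "A follow-up meeting is scheduled for Friday.")

# Abstractive: rephrase the same ideas in new words
abstractive_summary = "Q3 revenue rose 12 percent; the team will meet again on Friday."

print(extractive_summary)
print(abstractive_summary)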

While a variety of LLMs are used to generate custom summaries automatically today, GPT 3.5 has emerged as one of the best tools to conduct both types of summarization.

What is GPT 3.5?

GPT 3.5 is a large language model (LLM) developed by OpenAI, trained on billions of text examples. It can perform a wide range of generation tasks, such as text generation, code generation, and text summarization. The model can also follow specific instructions: for example, if you ask it to provide an abstractive summary of a meeting recorded in text format, it will follow those instructions.

Example of a ChatGPT prompt for summarizing a meeting (screenshot)

As you can see, the model follows the instructions we fed into it and returns generated text that is faithful to those instructions. This process of giving the GPT 3.5 model specific instructions so it generates text the way you want is known as prompt engineering. It will be useful later in this tutorial when we build our application. Also, check our blog for more examples of summarization prompts for virtual meetings.
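For reference, a prompt along the lines of the one in the screenshot might look like this (the exact wording shown in the image is not reproduced here, and the transcript placeholder is ours):

prompt = """
Provide an abstractive summary of the meeting transcript below.
List the key decisions and action items in a few concise bullet points,
and do not add any information that is not in the transcript.

<paste the meeting transcript here>
"""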

Prerequisites

In this article, we will build an audio summarization API; it is assumed that you already have a foundation in the Python programming language. First, we will learn how to transcribe audio using the Whisper API and the Gladia alternative. Then we will learn how to summarize the transcript using the GPT-3.5 API, after which we will build a unified endpoint for both transcription and summarization.

You will need the following to code along with the code examples shown in this tutorial:

1. An Integrated Development Environment (IDE) that supports Python e.g. VSCode

2. Python version - 3.11

3. Python libraries (listed in the requirements.txt file)

4. OpenAI API key

5. Gladia API key

If you have already tried to implement an audio summarization system similar to the one discussed in this tutorial using Whisper, GPT 3.5, and FastAPI, one problem you might have faced is a name attribute error. It occurs when you upload an audio file through FastAPI’s UploadFile and pass it directly to the Whisper API. We will address this issue in a later section.

Note: Here's the GitHub Repository containing all the code used in this tutorial.

How to transcribe audio using Whisper ASR Webservice

OpenAI provides two ways to access the pre-trained Whisper model. The first is a free, open-source package, available directly from GitHub, that lets you load the model onto your own system. However, the computational cost of the open-source package grows with the size of the model being used, which is why this tutorial uses the paid API alternative, a more efficient solution.
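For context, using the open-source package looks roughly like this, assuming you have installed openai-whisper (and ffmpeg) locally; the model size and file name below are placeholders.

import whisper

# Load a pre-trained checkpoint; larger checkpoints are more accurate
# but need considerably more memory and compute
model = whisper.load_model("base")

# Transcribe a local audio file; the result contains the detected language,
# the individual segments, and the full text
result = model.transcribe("meeting.mp3")
print(result["text"])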

First, we must download the required packages using the pip package manager. The requirements.txt file is available on GitHub.


pip install -r requirements.txt

Once the packages are installed, you need to create an API key on OpenAI’s website. This is an essential step to access the model and be billed appropriately per usage. Once you have created the API key, assign it to a variable that can be used in the code. Note that it is best practice to store API keys in a .env file, but for the sake of this tutorial, we will use plain variables.
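If you do opt for a .env file, which the standalone functions towards the end of this tutorial rely on, a minimal sketch with python-dotenv could look like this (the variable names are placeholders):

# .env (keep this file out of version control)
# OPENAI_API_KEY=sk-...
# GLADIA_API_KEY=...

import os
from dotenv import load_dotenv, find_dotenv

# Locate and load the .env file, then read the keys from the environment
load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]
gladia_key = os.environ["GLADIA_API_KEY"]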

The next step is shown in the code sample below. We name this Python file main.py, import the necessary libraries, and define the API keys.


import io
import os
import openai
import uvicorn
from fastapi import FastAPI, File, UploadFile
from openai import Audio
# Store the API keys in variables (use a .env file in production)
openai_api_key = ''
openai.api_key = openai_api_key
gladia_key = ''

When trying to create endpoints using FastAPI, the first important step is to set up the server. To do this, you need to initialize an instance of the FastAPI class, which will be used to define the routes for each endpoint that is created.


# Initialize instance of FastAPI class
app = FastAPI()
# Initialize Audio object
audio_object = Audio()

We create an asynchronous function called whisper with the route /whisper-transcribe/ and declare a parameter file of type UploadFile. We define UploadFile as the parameter's type to inform the app that the whisper function expects a file upload.


# Define route
@app.post('/whisper-transcribe/', status_code=200)
async def whisper(file: UploadFile):
    # some code here

The next step involves reading the uploaded file into a BytesIO object. You may be wondering why we do this instead of simply using await file.read(). The transcribe method of the Audio object expects the object passed to its file parameter to have a name attribute, which the raw bytes returned by await file.read() don't provide. By reading the file into a BytesIO object, we can assign the uploaded file's filename to its name attribute.


@app.post('/whisper-transcribe/', status_code=200)
async def whisper(file: UploadFile):
    # Read the uploaded audio into BytesIO
    audio_file = io.BytesIO(await file.read())
    # Assign the filename to the name attribute of the BytesIO object
    audio_file.name = file.filename

Then, pass the model name, file, and API key as parameters to the transcribe method of the Audio object we initialized earlier, as shown in the code example below.


@app.post('/whisper-transcribe/', status_code=200)
async def whisper(file: UploadFile):
    # Read the uploaded audio into BytesIO
    audio_file = io.BytesIO(await file.read())
    # Assign the filename to the name attribute of the BytesIO object
    audio_file.name = file.filename
    # Call the transcribe method with the audio_file
    transcript = audio_object.transcribe(model='whisper-1',
                                         file=audio_file,
                                         api_key=openai_api_key)
    # Return the final transcription
    return transcript

Before testing any new endpoints we create, make sure to add the following lines at the end of the file.


if __name__ == '__main__':
    uvicorn.run("main:app", port=6760, log_level="info")

We can test the new endpoint by opening a terminal in the directory where the main.py file is located. Once the terminal is launched, run the command below:


python -m main

Then open the URL shown in the terminal and append /docs to it to open SwaggerUI, an interface for testing endpoints.

Whisper AI transcription endpoint (screenshot)

To proceed, select the /whisper-transcribe/ endpoint to reveal more information. Then, select ‘Try it out’. This will reveal an option to upload a file: in this tutorial, an audio recording.

Transcription for audio input on the Whisper endpoint (screenshot)

From the response body, we can see that the Whisper API accurately transcribes the audio file we uploaded to test the endpoint.
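If you prefer testing from code rather than SwaggerUI, a small client script using the requests library can call the same endpoint. The port matches the uvicorn configuration above, while the file name and content type are placeholders.

import requests

with open("meeting.mp3", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:6760/whisper-transcribe/",
        files={"file": ("meeting.mp3", f, "audio/mpeg")},
    )
# Print the transcription returned by the endpoint
print(response.json())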

How to transcribe audio using Gladia API

As an alternative to using open-source Whisper, Gladia provides a plug-and-play, production-ready version of Whisper ASR, which can be called with nothing more than the requests library in Python.

Here’s a code sample showing how to use the Gladia API for transcription. The requests library is imported in the main.py file to make calls to Gladia’s API.


import os
import requests # the new import
from fastapi import FastAPI, UploadFile
from openai import Audio

Next, we define a route named gladia using FastAPI and assign it to an asynchronous function defined with a parameter of type UploadFile. When using our API, you can either pass in an audio file or a URL to where the file is located. 

In this tutorial, we will be passing in an audio file. To do this, our API expects three parameters: the name of the audio file, the audio file itself, and its content type. The UploadFile type provides access to the name using the .filename attribute and the content type using the .content_type attribute.


@app.post('/gladia/')
async def gladia(file: UploadFile):
    # read the uploaded file
    audio_file = await file.read()
    # set the filename
    filename = file.filename
    # set the content type
    content_type = file.content_type

Subsequently, we define a headers parameter containing the API key, and a files parameter built from the three attributes: filename, file, and content type.


@app.post('/gladia/', status_code=200)
async def gladia(file: UploadFile):
    # read the uploaded file
    audio_file = await file.read()
    # set the filename
    filename = file.filename
    # set the content type
    content_type = file.content_type
    # Define the API key as a header
    headers = {
        'x-gladia-key': f'{gladia_key}'
    }
    # Declare filename, file, and content type as the file payload for the API
    files = {'audio': (filename, audio_file, content_type)}

Once everything needed to make a request is defined, we can send it and receive a JSON response. To retrieve the full transcription, we loop through the prediction entries of the response dictionary and join the transcribed sentences.


@app.post('/gladia/')
async def gladia(file: UploadFile):
    # see code above
    # Pass the headers and files and send the request
    response = requests.post('https://api.gladia.io/audio/text/audio-transcription/', headers=headers, files=files)
    # Parse the JSON response and retrieve each transcribed sentence from the prediction list
    sentences = [item["transcription"] for item in response.json()["prediction"]]
    # Join the sentences together to form the full transcript
    transcript = " ".join(sentences)
    return transcript

To test the endpoint, simply run python -m main in your opened terminal.

The procedure for obtaining a response has been detailed above in ‘How to transcribe audio using Whisper ASR Webservice’.

Transcription for audio input on Gladia's endpoint (screenshot)
In contrast to Whisper's API, which charges users for each request, Gladia offers a free plan with 10 hours of transcription per month and no restrictions on the many features we offer, such as batch transcription, speaker diarization, word-level timestamps, and live transcription.

To learn more about our free, pay-as-you-go, and enterprise plans, please visit our pricing page.

Whisper ASR summarization using GPT 3.5

For the audio summarization system developed in this tutorial, we will be making use of OpenAI’s GPT 3.5 API for summarization.

To use the GPT 3.5 API to summarize the transcripts produced by Whisper ASR, we create an endpoint named /summarize/ and define a Python function with a string parameter, transcript, that expects an audio transcript.

Earlier in the tutorial, we discussed prompt engineering and how it can be used to instruct the model on a desired output for a corresponding input text. In this code, we give the AI several instructions to summarize the transcript while accounting for and correcting mistakes in the transcript. We also made sure to instruct the model to avoid adding unnecessary information to its generated text.

Next, we initialize a method of the ChatCompletion class from the openai package. In this method, we define the GPT 3.5 model in the model parameter and pass a list containing a dictionary with the keys: role and content. The role is set to system to let the model know that the content being provided is a system-level instruction.

According to OpenAI, ‘a system level instruction is used to guide your model's behavior throughout the conversation’. Learn more about the available parameters here.


@app.post('/summarize/', status_code=200)
def summarize_gpt(transcript: str):
    prompt = f"""
    
    You are an AI agent given the sole task of summarizing an audio transcript which can either be of poor or good quality. The transcript generated from the audio file is given below.
    
    {transcript}.
    
    If the transcript is of poor quality or some words have been poorly transcribed, make sure to guess what the word is supposed to be and return a concise summary that contains all the important information from the transcript.
    
    Make sure that you only provide a summary of the conversation and nothing else. Don't add any additional words that aren't part of the summary.
    
    """
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo-0301',
        messages=[
            {"role": "system",
             "content": prompt}
            ],
        temperature = 0.0
        )
    return response.choices[0].message.content   

Here, we use the same meeting text used in the GPT 3.5 example above.

Summary of a meeting using the GPT 3.5 API (screenshot)

How to create an automatic workflow of the system using FastAPI

An automatic system is defined here as one that allows audio upload through a FastAPI endpoint, passes the audio to Whisper for transcription, and immediately passes the transcription to GPT 3.5 for summarization, with the summary returned to the end user.

To do this, you will make small adjustments to the code in the whisper.py, gladia.py, and summarize.py files: the endpoints in each of the files are converted into standalone Python functions.

File structure

The first step before making the adjustments involves moving all 3 files into a utils folder, while also creating a main.py file outside the utils folder.

Your file structure should look like this when you're done:

- whisper-summarizer
  - utils
    - whisper.py
    - gladia.py
    - summarize.py
  - main.py
  - requirements.txt
The main folder, whisper-summarizer, should contain all subfolders and files. The requirements.txt file should look like the image below. The purpose of this file is to specify all the libraries or packages that need to be installed when working in a virtual local or cloud environment.

Requirements file
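In case the image doesn't load: the file simply lists the project's dependencies. A plausible version is shown below (the exact packages and version pins may differ from the ones in the repository).

fastapi
uvicorn
python-multipart
openai
requests
python-dotenv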

To make our app fully functional, we will add standalone functions for each of the services being used, i.e. Gladia, Whisper, and GPT-3.5.

Add a standalone function for Gladia to the main.py file

def gladia_transcribe(audio_file, filename, content_type):
    # Load environment variables from .env
    load_dotenv(find_dotenv(raise_error_if_not_found=True))
    # Access the API key
    gladia_key = os.environ["AUDIO_SUMMARY_GLADIA"]
    # Define the API key as a header
    headers = {
        'x-gladia-key': f'{gladia_key}'
    }
    # Declare filename, file, and content type as the file payload
    files = {'audio': (filename, audio_file, content_type)}
    # Send the request and return the parsed JSON response
    response = requests.post('https://api.gladia.io/audio/text/audio-transcription/', headers=headers, files=files)
    return response.json()
Add a standalone function for Whisper

def whisper_transcribe(audio_file):
    # Initialize Audio object
    audio_object = Audio()
    # Call the transcribe method with the audio_file
    # (openai_api_key is assumed to be defined earlier, e.g. loaded from .env)
    transcript = audio_object.transcribe(model='whisper-1',
                                         file=audio_file,
                                         api_key=openai_api_key)
    # Return the final transcription
    return transcript
Add a standalone function for GPT-3.5

def summarize_gpt(transcript: str):
   
    prompt = f"""
    You are an AI agent given the sole task of summarizing an audio transcript which can either be of poor or good quality. The transcript generated from the audio file is given below.
   
    {transcript}.
   
    If the transcript is of poor quality or some words have been poorly transcribed, make sure to guess what the word is supposed to be and return a concise summary that contains all the important information from the transcript.
   
    Make sure that you only provide a summary of the conversation and nothing else. Don't add any additional words that aren't part of the summary.
   
    """
   
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo-0301',
        messages=[
            {"role": "system",
             "content": prompt}
            ],
        temperature = 0.0
        )
   
    return response.choices[0].message.content

Creating a single endpoint for an automatic workflow

The code samples above show how to convert the endpoints in the three files into standalone functions by removing the lines of code that deal with uploading and reading a file. This was done so we can create a unified endpoint that runs both the transcriber and the summarizer and handles all forms of data input, such as an audio upload or a string.

To proceed with creating a unified endpoint, we will be updating the main.py file. As you have learned so far, we need to import the libraries we will be making use of, as seen below.


import io
import os
import openai
import requests
import uvicorn
from dotenv import load_dotenv, find_dotenv
from fastapi import FastAPI, File, UploadFile
from fastapi.middleware.cors import CORSMiddleware
from openai import Audio

The next step involves creating the server. A description is added to tell users about the purpose of the API, and we have titled the API Whisper ASR Summarizer. We have also added middleware using CORS to allow cross-origin communication between the Whisper ASR API and a frontend server.

The middleware accepts several parameters:

1. Origins, which is used to specify any protocols and ports that can communicate with the Whisper ASR API. Here we set it to “*”, which means all origins are allowed to communicate with the backend.

2. Method, which is used to specify the HTTP methods that are allowed for communication with the backend. Here we set it to ‘POST’ and ‘GET’.

3. Header, which is used to specify the HTTP headers that are allowed for communication with the backend. Here we set it to Content-Type since we’ll be dealing with file uploads.


description = """
The Whisper ASR Summarizer is an API that allows you to upload audio files
and automatically provides you with a summarized version of the audio in text format.


"""
# Creating the server
app = FastAPI(
    title='Whisper ASR Summarizer',
    description=description,
    summary='Summarize audio'
)


# Initializing parameters for middleware
origins = ["*"]
methods = ["POST", "GET"]
headers = ["Content-Type"]


app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=methods,
    allow_headers=headers
)

To transcribe using the Whisper ASR API, we first create an endpoint named /upload-audio-whisper/ and an asynchronous function. We then specify the file parameter to be of type UploadFile and add a description to explain what the parameter expects.

Next, we read the uploaded file, set its name with BytesIO, and pass it into the whisper_transcribe function to obtain a transcription. The transcript is then passed into the summarize_gpt function to obtain a summary, and both the transcript and summary are returned.


@app.post('/upload-audio-whisper/')
async def transcribe_summarize_whisper(file: UploadFile):
    # Read the uploaded audio into BytesIO and assign its filename
    audio_file = io.BytesIO(await file.read())
    audio_file.name = file.filename
    # Call the standalone function to transcribe using Whisper;
    # the Whisper response exposes the transcription under its 'text' field
    transcript = whisper_transcribe(audio_file)['text']
    # Call the standalone function to summarize the transcript using GPT-3.5
    summary = summarize_gpt(transcript)
    return {
            'Transcript': transcript,
            'Summary': summary
    }

The steps for transcribing using Gladia are the same as above. First, we read the file. Then, we pass the uploaded file into the Gladia function. Next, we parse the output to retrieve the full sentences. Finally, we pass the resulting transcript into the GPT function to get a summary.


@app.post('/upload-audio-gladia/')
async def transcribe_summarize_gladia(file: UploadFile):
    # Read the uploaded file
    audio_file = await file.read()
    # Call the standalone function to transcribe using Gladia
    transcript = gladia_transcribe(audio_file, file.filename, file.content_type)
    # Combine all transcribed sentences into a single transcript
    sentences = [item["transcription"] for item in transcript["prediction"]]
    transcript = " ".join(sentences)
    # Summarize the transcript using GPT-3.5
    summary = summarize_gpt(transcript)
    return {
            'Transcript': transcript,
            'Summary': summary
    }

You can view the full code here.

If you wish to deploy your web service, here is a tutorial on how to do so using Render.

Conclusion

In this tutorial, we have shown how Gladia and Whisper can be used to generate audio transcriptions, with GPT-3.5 then used to summarize them, and how to leverage FastAPI to build an API for audio summarization.

To learn about the many features Gladia provides for audio transcription, visit our developer documentation.

Footnotes

[1] Dutta, S., Das, A. K., Ghosh, S., & Samanta, D. (2022). Data analytics for social microblogging platforms. Elsevier.


About Gladia

At Gladia, we built an optimized version of Whisper in the form of an enterprise-grade API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities and state-of-the-art features.

Contact us

