Building a song transcription system with profanity filter using Whisper, GPT 3.5 and Spleeter

Published on
Mar 2024

The inception of music streaming gained initial popularity in 1999 with the founding of Napster, one of the pioneering streaming platforms. Millions of songs were available to listen to and download for free through the platform using the internet. One no longer needed to buy pre-recorded tapes, go to live shows, or tune into radio stations to listen to music.

Over the years, numerous other streaming platforms emerged, achieving success in the wake of Napster, with notable examples including AccuRadio in 2000, Spotify in 2006, Deezer, and Amazon Music in 2007. These platforms focus on improving the listener's experience with prominent features such as high-fidelity sounds, music recommendations, playlist sharing, searching and discovering, etc.

This tutorial aims to build an end-to-end song transcription system that: a) takes in a musical audio file, b) separates the vocals from the audio, c) transcribes the vocals into text, and d) identifies whether the resulting transcription includes profanities or is free from them. The aim of the system is not only to detect explicit profanities but also to analyze the broader context of the lyrical composition to identify any implicit suggestions we intend to flag.

Overview and prerequisites

Before following along with this tutorial, you need the following:

1. An understanding of the Python programming language

2. An IDE that supports Python, e.g. VSCode

3. Python 3.10 installed on your system

4. An OpenAI API KEY

5. A Gladia API KEY

6. A system with 32GB of RAM (recommended), or a system with a minimum of 32GB of VRAM available (this comes in handy when splitting the audio file using Spleeter).

Building the song transcription system with profanity filter

What is a profanity filter?

A profanity filter is a system or type of software designed to analyze content of any form, e.g. text, audio, and video, for the presence of inappropriate language or content that may be considered offensive.

Many songs are considered generally enjoyable to listen to despite, or even thanks to, featuring lyrical compositions that reference themes such as sex, gun violence, drug abuse, and the use of expletives. However, profanities in songs may not be appropriate for certain groups of listeners and contexts — hence the need for an effective filtering system. In this tutorial, we use a basic form of a profanity filter, which detects the presence of offensive speech (including the type of profanities). Note that more advanced filters can be deployed to flag, replace, or even remove such words. 
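To make the flag-or-replace idea concrete, here is a toy wordlist-based filter. The wordlist and the `censor` helper are purely illustrative (our system below relies on GPT 3.5 for context-aware detection instead):

```python
import re

# Illustrative wordlist; a real filter would use a curated list of offensive terms.
WORDLIST = {'badword', 'expletive'}

def censor(text: str) -> str:
    """Replace any wordlist match with asterisks of the same length."""
    pattern = re.compile(r'\b(' + '|'.join(WORDLIST) + r')\b', re.IGNORECASE)
    return pattern.sub(lambda m: '*' * len(m.group()), text)

print(censor('That badword was loud'))  # That ******* was loud
```

Note that a plain wordlist cannot catch implicit references, which is exactly why we turn to an LLM later in this tutorial.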

Key difficulties of building a song transcription system with profanity filters

Building a song transcription system, especially if it features profanity filters, comes with some technical challenges. The quality of transcription is affected by the overall quality and type of audio used, which in turn may hinder the filtering ability of the system as a whole. Here are the difficulties we encountered and took into account when designing our song transcription app.

  1. Sound effects: The problem with sound effects such as autotune, reverb, delay, and distortion is that they introduce artifacts that can mask the acoustic cues on which speech recognition models rely during transcription. They can also reduce the clarity of the vocals, obscuring vocal information important to the transcription.
  2. Background noise: While testing the Spleeter package to separate the vocals from the instrumentals, we discovered that the resulting vocals file sometimes still contained background noise (usually from the instrumentals), which could hamper the transcription process and lead to a lower-quality transcription. Tackling this problem was straightforward, as the Gladia API provides a ‘toggle_noise_reduction’ parameter to perform noise reduction on the file and enhance the clarity of the vocals.

Step 1: Separate vocals from instrumentals using Spleeter

A sound can be defined as a wave that propagates through various media, including solids, liquids, and gases. It is produced by a vibrating object and can be transformed into meaningful information through Automatic Speech Recognition (ASR).

An ASR system receives the audio file and divides it into multiple short segments known as frames. It then converts the waveform of the sound into numerical representations, which are further analyzed using statistical models.
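To illustrate what those numerical representations look like, here is a minimal sketch that frames a signal and computes a magnitude spectrogram with NumPy. The frame and hop sizes are arbitrary choices for demonstration, not values used by Whisper or Spleeter:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_size=1024, hop=512):
    """Split a 1-D signal into overlapping frames and take the FFT magnitude of each."""
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, hop)]
    window = np.hanning(frame_size)
    # rows: time frames, columns: frequency bins
    return np.abs(np.fft.rfft(np.array(frames) * window, axis=1))

# one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (30, 513)
```

Each row of `spec` describes which frequencies are present during one slice of time; this time-frequency grid is the spectrogram a separation model analyzes.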

The Spleeter package is powered by a neural network that analyzes a numerical representation of the frequencies present in the audio (audio spectrogram), after which it separates it into different components such as vocals, drums, bass, and other instrumentals. 

It supports different types of separation such as:

i. 2 stems - separates the file into vocals and other instrumentals

ii. 4 stems - separates the file into vocals, drums, bass and other instrumentals

iii. 5 stems - separates the file into vocals, drums, bass, piano and other instrumentals
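Each option corresponds to a configuration string passed to Spleeter's Separator. A quick reference mapping (the helper function here is our own, not part of Spleeter; the ‘-16kHz’ variants perform separation up to 16 kHz instead of the default 11 kHz):

```python
# Spleeter configuration strings for each separation mode
STEM_CONFIGS = {
    2: 'spleeter:2stems',   # vocals / accompaniment
    4: 'spleeter:4stems',   # vocals / drums / bass / other
    5: 'spleeter:5stems',   # vocals / drums / bass / piano / other
}

def descriptor(stems: int, high_freq: bool = False) -> str:
    # Build the params_descriptor string, optionally using the 16 kHz model variant
    return STEM_CONFIGS[stems] + ('-16kHz' if high_freq else '')

print(descriptor(4, high_freq=True))  # spleeter:4stems-16kHz
```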

1. Extracting the vocals using Spleeter

The first step in this tutorial is to import all the necessary packages to be used for the proper functioning of our code.

import os
import requests
import warnings
from openai import OpenAI
from spleeter.separator import Separator

To extract the vocals from an audio file, we define a function ‘extract_vocals’ with two parameters path_to_file and path_to_output. The ‘path_to_file’ parameter expects the file path of the audio file we want to split while the ‘path_to_output’ expects the file path we want to save the resulting audio files after passing it through Spleeter.

If your system is GPU-enabled and has at least 16GB of GPU RAM, feel free to comment out the os.environ line. That line of code tells your system to ignore the GPU device, forcing the Spleeter library to use the CPU.

The next step involves the declaration of the Separator class instance. Here we pass the string ‘spleeter:4stems-16kHz’ to the params_descriptor argument to tell the Spleeter module how we want our music file to be separated. The string denotes the separation of the music file into 4 components at 16 kHz audio quality. We set the multiprocess argument to False because Spleeter’s multiprocessing module is not fully supported on Windows and can end up in a permanent loop.

def extract_vocals(path_to_file: str, path_to_output: str) -> str:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # comment out to use the GPU
    separator = Separator(params_descriptor='spleeter:4stems-16kHz', multiprocess=False)

In the next step, we call a method ‘separate_to_file’ from the separator instance and we pass the file paths to it. This line of code separates the audio file into different components depending on the number of stems you pass to the ‘params_descriptor’ parameter in the Separator object.

def extract_vocals(path_to_file: str, path_to_output: str) -> str:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # comment out to use the GPU
    separator = Separator(params_descriptor='spleeter:4stems-16kHz', multiprocess=False)
    separator.separate_to_file(path_to_file, path_to_output)

This creates a new file with the name of the song in the file path you passed to the ‘path_to_output’ parameter.

Folder created by Spleeter

Upon opening the folder, you will discover four distinct file components created by Spleeter. For the sake of this tutorial, we will only need the ‘vocals.wav’ file which contains the song's vocals without the instrumentals.

Resulting components in the Spleeter file

2. Selecting the vocals file

In this next step, we add a few more lines of code to retrieve the name of the song and combine it with the output path to get the path of the folder that holds the four components. Next, we create a loop to run through the ‘wav’ files in the path and remove all other files except for the vocals.wav file.

def extract_vocals(path_to_file: str, path_to_output: str) -> str:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # comment out to use the GPU
    separator = Separator(params_descriptor='spleeter:4stems-16kHz', multiprocess=False)
    separator.separate_to_file(path_to_file, path_to_output)
    # Retrieve the song name from the file path
    filename = path_to_file.split('/')[-1].split('.')[0]
    # Combine output filepath and filename
    wav_path = os.path.join(path_to_output, filename)
    # Remove all other splits, leaving only the vocals in the directory
    for i in os.listdir(wav_path):
        if i != 'vocals.wav':
            os.remove(os.path.join(wav_path, i))
    # Return the music filename
    return filename
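As a side note, the filename line above splits the path on ‘/’ manually. A sketch of a more portable equivalent using os.path, which also handles Windows-style paths (the path shown is hypothetical):

```python
import os

# Portable alternative to path_to_file.split('/')[-1].split('.')[0]
path_to_file = 'C:/path/other path/song.mp3'  # hypothetical example path
filename = os.path.splitext(os.path.basename(path_to_file))[0]
print(filename)  # song
```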

Step 2: Transcribe the vocals using Gladia Whisper API

Whisper is an automatic speech recognition (ASR) model created by OpenAI in 2022. It receives an audio file as input and produces a transcription of the words spoken in the audio. Gladia is a speech-to-text API that builds upon the technology of Whisper ASR to solve several challenges one might encounter when utilizing it. Gladia provides features such as real-time transcription, speaker diarization, translation, and summarization of the transcription. Gladia also has an advantage over the original Whisper ASR model when transcribing long audio files, allowing an upload of up to 500MB and up to 135 minutes of audio.

Using Gladia’s API to transcribe the ‘vocals.wav’ file is easy. First, you need to define the headers for the API call in a dictionary. The ‘accept’ header is set to ‘application/json’ to request a JSON response, and you also pass your Gladia API key in the ‘x-gladia-key’ header. (Since the audio is sent as a multipart upload, the requests library sets the ‘Content-Type’ header for you.)

def get_lyrics(filename: str, audio_file):
    headers = {
        'accept': 'application/json',
        'x-gladia-key': f'{gladia_key}'
    }
Next, you will need to open the audio_file in ‘read binary’ mode. We use the splitext method from ‘os.path’ to get the file extension, which we use to build the mime type passed into the ‘audio’ key. The audio key expects the following parameters: the filename, the file object itself, and the mime type of the audio file.
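For instance, the extension-to-MIME-type mapping works like this (a quick standalone check; the path is a hypothetical example):

```python
import os

audio_file = 'output/song/vocals.wav'  # hypothetical path to the extracted vocals
_, ext = os.path.splitext(audio_file)
mime_type = f'audio/{ext[1:]}'  # strip the leading dot: '.wav' -> 'audio/wav'
print(ext, mime_type)  # .wav audio/wav
```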

Although optional, we also set the ‘toggle_noise_reduction’ key to true to filter out any residual noise left over in the ‘vocals.wav’ file to enable Gladia to provide a more accurate transcription.

def get_lyrics(filename: str, audio_file):
    headers = {
        'accept': 'application/json',
        'x-gladia-key': f'{gladia_key}'
    }
    with open(audio_file, 'rb') as f:
        _, ext = os.path.splitext(audio_file)
        files = {
            'audio': (filename, f, f'audio/{ext[1:]}'),
            'toggle_noise_reduction': (None, 'true')
        }
        # Pass the headers and files and send as a POST request
        # (endpoint per Gladia's v1 API documentation)
        response = requests.post(
            'https://api.gladia.io/audio/text/audio-transcription/',
            headers=headers,
            files=files
        )
        # Convert the response to JSON
        response = response.json()
        # Retrieve each transcribed sentence from the resulting dictionary
        sentences = [item["transcription"] for item in response["prediction"]]
        # Join them together to form the full transcript
        transcript = " ".join(sentences)
    return transcript

When we transcribe the ‘vocals.wav’ file extracted from the song ‘Many Men’ by 50 Cent, the screenshot below shows the resulting transcript of the lyrics. From this, we can deduce that the Gladia API can transcribe the vocals with good accuracy.

Transcript generated by Gladia. Image blurred manually.

As a side note, Gladia Whisper API has several use cases for your business needs or personal use. For example, we can use Gladia’s Whisper API for captioning YouTube videos, social media captions, and real-time transcription with WebSockets.

Step 3: Prompt engineering GPT 3.5 as a profanity filter

Prompt engineering is the process of providing a large language model (LLM) with specific instructions to control how it generates text or to give it a personality. In this tutorial, we need to ensure that GPT 3.5 functions reliably as a profanity filter by analyzing the words and contextual meaning of the transcriptions to identify the profanities in the lyrics.

In the code sample below, we define a very detailed prompt with specific instructions on how GPT 3.5 should handle the transcriptions it receives and how it should structure its output. You can adjust the prompt to customize it for your use case e.g. if you only want to filter out curse words.

def determine_safety(lyrics):
    prompt = f"""
    You will be provided with the lyrics of a song that has been transcribed from the original audio file. There is a possibility that the transcription might not be entirely complete.
    Your duty is to study with intent the overall context of the song and the words used and look out for anything that promotes or denotes violence, sex, drugs, or alcohol. After your analysis,
    you are to determine if the song is safe to listen to for a group of people who dislike profanities. You are not to generate any text in excess. If the song contains profanities, whether explicit or implicit, your response should be 'Profanities detected, song promotes [write what the song promotes here]',
    otherwise your response should be 'No profanities detected, safe to listen to'.
    Note that the only criteria that should be used to determine the safety are the explicit or implicit mention of drugs, violence, sex and alcohol.
    Make sure to go over the context of the lyrics for subtle hints at any of the profanities (drugs, violence, sex and alcohol) and only give a response when you're sure of your answer.
    Make sure that your only responses are 'Profanities detected, song promotes [write what the song promotes here]' and 'No profanities detected, safe to listen to'.

    Here are the lyrics: {lyrics}
    """
    # Assumes a client was initialized earlier, e.g. client = OpenAI(api_key=openai_key)
    response = client.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[
            {"role": "system",
             "content": prompt}
        ],
        temperature=0.25,
    )
    return response.choices[0].message

From the screenshot below, we can observe that the model follows the format of ‘Profanities detected, song promotes [write what the song promotes here]’. A quick study of the song lyrics shows us that the song ‘Many men’ promotes violence and GPT 3.5 through prompt engineering can detect this.

Step 4: Creation of a pipeline to automate the workflow

Since all sub-systems of the profanity filter system we are building in this tutorial have been successfully tested, we create an automatic workflow for splitting, transcription, and classification using the code below.

def run():
    # define the file paths
    path_to_file =  "C:/path/other path/song.mp3"
    output = "C:/path/other path/output_file_name"

    # Function to extract the vocals using Spleeter
    vocal_filename = extract_vocals(path_to_file, output)

    # the path to the newly created folder for the separated music files
    new_audio_path = os.path.join(output, vocal_filename)

    # the path to the newly created vocals.wav file
    audio_file_in_path = os.path.join(new_audio_path, os.listdir(new_audio_path)[0])

    # Function to transcribe the vocals using Gladia Whisper API
    transcription = get_lyrics(filename = vocal_filename,
                            audio_file = audio_file_in_path)

    print(f'THE TRANSCRIPTION \n{transcription}')

    # Function to use GPT 3.5 as a profanity filter
    safe_value = determine_safety(transcription)
    print(safe_value)


In this tutorial, we have demonstrated how to build a profanity filter for songs using a combination of tools: Spleeter, Gladia API, and GPT 3.5 API. 

We used Spleeter to split the audio files into four different components, Gladia to transcribe the vocals of the song, and prompt-engineered GPT 3.5 as a profanity filter. The results of the tutorial showed that each technology in the system performed well, with the prompt-engineered GPT 3.5 correctly detecting the profanities in the transcription.

If you want to explore more hands-on guides like this, see our tutorials on how to summarize audio files using Gladia Whisper API and GPT 3.5, or on building a YouTube transcription tool for auto-captioning.

About Gladia

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features.

