How to build a speaker identification system for recorded online meetings

Published on Jul 17, 2024

Virtual meeting recordings are increasingly used as a source of valuable business knowledge. However, given the large volume of audio that companies produce in meetings, getting the full value out of those recordings can be tricky.

For instance, a user may want to review action items set by their project manager during the last team call, or collect a quote from a particular speaker in a seminar. Most meeting platforms offer limited solutions here, leaving users with lengthy transcripts to go through.

Speaker identification is designed with these use cases in mind. The feature relies on voice biometrics to distinguish and identify speakers in audio streams, providing organizations with more detailed insights into meetings.

This tutorial will guide you through building a proof-of-concept (POC) speaker identification system for recorded video meetings that can serve as a foundation for more advanced speaker-based features in your online meeting platforms and note-taking apps. You'll discover key methods and tools to recognize different speakers, enabling you to then create meeting summaries, action items, and more based on speaker metadata.

What is speaker identification, and how does it work?

In speech recognition, speaker identification is a process that identifies and distinguishes speakers based on their unique vocal characteristics. In this tutorial, we use speaker identification to analyze voice patterns and match them to known identities.

Our approach involves several key steps:

Speaker diarization: Using the pyannote library, we segment the audio recording into homogeneous parts, each associated with a specific speaker. This step is crucial as it allows us to separate the audio stream into individual speaker segments. To learn more about this, check our deep-dive into speaker diarization.
Speaker embedding extraction: We leverage a pre-trained encoder model from the speechbrain library to extract embeddings from the speech samples. These embeddings serve as compact representations of each speaker's voice characteristics.
Speaker identification: We match segment embeddings against pre-extracted speaker embeddings. By calculating the cosine distance (using scipy.spatial.distance.cdist with metric="cosine"), we can determine the identity of the speaker in each segment based on how close its embedding is to those of known speakers.

This integrated approach of combining speaker diarization, embedding extraction, and cosine distance allows for accurate identification of speakers, which makes meeting recordings easier to manage and saves time.
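
The matching step can be illustrated with a minimal sketch of cosine distance in plain Python. The three-dimensional vectors, the speaker names, and the helper name `cosine_distance` are illustrative assumptions; real speaker embeddings have many more dimensions and would normally be compared with SciPy.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy embeddings (real ones are much longer)
segment = [0.9, 0.1, 0.3]
known = {"Alice": [0.8, 0.2, 0.4], "Bob": [-0.5, 0.9, 0.1]}

# The known speaker with the smallest distance is the best candidate
distances = {name: cosine_distance(segment, emb) for name, emb in known.items()}
best_match = min(distances, key=distances.get)
print(best_match, round(distances[best_match], 3))
```

A distance of 0 means the two vectors point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.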

For example, a developer can use this technology to summarize meeting recordings by identifying segments where the project manager is speaking, enabling the developer to quickly review tasks assigned by the project manager without having to listen to the entire meeting.

Key benefits of speaker identification

Creating meeting summaries

Speaker identification makes it possible to build meeting summaries automatically by attributing each part of the recording to an individual speaker. Essentially, it lets you filter out the speakers you don't need, leaving only those who are relevant. This feature provides a detailed account of each participant's contributions, making it easier to review critical discussions.
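
As a rough illustration of how speaker labels feed a summary, the sketch below totals the speaking time per participant. The `identified_segments` list and the function name are hypothetical; in practice the tuples would come from an identification step like the one built later in this tutorial.

```python
from collections import defaultdict

# Hypothetical identification output: (speaker, start_seconds, end_seconds)
identified_segments = [
    ("Project manager", 0.0, 12.5),
    ("Developer", 12.5, 20.0),
    ("Project manager", 20.0, 31.0),
]

def talk_time_by_speaker(segments):
    """Sum the speaking time of each speaker, in seconds."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)

summary = talk_time_by_speaker(identified_segments)
for speaker, seconds in summary.items():
    print(f"{speaker}: {seconds:.1f}s")
```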

Simplified speaker search

Speaker identification technology enables users to easily search for and retrieve specific segments of a meeting based on who is speaking. For instance, a team leader can quickly access instructions a project manager gives by directly navigating to those parts of the recording. This saves time and allows for better focus on relevant information.
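
Once segments carry a speaker label, this kind of search reduces to a simple filter. The following is a minimal sketch with made-up data; `find_segments` is a hypothetical helper, not part of any library used in this tutorial.

```python
def find_segments(segments, speaker):
    """Return the (start, end) ranges attributed to the given speaker."""
    return [(start, end) for name, start, end in segments if name == speaker]

# Hypothetical labeled segments: (speaker, start_seconds, end_seconds)
labeled_segments = [
    ("Project manager", 0.0, 12.5),
    ("Team leader", 12.5, 20.0),
    ("Project manager", 20.0, 31.0),
]

manager_ranges = find_segments(labeled_segments, "Project manager")
print(manager_ranges)
```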

Personalized meeting playback

Playing back meetings in a personalized way can be difficult when many people are involved. Speaker identification helps users tailor their playback experience to focus on parts where specific speakers are talking. This feature benefits participants who want to concentrate on particular discussions, making it easier to stay efficient without overlooking important information.
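
One possible way to turn a speaker's segments into a playback list is to merge ranges that are separated only by short pauses. This is a sketch under assumptions: the `max_gap` value of one second and the sample ranges are arbitrary choices for illustration.

```python
def merge_ranges(ranges, max_gap=1.0):
    """Merge time ranges separated by at most max_gap seconds."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical ranges (in seconds) where one speaker is talking
speaker_ranges = [(0.0, 4.0), (4.5, 9.0), (30.0, 35.0)]
playlist = merge_ranges(speaker_ranges)
print(playlist)
```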

Key difficulties of identifying speakers across meetings

Identifying who is speaking in different online meetings comes with several technical and practical challenges that can significantly impact how well the identification system works. Here are a few of the main difficulties we faced when developing the POC:

Resource management

Speaker ID systems, especially those analyzing multiple lengthy meetings, need a lot of computing power. It's important to manage these resources well to keep the system running smoothly and efficiently. The main challenges involve dealing with large audio files and running complex deep-learning models without slowing down the system. As audio data grows, ensuring the system can handle heavier workloads without sacrificing performance becomes increasingly difficult. This means fine-tuning data processing pipelines and using more efficient algorithms or more powerful hardware.
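
One way to keep memory use bounded on long recordings is to read the audio in fixed-size windows rather than all at once. The sketch below only computes the windows; the 60-second chunk size and the helper name `chunk_windows` are assumptions. Each pair corresponds to the `frame_offset` and `num_frames` arguments that `torchaudio.load` accepts.

```python
def chunk_windows(total_frames, sample_rate, chunk_seconds=60):
    """Yield (frame_offset, num_frames) pairs covering the file in fixed-size chunks."""
    chunk_frames = int(chunk_seconds * sample_rate)
    for offset in range(0, total_frames, chunk_frames):
        yield offset, min(chunk_frames, total_frames - offset)

# Hypothetical 150-second recording sampled at 16 kHz
windows = list(chunk_windows(total_frames=150 * 16000, sample_rate=16000))
for frame_offset, num_frames in windows:
    print(frame_offset, num_frames)
```

Note that diarization labels produced independently for each chunk are not guaranteed to be consistent across chunks, so this approach suits the embedding extraction step better than the diarization step.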

Optimizing thresholds for speaker similarity

Setting thresholds for speaker similarity involves a delicate balance. If the system is too strict, it may fail to match the same speaker across different sessions; if it is too lenient, it might incorrectly identify different speakers as the same person. Finding a suitable threshold is critical to balancing the sensitivity (true positive rate) and specificity (true negative rate) of the system.

The variability in speech patterns, accents, and intonations among speakers adds complexity to setting universal thresholds. The system must be adaptable enough to accommodate these differences, which often requires sophisticated tuning and ongoing adjustments based on feedback and performance metrics.
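
To make the trade-off concrete, here is a small sketch that sweeps a few candidate thresholds over labeled trials and reports sensitivity and specificity for each. The trial distances are invented for illustration, and the function assumes the list contains at least one same-speaker and one different-speaker trial.

```python
def sensitivity_specificity(trials, threshold):
    """trials: list of (cosine_distance, is_same_speaker) pairs."""
    tp = sum(1 for d, same in trials if same and d < threshold)
    fn = sum(1 for d, same in trials if same and d >= threshold)
    tn = sum(1 for d, same in trials if not same and d >= threshold)
    fp = sum(1 for d, same in trials if not same and d < threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labeled trials: (distance, same speaker?)
trials = [(0.15, True), (0.30, True), (0.55, True),
          (0.45, False), (0.70, False), (0.90, False)]

for threshold in (0.2, 0.4, 0.6, 0.8):
    sens, spec = sensitivity_specificity(trials, threshold)
    print(f"threshold={threshold:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

With a distance-based threshold, raising the value accepts more matches (higher sensitivity, lower specificity), while lowering it does the opposite.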

Dealing with overlapping speech and background noise

During online meetings, it is common for more than one person to speak simultaneously or for interruptions and cross-talk to occur. This can make it challenging to separate and recognize individual speakers, which is crucial for the accuracy and usefulness of meeting transcripts. Online meetings can happen in different settings, some of which may have uncontrollable background noises that can disrupt the audio clarity. Finding ways to minimize the effects of noise on speaker identification accuracy continues to be challenging.
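
A simple mitigation, sketched below under assumptions, is to detect where turns from different speakers overlap in time so those regions can be skipped or handled separately before extracting embeddings. The turn tuples and the `overlapping_pairs` helper are hypothetical.

```python
def overlapping_pairs(turns):
    """Return pairs of turns from different speakers whose time ranges overlap."""
    ordered = sorted(turns, key=lambda t: t[1])  # sort by start time
    pairs = []
    for i, (spk_a, start_a, end_a) in enumerate(ordered):
        for spk_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # later turns start even later, so no more overlaps
            if spk_a != spk_b:
                pairs.append(((spk_a, start_a, end_a), (spk_b, start_b, end_b)))
    return pairs

# Hypothetical diarization turns: (label, start_seconds, end_seconds)
turns = [("SPEAKER_00", 0.0, 5.0), ("SPEAKER_01", 4.2, 8.0), ("SPEAKER_00", 8.0, 10.0)]
overlaps = overlapping_pairs(turns)
print(overlaps)
```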

How to identify speakers across meetings

Prerequisites

Before you start the tutorial, make sure you have the tools to follow along smoothly. Below, we review the tools, libraries, and packages you need and the setup required to create a speaker identification system.

1. Virtual environment (optional but recommended)

Setting up a Python virtual environment helps manage dependencies and keeps your projects organized without conflicts.

You can install virtualenv by executing the following command:

pip install virtualenv

Once virtualenv is installed, create a virtual environment by running:

virtualenv env

Depending on your operating system, activate the virtual environment using one of the following commands:

On Windows:

env\Scripts\activate

On macOS and Linux:

source env/bin/activate

With the virtual environment activated, you can install packages as usual using pip.

2. API keys and accounts

Hugging Face account and API key

Sign up at Hugging Face if you haven't already. This is necessary to access pre-trained models. To use the speaker diarization model, after signing up, visit this page to accept the terms of use. Obtain your Hugging Face API key from your account settings. This key is needed to download pre-trained models programmatically. Store it securely.
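
One common way to store the key securely is to keep it out of the source code and read it from an environment variable. The variable name `HF_TOKEN` and the helper `get_hf_token` below are assumptions made for this sketch.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face API key from an environment variable (name is an assumption)."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Hugging Face API key.")
    return token

# Usage (uncomment once the variable is set in your shell or notebook):
# hf_token = get_hf_token()
```

The returned value can then be passed wherever the tutorial uses the "YOUR_HUGGING_FACE_API_KEY" placeholder.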

Google Colab

While optional, Google Colab is recommended for running notebooks in an environment with a free GPU. This is particularly useful for processing larger audio files more efficiently. Access it via Google Colab.

3. Libraries and packages

To build and integrate a speaker identification system, we will use several key libraries and packages:

a. Python

Make sure Python is installed on your system. You can download it from python.org.

b. pyannote.audio

A powerful Python library for speaker diarization that includes pre-trained models for segmenting and labeling speaker identities in audio files. You can install it using pip:

pip install pyannote.audio

c. SpeechBrain

SpeechBrain is an open-source, all-in-one speech toolkit that lets researchers develop speech-processing algorithms. We will use it primarily for extracting speaker embeddings, which capture the unique characteristics of each speaker's voice. Since we need the latest features and updates, we will install SpeechBrain directly from its GitHub repository. Execute the following command:

pip install git+https://github.com/speechbrain/speechbrain.git@develop

d. Torchaudio

This library provides easy access to audio files and common transformations of audio data, essential for handling the input audio streams.

pip install torchaudio

e. SciPy

SciPy provides scientific computing routines, including functions for computing distances between embeddings, which are essential for speaker identification.

pip install scipy
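
After installing, a quick check like the following sketch can confirm that the packages are discoverable from the active environment. The module list mirrors the imports used later in this tutorial, and `is_installed` is a hypothetical helper.

```python
import importlib.util

REQUIRED_MODULES = ["torch", "torchaudio", "speechbrain", "pyannote.audio", "scipy"]

def is_installed(module_name):
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when a parent package such as "pyannote" is missing
        return False

for name in REQUIRED_MODULES:
    print(f"{name}: {'ok' if is_installed(name) else 'missing'}")
```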

4. Audio data

Begin by collecting a varied set of audio recordings from the online meetings or discussions you want to analyze. These recordings form the base data needed to segment and label individual speakers. For each speaker, gather samples from a wide variety of audio sources so that they can be used as references or for fine-tuning your system. This variety helps you produce more accurate speaker embeddings.
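
When several samples are available for one speaker, a common practice (not the method used in the POC below, which relies on one file per speaker) is to average their embeddings into a single reference vector. The following is a minimal plain-Python sketch with made-up two-dimensional vectors; `l2_normalize` and `average_embedding` are hypothetical helpers.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def average_embedding(embeddings):
    """Average L2-normalized embeddings into a single unit-length reference vector."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

# Hypothetical embeddings from two recordings of the same speaker
samples = [[3.0, 4.0], [4.0, 3.0]]
reference = average_embedding(samples)
print(reference)
```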

Speaker Identification POC: Step-by-step tutorial

Step 1: Loading the necessary models

First, we need to load the models that will help us diarize and identify speakers in audio files. This involves selecting a GPU or CPU device, loading a speaker embedding extraction model, and loading a speaker diarization model.

import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier
from pyannote.audio import Pipeline
from scipy.spatial.distance import cdist

# Check if CUDA is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained model for speaker embedding extraction directly onto the device
# (the device is passed as a string, so no separate .to(device) call is needed)
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", run_opts={"device": str(device)})

# Load the pre-trained pipeline for speaker diarization and move it to the device
# Note: The speaker diarization model requires an API key from Hugging Face.
diarization = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                        use_auth_token="YOUR_HUGGING_FACE_API_KEY")
diarization.to(device)

Step 2: Extracting known speaker embeddings

Before recognizing speakers in a new recording, we need to have a reference of known speaker embeddings. Here, we extract embeddings from the sample audio files of well-known speakers.

# Map each known speaker's name to a reference audio file
speaker_files = {
    "Steve Jobs": "/path/to/SteveJobs.wav",
    "Elon Musk": "/path/to/ElonMusk.wav",
    "Nelson Mandela": "/path/to/NelsonMandela.wav",
}

known_speakers = []     # Reference embeddings, in the same order as known_speaker_ids
known_speaker_ids = []  # To keep track of speaker IDs
for speaker_name, speaker_file in speaker_files.items():
    waveform, sample_rate = torchaudio.load(speaker_file)
    # The embedding model was trained on 16 kHz mono audio: downmix and resample if needed
    waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    waveform = waveform.to(device)
    embedding = classifier.encode_batch(waveform)
    known_speakers.append(embedding.squeeze(1).cpu().numpy())  # Squeeze and move to CPU
    known_speaker_ids.append(speaker_name)

Step 3: Diarization of the meeting recording

Now that we have our known speaker embeddings, we proceed to diarize an actual meeting recording to identify when each speaker is talking.

# Load and diarize the meeting recording
segments = diarization("/path/to/meeting_audio.wav")

Once you've run the diarization model on an audio sample, the output can be displayed as a color-coded graph of the meeting segments corresponding to different speakers.
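
For further processing, it can help to flatten the diarization output into plain records. The sketch below relies on `itertracks(yield_label=True)` yielding `(segment, track, label)` tuples; `Turn` and `FakeAnnotation` are stand-ins invented here so the example runs without pyannote installed, and `to_records` is a hypothetical helper.

```python
from collections import namedtuple

Turn = namedtuple("Turn", ["start", "end"])  # stand-in for a pyannote segment

class FakeAnnotation:
    """Minimal stand-in that mimics itertracks(yield_label=True)."""
    def __init__(self, rows):
        self._rows = rows

    def itertracks(self, yield_label=False):
        for segment, track, label in self._rows:
            yield (segment, track, label) if yield_label else (segment, track)

def to_records(annotation):
    """Convert diarization output into a list of plain dictionaries."""
    return [
        {"speaker": label, "start": segment.start, "end": segment.end}
        for segment, _track, label in annotation.itertracks(yield_label=True)
    ]

fake = FakeAnnotation([(Turn(0.0, 3.2), "A", "SPEAKER_00"), (Turn(3.2, 7.5), "B", "SPEAKER_01")])
records = to_records(fake)
print(records)
```

With a real diarization result, the same function could be called on the `segments` object returned by the pipeline.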

Step 4: Speaker identification process

In this crucial step, we identify speakers in each audio segment derived from the meeting recording. This process involves iterating over each diarized segment, extracting audio for that segment, and comparing it against our known speaker embeddings.

# Set a threshold on cosine distance: a match is accepted only if the distance falls below it.
# Lower values are stricter; tune this value on your own recordings.
threshold = 0.8

# Load the meeting recording once, downmix it to mono, and resample to 16 kHz if needed
meeting_waveform, meeting_sample_rate = torchaudio.load("/path/to/meeting_audio.wav")
meeting_waveform = meeting_waveform.mean(dim=0, keepdim=True)
if meeting_sample_rate != 16000:
    meeting_waveform = torchaudio.functional.resample(meeting_waveform, meeting_sample_rate, 16000)
    meeting_sample_rate = 16000

# Iterate through each segment identified in the diarization process.
# itertracks(yield_label=True) yields (segment, track, label) tuples.
for segment, _track, label in segments.itertracks(yield_label=True):
    start_time, end_time = segment.start, segment.end

    # Slice the audio for this segment out of the meeting recording
    start_frame = int(start_time * meeting_sample_rate)
    end_frame = int(end_time * meeting_sample_rate)
    waveform = meeting_waveform[:, start_frame:end_frame].to(device)

    # Extract the speaker embedding from the audio segment
    embedding = classifier.encode_batch(waveform).squeeze(1).cpu().numpy()

    # Initialize variables to find the recognized speaker
    min_distance = float('inf')
    recognized_speaker_id = None

    # Compare the segment's embedding to each known speaker's embedding using cosine distance
    for i, speaker_embedding in enumerate(known_speakers):
        # cdist expects 2-D inputs, so the arrays are passed as-is rather than wrapped in lists
        distances = cdist(embedding, speaker_embedding, metric="cosine")
        min_distance_candidate = distances.min()
        if min_distance_candidate < min_distance:
            min_distance = min_distance_candidate
            recognized_speaker_id = known_speaker_ids[i]

    # Output the identified speaker and the time range they were speaking, if a match is found
    if min_distance < threshold:
        print(f"Speaker {recognized_speaker_id} speaks from {start_time}s to {end_time}s.")
    else:
        print(f"No matching speaker found for segment from {start_time}s to {end_time}s.")

We loop through each segment provided by the diarization process. Each segment typically represents a single speaker's speech during a specified time range. For each segment, the corresponding audio slice is extracted as a waveform.

The waveform is then passed through a pre-trained classifier model to extract its embedding. The extracted embedding is compared to embeddings of known speakers using cosine distance. The speaker whose embedding has the least distance to the segment's embedding is considered the match if the distance is below a set threshold.

If a matching speaker is identified, their name and the start and end times of the segment are printed. If no match is found, a message indicates no matching speaker was found for that segment.
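
The decision rule described above can be isolated into a small function, sketched here under assumptions: the distances are taken as already computed, and `pick_speaker` and the sample values are hypothetical.

```python
def pick_speaker(distances, threshold):
    """distances: dict mapping speaker name to cosine distance. Return the best name or None."""
    if not distances:
        return None
    name = min(distances, key=distances.get)
    # Accept the closest speaker only if the distance is below the threshold
    return name if distances[name] < threshold else None

# Hypothetical distances between one segment and each known speaker
segment_distances = {"Steve Jobs": 0.35, "Elon Musk": 0.92, "Nelson Mandela": 1.05}
print(pick_speaker(segment_distances, threshold=0.8))
print(pick_speaker(segment_distances, threshold=0.3))
```

Keeping the rule in one place makes it easier to experiment with different threshold values without touching the rest of the loop.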

Step 5: Demonstration of speaker identification results

After completing the speaker identification analysis on these sample audio clips, the following results indicate the specific times and speakers involved in the conversation.

This output reflects the time segments in which each speaker was active in the meeting audio. The timestamps indicate the start and end times of the segments where each speaker was identified based on their voice embeddings. This detailed breakdown can be useful for applications such as generating automated meeting summaries, enhancing searchability in audio archives, and providing personalized playback options in multimedia applications.

Conclusion

With remote work and digital meetings becoming more prevalent in professional settings, analyzing and overseeing audio data is becoming increasingly important. Managing and reviewing entire meeting recordings can be a significant time sink for users. This tutorial addresses this challenge by providing a targeted solution to streamline meeting management for online meeting recording and note-taking solutions.

The speaker identification system effectively identifies speakers within an audio file, showcasing the powerful capability of machine learning models in processing and analyzing audio data. It can be integrated into various applications, providing valuable insights and enhancements to audio content management and user interaction.

If you want to learn more about leveraging speaker identification in combination with top-tier speech-to-text for your LLM-based collaboration and meeting product, sign up or book a demo with us.
