How to build a speaker identification system for recorded online meetings
Published on Jul 17, 2024
by Houssam Eddine Lachemat, PhD
Virtual meeting recordings are increasingly used as a source of valuable business knowledge. However, given the sheer volume of audio data companies produce in meetings, getting the full value out of recorded meetings can be tricky.
For instance, a user may want to review action items set by their project manager during the last team call, or pull a quote from a particular speaker in a seminar. Most meeting platforms offer limited support for this, leaving users with lengthy transcripts to comb through.
Speaker identification is designed with exactly these use cases in mind. The feature relies on voice biometrics to distinguish and identify speakers in audio streams, providing organizations with more detailed insights into meetings.
This tutorial will guide you through constructing a proof-of-concept (POC) speaker identification system for recorded video meetings that can serve as a foundation for more advanced speaker-based features in your online meeting platforms and note-taking apps. You'll discover key methods and tools to recognize different speakers, enabling you to then create meeting summaries, action items, and more based on speaker metadata.
What is speaker identification, and how does it work?
In speech recognition, speaker identification is a process that identifies and distinguishes speakers based on their unique vocal characteristics. In this tutorial, we use speaker identification to analyze voice patterns and match them to known identities.
Our approach involves several key steps:
Speaker diarization: Using the pyannote library, we segment the audio recording into homogeneous parts, each associated with a specific speaker. This step is crucial as it allows us to separate the audio stream into individual speaker segments. To learn more about this, check our deep-dive into speaker diarization.
Speaker embedding extraction: We leverage a pre-trained encoder model from the speechbrain library to extract embeddings from the speech samples. These embeddings serve as compact representations of each speaker's voice characteristics.
Speaker identification: We use cosine similarity to match segment embeddings with pre-extracted speaker embeddings. By calculating the cosine distance (using scipy.spatial.distance.cosine), we can determine the identity of the speaker in each segment based on how similar the embeddings are to those of known speakers.
This integrated approach of combining speaker diarization, embedding extraction, and cosine similarity allows for accurate identification of speakers, which enhances the management process and saves time.
For example, a developer can use this technology to summarize meeting recordings by identifying segments where the project manager is speaking, enabling them to quickly review tasks assigned by the project manager without having to listen to the entire meeting.
Key benefits of speaker identification
Creating meeting summaries
Speaker identification can automatically create meeting summaries by identifying speakers individually. Essentially, it filters out unnecessary speakers, leaving only those who are relevant. This feature provides a detailed account of each participant's contributions, making it easier to review critical discussions.
Simplified speaker search
Speaker identification technology enables users to easily search for and retrieve specific segments of a meeting based on who is speaking. For instance, a team leader can quickly access instructions a project manager gives by directly navigating to those parts of the recording. This saves time and allows for better focus on relevant information.
Personalized meeting playback
Playing back meetings in a personalized way can be difficult when many people are involved. Speaker identification helps users tailor their playback experience to focus on the parts where specific speakers are talking. This benefits participants who want to concentrate on particular discussions, making it easier to stay efficient without overlooking important information.
Key difficulties of identifying speakers across meetings
Identifying who is speaking in different online meetings comes with several technical and practical challenges that can significantly impact how well the identification system works. Here are a few of the main difficulties we faced when developing the POC:
Resource management
Speaker identification systems, especially those analyzing multiple lengthy meetings, need a lot of computing power. It's important to manage these resources well to keep the system running smoothly and efficiently. The main challenges involve dealing with large audio files and running complex deep-learning models without slowing down the system. As audio data grows, ensuring the system can handle heavier workloads without sacrificing performance becomes increasingly difficult. This means fine-tuning data processing pipelines and using better algorithms or more robust hardware.
Optimizing thresholds for speaker similarity
Setting thresholds for speaker similarity involves a delicate balance. If the system is too strict, it may fail to match the same speaker across different sessions; if it is too lenient, it might incorrectly identify different speakers as the same person. Finding a suitable threshold is critical to maximizing both the sensitivity (true positive rate) and specificity (true negative rate) of the system.
The variability in speech patterns, accents, and intonations among speakers adds complexity to setting universal thresholds. The system must be adaptable enough to accommodate these differences, which often requires sophisticated tuning and ongoing adjustments based on feedback and performance metrics.
Dealing with overlapping speech and background noise
During online meetings, it is common for more than one person to speak simultaneously or for interruptions and cross-talk to occur. This can make it challenging to separate and recognize individual speakers, which is crucial for the accuracy and usefulness of meeting transcripts. Online meetings can happen in different settings, some of which may have uncontrollable background noises that can disrupt the audio clarity. Finding ways to minimize the effects of noise on speaker identification accuracy continues to be challenging.
How to identify speakers across meetings
Prerequisites
As you start the tutorial, ensure you have the tools to follow along smoothly. Here, we will review the tools, libraries and packages you need and the setup required to create a speaker identification system.
1. Virtual environment (optional but recommended)
Setting up a Python virtual environment helps manage dependencies and keeps your projects organized without conflicts.
You can install virtualenv by executing the following command:
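The original snippet is not preserved on this page; assuming the standard PyPI package:

```shell
pip install virtualenv
```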
Once virtualenv is installed, create a virtual environment by running:
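A typical invocation, using `venv` as the environment name (the name is an assumption):

```shell
virtualenv venv
```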
Depending on your operating system, activate the virtual environment using one of the following commands:
On Windows:
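Assuming the environment directory is named `venv`:

```shell
venv\Scripts\activate
```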
On macOS and Linux:
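Assuming the environment directory is named `venv`:

```shell
source venv/bin/activate
```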
With the virtual environment activated, you can install packages as usual using pip.
2. API keys and accounts
Hugging Face account and API key
Sign up at Hugging Face if you haven't already. This is necessary to access pre-trained models. To use the speaker diarization model, after signing up, visit this page to accept the terms of use. Obtain your Hugging Face API key from your account settings. This key is needed to download pre-trained models programmatically. Store it securely.
Google Colab
While optional, Google Colab is recommended for running notebooks in an environment with a free GPU. This is particularly useful for processing larger audio files more efficiently. Access it via Google Colab.
3. Libraries and packages
To build and integrate a speaker identification system, we will use several key libraries and packages:
a. Python
Make sure Python is installed on your system. You can download it from python.org.
b. Pyannote.audio
Pyannote.audio is a powerful Python library for speaker diarization and includes pre-trained models for segmenting and labeling speaker identities in audio files. You can install it using pip:
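The standard PyPI install:

```shell
pip install pyannote.audio
```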
c. SpeechBrain
SpeechBrain is an open-source, all-in-one speech toolkit allowing researchers to develop speech-processing algorithms. We will use it primarily for extracting speaker embeddings, which capture the unique characteristics of each speaker's voice. Since we need the latest features and updates, we will install SpeechBrain directly from its GitHub repository. Execute the following command:
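Assuming the default branch of the official repository:

```shell
pip install git+https://github.com/speechbrain/speechbrain.git
```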
d. Torchaudio
This library provides easy access to audio files and common transformations of audio data, essential for handling the input audio streams.
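Torchaudio is also installable with pip:

```shell
pip install torchaudio
```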
e. SciPy
SciPy is great for scientific calculations, such as computing distances between embeddings, which is essential for speaker identification.
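SciPy is installable with pip:

```shell
pip install scipy
```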
4. Audio data
Begin by collecting a variety of audio recordings of the online meetings or discussions you want to analyze, so that the resulting speaker recognition system suits events recorded over the internet. These collected recordings form the base data needed to segment and label individual speakers. Gather samples from a wide variety of audio sources for each speaker so they can be used to build or fine-tune your system; this enables highly accurate voice embeddings.
Speaker Identification POC: Step-by-step tutorial
Step 1: Loading the necessary models
First, we need to load the models to help us identify and diarize speakers from audio files. This involves setting up a GPU or CPU, loading a speaker embedding extraction model, and a speaker diarization model.
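The original snippet is not preserved on this page; what follows is a minimal sketch of the setup, assuming the `pyannote/speaker-diarization-3.1` pipeline, SpeechBrain's `spkrec-ecapa-voxceleb` encoder, and a Hugging Face token exported as `HF_TOKEN` (these identifiers are assumptions, not taken from the original article):

```python
import os

import torch
from pyannote.audio import Pipeline
from speechbrain.inference.speaker import EncoderClassifier

# Prefer a GPU (e.g. on Google Colab) and fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def load_models(hf_token: str):
    """Load the diarization pipeline and the speaker-embedding encoder."""
    # Diarization pipeline (requires accepting the model's terms on Hugging Face).
    diarization_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    ).to(device)
    # ECAPA-TDNN encoder mapping a waveform to a fixed-size speaker embedding.
    embedding_model = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        run_opts={"device": str(device)},
    )
    return diarization_pipeline, embedding_model


if __name__ == "__main__":
    diarization_pipeline, embedding_model = load_models(os.environ["HF_TOKEN"])
```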
Step 2: Extracting known speaker embeddings
Before recognizing speakers in a new recording, we need to have a reference of known speaker embeddings. Here, we extract embeddings from the sample audio files of well-known speakers.
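A sketch of this step, assuming one clean reference WAV file per known speaker under a hypothetical `samples/` directory (file names and speaker names are illustrative):

```python
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

TARGET_SR = 16000  # the ECAPA encoder expects 16 kHz mono audio


def extract_embedding(encoder: EncoderClassifier, wav_path: str) -> torch.Tensor:
    """Return a 1-D voice embedding for a single audio file."""
    waveform, sample_rate = torchaudio.load(wav_path)
    if sample_rate != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_SR)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    return encoder.encode_batch(waveform).squeeze()


if __name__ == "__main__":
    encoder = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb"
    )
    # Hypothetical reference files: one clean sample per known speaker.
    known_speakers = {
        name: extract_embedding(encoder, f"samples/{name}.wav")
        for name in ("alice", "bob", "carol")
    }
```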
Step 3: Diarization of the meeting recording
Now that we have our known speaker embeddings, we proceed to diarize an actual meeting recording to identify when each speaker is talking.
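A self-contained sketch of the diarization call, assuming the same `pyannote/speaker-diarization-3.1` pipeline, an `HF_TOKEN` environment variable, and a recording named `meeting.wav` (all assumptions):

```python
import os

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)

# Segment the meeting recording by speaker.
diarization = pipeline("meeting.wav")

# Each track is (time segment, track id, anonymous speaker label).
for segment, _, label in diarization.itertracks(yield_label=True):
    print(f"{label}: {segment.start:.1f}s -> {segment.end:.1f}s")
```

In a notebook, displaying the `diarization` object renders a timeline of who spoke when.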
Once you've run the diarization model on different audio samples, the output is a color-coded graph showing meeting segments that correspond to different speakers.
Step 4: Speaker identification process
In this crucial step, we identify speakers in each audio segment derived from the meeting recording. This process involves iterating over each diarized segment, extracting audio for that segment, and comparing it against our known speaker embeddings.
We loop through each segment provided by the diarization process. Each segment typically represents a single speaker's speech during a specified time range. The corresponding audio slice is loaded and converted into a waveform for each segment.
The waveform is then passed through a pre-trained classifier model to extract its embedding. The extracted embedding is compared to embeddings of known speakers using cosine distance. The speaker whose embedding has the least distance to the segment's embedding is considered the match if the distance is below a set threshold.
If a matching speaker is identified, their name and speech duration are printed. If no match is found, a message indicates no matching speaker was found for that segment.
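The matching rule described above (nearest known embedding by cosine distance, accepted only under a threshold) can be sketched as a small pure function. The 0.5 threshold and the toy 2-D embeddings below are assumptions for illustration; in the full pipeline, the segment embedding comes from running the SpeechBrain encoder on each diarized segment's waveform, and real ECAPA embeddings are 192-dimensional.

```python
import numpy as np
from scipy.spatial.distance import cosine


def match_speaker(segment_embedding, known_embeddings, threshold=0.5):
    """Return (name, distance) of the closest known speaker, or (None, distance)
    when even the best match is farther than the threshold."""
    best_name, best_dist = None, float("inf")
    for name, embedding in known_embeddings.items():
        dist = cosine(np.asarray(segment_embedding), np.asarray(embedding))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return (best_name, best_dist) if best_dist <= threshold else (None, best_dist)


# Toy 2-D embeddings standing in for real 192-D ECAPA vectors.
known = {"alice": np.array([1.0, 0.0]), "bob": np.array([0.0, 1.0])}
name, dist = match_speaker(np.array([0.9, 0.1]), known)
print(name)  # alice: the segment embedding points almost the same way
```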
Step 5: Demonstration of speaker identification results
After completing the speaker identification analysis on these sample audio clips, the following results indicate the specific times and speakers involved in the conversation.
This output reflects the time segments in which each speaker was active in the meeting audio. The timestamps indicate the start and end times of the segments where each speaker was identified based on their voice embeddings. This detailed breakdown can be handy for applications such as generating automated meeting summaries, enhancing searchability in audio archives, and providing personalized playback options in multimedia applications.
Conclusion
With remote work and digital meetings becoming more prevalent in professional settings, analyzing and overseeing audio data is becoming increasingly important. Managing and reviewing entire meeting recordings can be a significant time sink for users. This tutorial addresses this challenge by providing a targeted solution to streamline meeting management for online meeting recording and note-taking solutions.
The speaker identification system effectively identifies speakers within an audio file, showcasing the powerful capability of machine learning models in processing and analyzing audio data. It can be integrated into various applications, providing valuable insights and enhancements to audio content management and user interaction.
If you want to learn more about leveraging speaker identification in combination with top-tier speech-to-text for your LLM-based collaboration and meeting product, sign up or book a demo with us.