How to implement advanced speaker diarization and emotion analysis for online meetings
Published on Sep 10, 2024
In our previous article, we discussed how to unlock some of the data in online meetings by building a speaker diarization proof of concept (POC) that identifies speakers in audio streams, gives organizations detailed speaker-based insights into meetings, and helps create meeting summaries, action items, and more.
However, effective communication goes beyond words. A subtle non-verbal cue or underlying emotional state can reveal how something was said and entirely change the meaning.
Analyzing emotions from various voice interactions such as customer service calls, sales meetings, or online interviews can help unlock deeper insights to predict behavior, take data-driven actions based on expected behavior, and improve quality monitoring over time.
For services like contact centers or sales-focused meeting platforms, this can translate into better sales rep performance, personalized customer assistance, a better understanding of customer satisfaction, and more.
In this tutorial, we give you building blocks for integrating advanced speaker-based summaries for online meetings using speaker diarization and emotion analysis. Let’s dive in.
What is speaker diarization and emotion analysis?
Advanced speaker diarization splits an audio recording into speech segments, where each segment corresponds to a specific speaker. It then detects changes in speaker identity and groups segments that belong to the same speaker, answering the question "who spoke when".
Emotion analysis, as the name suggests, analyzes the emotional undertones of speech and classifies them into categories such as approval, disappointment, excitement, and curiosity, answering the question "how was that said". For emotion analysis in this tutorial, we’ll use a version of Whisper called Whisper-timestamped and a Hugging Face emotion detection model.
Sentiment vs emotion analysis
Note: Emotion analysis is often confused with sentiment analysis. Sentiment analysis classifies information as positive, negative, or neutral, but it cannot always identify emotional nuances such as surprise or fear. This is where emotion analysis comes in: it analyzes more complex emotions and undertones. Both sentiment and emotion analysis can be text- or speech-based.
This POC has various use cases:
Corporate governance and compliance: Automates meeting transcriptions for audit trails and legal documentation in highly regulated sectors like finance and healthcare.
Educational webinars and online classes: Allows students to search transcripts by speaker and helps educators refine their methods through emotion analysis.
Customer support and service reviews: Analyzes customer support calls and performs sentiment assessment to improve staff training and customer satisfaction.
Conference and event summaries: Provides easy access to specific parts of conferences, with emotion analysis providing insights into speaker engagement and audience sentiment.
Project management meetings: Improves understanding of team dynamics and communication flow, helping with conflict resolution and project success.
Challenges of implementing advanced identification and emotion analysis
High computational requirements
Challenge: Processing extensive audio data for diarization, transcription, and emotion analysis requires significant computational resources.
Solution: Leveraging cloud computing resources or optimizing algorithms.
Speaker diarization and identification
Challenge: Achieving high accuracy in speaker diarization can be difficult, especially in noisy environments or with overlapping speech.
Solution: Enhancing audio preprocessing and using advanced machine learning models trained on diverse datasets can improve accuracy.
Privacy and data security
Challenge: Handling sensitive audio data involves significant privacy and security concerns.
Solution: Implementing robust security protocols and complying with data protection regulations.
Limited context
Challenge: The current emotion analysis model often misses out on contextual audio cues like tone, pitch, and energy.
Solution: Future enhancements will include models that analyze emotions directly from audio, improving the understanding of sentiments.
Real-time processing requirements
Challenge: Real-time transcription is essential but challenging due to the heavy computational requirements.
Solution: Implementing Voice Activity Detection (VAD) techniques can process audio segments immediately and help predict the speaker sentiment in real time.
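To make the VAD idea concrete, here is a toy energy-based detector in plain Python. It is only a sketch: production systems typically use trained VAD models, and the frame length, the threshold, and the function names `frame_rms` and `detect_speech` are illustrative assumptions rather than part of this POC.

```python
import math


def frame_rms(samples, frame_size):
    """Yield the RMS energy of each consecutive, non-overlapping frame."""
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        yield math.sqrt(sum(x * x for x in frame) / frame_size)


def detect_speech(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_s, end_s) spans whose frame energy reaches the threshold.

    `samples` is a sequence of floats in [-1.0, 1.0]. The threshold is an
    assumed value and would need tuning for real recordings.
    """
    frame_size = int(sample_rate * frame_ms / 1000)
    spans, span_start = [], None
    for i, rms in enumerate(frame_rms(samples, frame_size)):
        t = i * frame_size / sample_rate  # start time of this frame
        if rms >= threshold and span_start is None:
            span_start = t  # speech begins
        elif rms < threshold and span_start is not None:
            spans.append((span_start, t))  # speech ended before this frame
            span_start = None
    if span_start is not None:
        # Audio ended while speech was still active: close the last span.
        n_frames = len(samples) // frame_size
        spans.append((span_start, n_frames * frame_size / sample_rate))
    return spans
```

Each detected span can be handed to the transcription and emotion steps as soon as it closes, instead of waiting for the whole recording.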
How to implement advanced speaker diarization and emotion analysis
Here are the steps for implementing the advanced speaker diarization and emotion analysis POC:
Create speaker embeddings: You need to upload audio samples of known speakers and create a unique digital signature for each participant based on their voice characteristics. These samples are then processed to generate distinct speaker embeddings and serve as a reference for speaker diarization.
Diarization to determine “who spoke when”: The audio file is analyzed to detect different speakers and divide the meeting into parts where each segment represents a single speaker's input.
Speaker identification to attribute speech segments to corresponding speakers: Each audio segment is compared to the speaker embeddings to identify which segments correspond to which speakers. This comparison matches audio characteristics with known embeddings.
ASR transcription to convert speech within each segment into text: The transcription also includes timestamps that link each piece of text to its specific time in the audio file.
Navigate to Account settings and get your Hugging Face API key. Store it securely.
Optional: Google Colab is recommended for running the notebooks in an environment with a free GPU. This is especially useful for processing large audio files efficiently.
3. Install libraries and packages
We’ll use several libraries and packages in this tutorial.
Pyannote.audio: Speaker diarization with pre-trained models for segmenting and labeling speaker identities. Install it using pip:
SpeechBrain: All-in-one, open-source speech toolkit enabling flexible speech technology experiments. Install it directly from the GitHub repository:
Torchaudio: Gives you access to audio files and transformations:
SciPy: Used for scientific computing; it helps in operations like computing distances between embeddings:
Hugging Face Transformers: A wealth of pre-trained models used for ASR and emotion analysis tasks:
Whisper-timestamped: An enhanced ASR model that offers precise transcription with timestamps, critical for synchronizing transcribed text with audio segments:
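Once the packages are installed, a quick sanity check can confirm that each one is importable before you run the notebook. The sketch below uses only the standard library. The import names in `REQUIRED_MODULES` are the ones these packages are commonly imported under, and the helper name `missing_modules` is an illustrative assumption.

```python
import importlib.util

# Import names for the packages listed above (assumed, not exhaustive).
REQUIRED_MODULES = [
    "pyannote.audio",
    "speechbrain",
    "torchaudio",
    "scipy",
    "transformers",
    "whisper_timestamped",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent or has no usable spec.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED_MODULES)
    print("Missing packages:", ", ".join(absent) if absent else "none")
```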
4. Audio data
Ensure you have audio recordings for analysis. You can use the audio library featured in this tutorial if you don't have your own recordings.
Step-by-step tutorial
Step 1: Create speaker embeddings
Let’s start by creating unique speaker embeddings for known speakers:
Now we need to:
Load the audio files containing samples of known speakers and extract their embeddings.
Convert the audio files into waveforms and encode them into speaker embeddings using the previously loaded classifier model.
Assign meaningful labels to each known speaker for later reference.
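The steps above can be sketched as follows. This is a minimal illustration, not necessarily the exact code of this POC: the checkpoint name `speechbrain/spkrec-ecapa-voxceleb`, the helper names, and the file paths in the usage note are assumptions. The heavy libraries are imported lazily, so the labeling logic can be tried without them.

```python
def load_speechbrain_embedder(source="speechbrain/spkrec-ecapa-voxceleb"):
    """Return a function that maps an audio file path to a speaker embedding.

    The checkpoint name is an assumption; any SpeechBrain speaker encoder
    with the same interface should work.
    """
    import torchaudio
    try:
        # Import path used by SpeechBrain 1.0 and later.
        from speechbrain.inference.speaker import EncoderClassifier
    except ImportError:
        # Import path used by older SpeechBrain releases.
        from speechbrain.pretrained import EncoderClassifier

    classifier = EncoderClassifier.from_hparams(source=source)

    def embed(path):
        waveform, _sample_rate = torchaudio.load(path)  # [channels, samples]
        # Downmix to mono so the file is encoded as one utterance.
        # Resampling to the rate the model expects is omitted here.
        mono = waveform.mean(dim=0, keepdim=True)       # [1, samples]
        embedding = classifier.encode_batch(mono)       # [1, 1, emb_dim]
        return embedding.squeeze().tolist()

    return embed


def build_reference_embeddings(sample_paths, embed):
    """Map each speaker label to the embedding of that speaker's sample."""
    return {label: embed(path) for label, path in sample_paths.items()}
```

A call might look like `build_reference_embeddings({"Alice": "alice.wav", "Bob": "bob.wav"}, load_speechbrain_embedder())`, where the labels and file names are placeholders.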
Step 2: Diarization
Segment the audio file into different speaker segments to identify "who spoke when."
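A minimal sketch of this step with pyannote.audio is shown below. The checkpoint name `pyannote/speaker-diarization-3.1` and the helper names are assumptions, and the name of the token argument may differ between pyannote.audio releases.

```python
def run_diarization(audio_path, hf_token,
                    model_name="pyannote/speaker-diarization-3.1"):
    """Run a pre-trained pyannote pipeline on one audio file.

    The checkpoint is an assumption. `use_auth_token` matches pyannote.audio
    3.x; other releases may name this argument differently.
    """
    from pyannote.audio import Pipeline  # imported lazily: heavy dependency

    pipeline = Pipeline.from_pretrained(model_name, use_auth_token=hf_token)
    return pipeline(audio_path)


def to_segments(tracks):
    """Convert (turn, track, label) triples into plain dictionaries.

    `tracks` is what `diarization.itertracks(yield_label=True)` yields;
    each `turn` exposes `.start` and `.end` in seconds.
    """
    return [
        {"start": turn.start, "end": turn.end, "label": label}
        for turn, _track, label in tracks
    ]
```

With a diarization result in hand, `to_segments(diarization.itertracks(yield_label=True))` gives a list that is easy to serialize or pass to the next step. The labels at this point are anonymous (for example `SPEAKER_00`), not the names of known speakers.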
Step 3: Speaker identification
Next, we need to identify the speaker in each segment of the recording. For each segment, we load the corresponding portion of the audio recording using torchaudio.load() and extract the waveform.
We then pass the waveform to the classifier model to obtain the speaker embedding using the encode_batch() method. The obtained embedding is compared with embeddings of known speakers using cosine distance.
The speaker with the minimum distance is identified as the speaker for the specific segment:
If the minimum distance is below a specified threshold, we print the speaker ID along with the start and end times.
Otherwise, we print a message indicating that no matching speaker was found.
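The matching logic can be sketched in plain Python as below. It assumes the segment embedding has already been extracted, and the threshold of 0.5 and the function names are illustrative assumptions. `scipy.spatial.distance.cosine` computes the same distance for non-zero vectors and can replace `cosine_distance`.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # treat an all-zero vector as matching nothing
    return 1.0 - dot / norm


def identify_speaker(segment_embedding, known_embeddings, threshold=0.5):
    """Return (label, distance) for the closest known speaker.

    The label is None when even the closest speaker is not below the
    threshold. The default threshold is an assumed value and should be
    tuned on real data.
    """
    best_label, best_distance = None, float("inf")
    for label, reference in known_embeddings.items():
        distance = cosine_distance(segment_embedding, reference)
        if distance < best_distance:
            best_label, best_distance = label, distance
    if best_distance < threshold:
        return best_label, best_distance
    return None, best_distance


if __name__ == "__main__":
    known = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
    label, distance = identify_speaker([0.9, 0.1], known)
    if label is None:
        print("No matching speaker found")
    else:
        print(f"Speaker {label} (distance {distance:.3f})")
```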
Step 4: Segment transcription with whisper-timestamped
Whisper-timestamped is based on OpenAI Whisper, but it additionally predicts word-level timestamps and estimates speech segments more accurately than the base Whisper models.
A confidence score is assigned to each word and each segment.
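As a hedged illustration, the sketch below transcribes a file with whisper-timestamped and lists the words whose confidence falls below a cutoff. The model size, the language, the cutoff of 0.5, and the helper names are assumptions.

```python
def transcribe(audio_path, model_size="tiny", language="en", device="cpu"):
    """Transcribe one audio file and return the whisper-timestamped result."""
    import whisper_timestamped as whisper  # imported lazily: heavy dependency

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_size, device=device)
    return whisper.transcribe(model, audio, language=language)


def low_confidence_words(result, min_confidence=0.5):
    """Return (text, start, end, confidence) for words below the cutoff.

    `result` is the dictionary returned by `transcribe`: a list of segments,
    each with a list of words carrying timestamps and a confidence score.
    """
    flagged = []
    for segment in result["segments"]:
        for word in segment.get("words", []):
            if word["confidence"] < min_confidence:
                flagged.append(
                    (word["text"], word["start"], word["end"], word["confidence"])
                )
    return flagged
```

Flagged words are natural candidates for manual review before the transcript is used for summaries or emotion analysis.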
Use the following code snippet to assign each transcription segment to the right speaker.
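One way to do this, sketched under assumptions, is to give each transcription segment the speaker whose time span overlaps it the most. The dictionary keys (`start`, `end`, `text`, `speaker`) and the function names are illustrative and may differ from the data structures used in this POC.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Return the length in seconds of the overlap between two time spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript_segments, speaker_segments, unknown="unknown"):
    """Return copies of the transcript segments with a "speaker" key added.

    Each transcript segment gets the speaker whose span overlaps it the most;
    segments that overlap no speaker span are labeled with `unknown`.
    """
    labeled = []
    for seg in transcript_segments:
        best_label, best_overlap = unknown, 0.0
        for spk in speaker_segments:
            shared = overlap(seg["start"], seg["end"], spk["start"], spk["end"])
            if shared > best_overlap:
                best_label, best_overlap = spk["speaker"], shared
        labeled.append({**seg, "speaker": best_label})
    return labeled
```

Choosing the largest overlap, rather than the first match, keeps the labeling stable when a transcription segment straddles a speaker change.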
Step 5: Emotion analysis
We’ll use a pre-trained model, "SamLowe/roberta-base-go_emotions", loaded through the Hugging Face Transformers library. This model is trained to recognize a wide range of emotions from text inputs.
The analyze_and_append_sentiments function goes through transcription data, analyzes the sentiment of each segment, and assigns the results to the corresponding segments in the JSON data:
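A minimal sketch of such a function is shown below. Its signature, the `sentiments` key, and the `top_n` parameter are assumptions rather than the POC's exact code. The classifier is passed in as a plain function, so the logic can be tried with a stub before the model is downloaded.

```python
def load_emotion_classifier(model_name="SamLowe/roberta-base-go_emotions"):
    """Return a function mapping text to a list of {"label", "score"} dicts."""
    from transformers import pipeline  # imported lazily: heavy dependency

    pipe = pipeline("text-classification", model=model_name, top_k=None)

    def classify(text):
        # With a list input and top_k=None, each output item is the list of
        # scores for every emotion label of the corresponding input text.
        return pipe([text])[0]

    return classify


def analyze_and_append_sentiments(segments, classify, top_n=3):
    """Attach the top-scoring emotions to every transcription segment.

    `segments` is a list of dictionaries with a "text" key, for example the
    parsed JSON transcription. The list is updated in place and returned.
    """
    for segment in segments:
        scores = sorted(
            classify(segment["text"]), key=lambda s: s["score"], reverse=True
        )
        segment["sentiments"] = scores[:top_n]
    return segments
```

In a notebook, the call would be `analyze_and_append_sentiments(segments, load_emotion_classifier())`, where `segments` holds the speaker-labeled transcription.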
Here is the final result:
Wrap-up
The advanced speaker diarization and emotion analysis POC improves the processing and analysis of online meetings, making them more accessible, organized, and efficient.
However, this model has its limitations. It currently processes transcribed text without considering the rich audio cues such as tone, pitch, and energy that can alter the emotion of a spoken sentence — the same sentence spoken in a cheerful tone versus an angry tone can convey entirely different emotions.
In our next tutorial, we will show you how to overcome this limitation. You’ll learn how to integrate models that can analyze emotions directly from audio data and leverage auditory cues to provide a more nuanced and accurate analysis. Stay tuned!
About Gladia
Gladia provides a speech-to-text and audio intelligence API for building virtual meeting and note-taking apps, call center platforms, and media products, providing transcription, translation, and insights powered by best-in-class ASR, LLMs and GenAI models.