How to implement advanced speaker diarization and emotion analysis for online meetings

Published on Sep 10, 2024

In our previous article, we discussed how to unlock meeting data by building a speaker diarization proof of concept (POC) for online meetings. The POC identifies speakers in audio streams and gives organizations detailed speaker-based insights into meetings, along with meeting summaries, action items, and more.

However, effective communication goes beyond words. A subtle non-verbal cue or underlying emotional state can reveal how something was said and entirely change the meaning.

Analyzing emotions from various voice interactions such as customer service calls, sales meetings, or online interviews can help unlock deeper insights to predict behavior, take data-driven actions based on expected behavior, and improve quality monitoring over time.

For services like contact centers or sales-focused meeting platforms, this can translate into better sales rep performance, personalized customer assistance, a clearer understanding of customer satisfaction, and more.

In this tutorial, we give you building blocks for integrating advanced speaker-based summaries for online meetings using speaker diarization and emotion analysis. Let’s dive in.

What is speaker diarization and emotion analysis?

Advanced speaker diarization splits an audio recording into speech segments, where each segment corresponds to a specific speaker. It detects changes in speaker identity and groups segments that belong to the same speaker, answering the question "who spoke when."

Emotion analysis, as the name suggests, analyzes the emotional undertones of voice and classifies them into categories such as approval, disappointment, excitement, and curiosity, answering the question "how was that said." For emotion analysis in this tutorial, we’ll use a version of Whisper called Whisper-timestamped and a Hugging Face emotion detection model.

Sentiment vs emotion analysis

Note: Emotion analysis is often confused with sentiment analysis. Sentiment analysis classifies information as positive, negative, or neutral, but it’s not always capable of identifying emotional nuances such as surprise or fear. This is where emotion analysis comes in: it analyzes more complex emotions and undertones. Both sentiment and emotion analysis can be text-based or speech-based.

This POC has various use cases:

  • Corporate governance and compliance: Automates meeting transcriptions for audit trails and legal documentation in highly regulated sectors like finance and healthcare.
  • Educational webinars and online classes: Allows students to search transcripts by speaker and helps educators refine their methods through emotion analysis.
  • Customer support and service reviews: Analyzes customer support calls and performs sentiment assessment to improve staff training and customer satisfaction.
  • Conference and event summaries: Provides easy access to specific parts of conferences, with emotion analysis providing insights into speaker engagement and audience sentiment.
  • Project management meetings: Improves understanding of team dynamics and communication flow, helping with conflict resolution and project success.

Challenges of implementing advanced identification and emotion analysis

High computational requirements

  • Challenge: Processing extensive audio data for diarization, transcription, and emotion analysis requires significant computational resources.
  • Solution: Leveraging cloud computing resources or optimizing algorithms.

Speaker diarization and identification

  • Challenge: Achieving high accuracy in speaker diarization can be difficult, especially in noisy environments or with overlapping speech.
  • Solution: Enhancing audio preprocessing and using advanced machine learning models trained on diverse datasets can improve accuracy.

Privacy and data security

  • Challenge: Handling sensitive audio data involves significant privacy and security concerns.
  • Solution: Implementing robust security protocols and complying with data protection regulations.

Limited context

  • Challenge: The current emotion analysis model often misses out on contextual audio cues like tone, pitch, and energy.
  • Solution: Future enhancements will include models that analyze emotions directly from audio, improving the understanding of sentiments.

Real-time processing requirements

  • Challenge: Real-time transcription is essential but challenging due to the heavy computational requirements.
  • Solution: Implementing Voice Activity Detection (VAD) techniques can process audio segments immediately and help predict the speaker sentiment in real time.
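
To make the VAD idea concrete, here is a minimal sketch of an energy-threshold detector in plain Python. It is only an illustration: the function name `detect_speech`, the 30 ms frame size, and the 0.02 threshold are assumptions, and a production system would more likely use a trained VAD model than a raw energy threshold.

```python
import math
from typing import List, Sequence, Tuple


def detect_speech(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,        # assumed frame size
    threshold: float = 0.02,   # assumed RMS threshold for samples in [-1, 1]
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of regions whose RMS energy exceeds the threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    regions: List[Tuple[float, float]] = []
    start = None
    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        time = offset / sample_rate
        if rms >= threshold and start is None:
            start = time                      # speech begins
        elif rms < threshold and start is not None:
            regions.append((start, time))     # speech ends
            start = None
    if start is not None:
        # The audio ended while speech was still active.
        regions.append((start, len(samples) / sample_rate))
    return regions
```

Each detected region can be handed to the downstream steps as soon as it closes, instead of waiting for the whole recording.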

How to implement advanced speaker diarization and emotion analysis

The diagram explaining steps for implementing advanced speaker diarization and emotion analysis.

Here are the steps for implementing the advanced speaker diarization and emotion analysis POC:

  1. Create speaker embeddings: You need to upload audio samples of known speakers and create a unique digital signature for each participant based on their voice characteristics. These samples are then processed to generate distinct speaker embeddings and serve as a reference for speaker diarization.
  2. Diarization to determine “who spoke when”: The audio file is analyzed to detect different speakers and divide the meeting into parts where each segment represents a single speaker's input.
  3. Speaker diarization to attribute speech segments to corresponding speakers: Each audio segment is compared to the speaker embeddings to identify which segments correspond to which speakers. This comparison matches audio characteristics with known embeddings.
  4. ASR transcription to convert speech within each segment into text: The transcription also includes timestamps that link each piece of text to its specific time in the audio file.
  5. Emotion analysis to examine the tone: The transcribed segments are analyzed by a Hugging Face emotion detection model to predict emotions.

Ready? Let’s get started.

Prerequisites

1. Set up your virtual environment (optional but recommended):

Setting up a Python virtual environment helps you manage dependencies and keep your projects organized. You’ll need Python 3.6 or later:

Create a virtual environment:

    virtualenv env

Activate the virtual environment.

Windows:

    env\Scripts\activate

macOS and Linux:

    source env/bin/activate

2. Get your Hugging Face API key

  1. Sign up to create an account at Hugging Face (if you haven't already) and access pre-trained models.
  2. Visit this page to accept the terms of use.
  3. Navigate to Account settings and get your Hugging Face API key. Store it securely.
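
One way to store the key securely is to keep it out of your code and read it from an environment variable at runtime. The sketch below assumes the variable is called `HF_TOKEN`; both that name and the helper `get_hf_token` are illustrative choices, not part of the original tutorial.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Optional[Mapping[str, str]] = None, var_name: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from an environment variable, or fail loudly."""
    source = os.environ if env is None else env
    token = source.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your Hugging Face API key."
        )
    return token
```

Call `get_hf_token()` wherever a later step needs the key, so the key itself never appears in a notebook or repository.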

Optional: We recommend Google Colab for running notebooks in an environment with a free GPU. This is especially useful for processing large audio files efficiently.

3. Install libraries and packages

We’ll use several libraries and packages in this tutorial.

Pyannote.audio: Speaker diarization with pre-trained models for segmenting and labeling speaker identities. Install it using pip:

SpeechBrain: All-in-one, open-source speech toolkit enabling flexible speech technology experiments. Install it directly from the GitHub repository:

Torchaudio: Gives you access to audio files and transformations:

SciPy: Used for scientific computing; it helps in operations like computing distances between embeddings:

Hugging Face Transformers: A wealth of pre-trained models used for ASR and emotion analysis tasks:

Whisper-timestamped: An enhanced ASR model that offers precise transcription with timestamps, critical for synchronizing transcribed text with audio segments:
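
After installing, a quick check like the one below can confirm that every library is importable before you start. This is a small helper sketch, not part of the original tutorial; the import names in `REQUIRED_MODULES` are the module names these packages are normally imported under, which can differ from the name used at install time (for example, `whisper_timestamped`).

```python
import importlib.util
from typing import Iterable, List

# Import names (assumed) for the libraries listed above.
REQUIRED_MODULES = [
    "pyannote.audio",
    "speechbrain",
    "torchaudio",
    "scipy",
    "transformers",
    "whisper_timestamped",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "pyannote") is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_modules(REQUIRED_MODULES) or "All dependencies found")
```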

4. Audio data

Ensure you have audio recordings for analysis. You can use the audio library featured in this tutorial if you don't have your own recordings.

Step-by-step tutorial

Step 1: Create speaker embeddings

Let’s start by creating unique speaker embeddings for known speakers:

Now we need to:

  • Load the audio files containing samples of known speakers and extract their embeddings.
  • Convert the audio files into waveforms and encode them into speaker embeddings using the previously loaded classifier model.
  • Assign meaningful labels to each known speaker for later reference.
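
The three bullet points above can be sketched as follows. This is a hedged reconstruction, not the tutorial's original code: it assumes SpeechBrain 1.x (older releases expose `EncoderClassifier` under `speechbrain.pretrained`), the `speechbrain/spkrec-ecapa-voxceleb` model, and 16 kHz input audio. The helper names and the file paths in the usage note are hypothetical.

```python
from typing import Callable, Dict, List

Embedding = List[float]


def load_speechbrain_embedder() -> Callable[[str], Embedding]:
    """Return a function that maps an audio file path to a speaker embedding."""
    # Imported lazily so the rest of the module works without the heavy dependencies.
    import torchaudio
    from speechbrain.inference.speaker import EncoderClassifier  # assumes speechbrain>=1.0

    # Assumed model; the tutorial only refers to "the classifier model".
    classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

    def embed(path: str) -> Embedding:
        waveform, _sample_rate = torchaudio.load(path)   # shape: (channels, samples)
        mono = waveform.mean(dim=0, keepdim=True)        # shape: (1, samples)
        # This model expects 16 kHz audio; resample first if your files differ.
        embedding = classifier.encode_batch(mono)        # shape: (1, 1, embedding_dim)
        return embedding.detach().squeeze().tolist()

    return embed


def build_reference_embeddings(
    samples: Dict[str, str],
    embed: Callable[[str], Embedding],
) -> Dict[str, Embedding]:
    """Map each speaker label to the embedding of that speaker's sample file."""
    return {label: embed(path) for label, path in samples.items()}
```

A call might look like `build_reference_embeddings({"Alice": "samples/alice.wav"}, load_speechbrain_embedder())`, where the label becomes the name reported in the later steps.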

Step 2: Diarization

Segment the audio file into different speaker segments to identify "who spoke when."
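
A minimal sketch of this step with pyannote.audio is shown below. It is written against pyannote.audio 3.x and assumes the `pyannote/speaker-diarization-3.1` pipeline and the `use_auth_token` argument; other versions may name these differently. The helper names and the `min_duration` filter are illustrative additions.

```python
from typing import Dict, Iterable, List, Tuple, Union

Track = Tuple[float, float, str]            # (start, end, diarization label)
Segment = Dict[str, Union[float, str]]


def diarize(audio_path: str, hf_token: str) -> List[Track]:
    """Run a pre-trained pyannote pipeline and return (start, end, label) tracks."""
    from pyannote.audio import Pipeline  # imported lazily: heavy dependency

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",  # assumed pipeline name
        use_auth_token=hf_token,
    )
    annotation = pipeline(audio_path)
    return [
        (turn.start, turn.end, label)
        for turn, _, label in annotation.itertracks(yield_label=True)
    ]


def to_segments(tracks: Iterable[Track], min_duration: float = 0.0) -> List[Segment]:
    """Convert tracks to dictionaries, dropping turns shorter than min_duration seconds."""
    return [
        {"start": start, "end": end, "speaker": label}
        for start, end, label in tracks
        if end - start >= min_duration
    ]
```

At this point the labels are anonymous (for example `SPEAKER_00`); matching them to known people happens in the next step.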

Step 3: Speaker diarization

Next we need to identify speakers in each segment of the recording. For each segment, we load the corresponding portion of the audio recording using torchaudio.load() and extract the waveform.

We then pass the waveform to the classifier model to obtain the speaker embedding using the encode_batch() method. The obtained embedding is compared with embeddings of known speakers using cosine distance.

The speaker with the minimum distance is identified as the speaker for the specific segment:

  • If the minimum distance is below a specified threshold, we print the speaker ID along with the start and end times.
  • If no matching speaker is found, we print a message indicating that no matching speaker was found.
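
The matching rule described above can be sketched in plain Python. The cosine distance here follows the same definition as `scipy.spatial.distance.cosine` (one minus the cosine similarity); the function names and the default threshold of 0.5 are assumptions that you would tune on your own recordings.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine distance is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def identify_speaker(
    embedding: Sequence[float],
    references: Dict[str, Sequence[float]],
    threshold: float = 0.5,   # assumed value
) -> Tuple[Optional[str], float]:
    """Return (speaker label or None, minimum distance) for one segment embedding."""
    if not references:
        return None, float("inf")
    distances = {
        label: cosine_distance(embedding, reference)
        for label, reference in references.items()
    }
    best = min(distances, key=distances.get)
    if distances[best] < threshold:
        return best, distances[best]
    return None, distances[best]   # closest speaker is still too far away


if __name__ == "__main__":
    known = {"Alice": [1.0, 0.0], "Bob": [0.0, 1.0]}
    speaker, distance = identify_speaker([0.9, 0.1], known)
    print(speaker or "No matching speaker found", round(distance, 3))
```

A lower threshold makes the match stricter, which reduces false identifications at the cost of more segments reported as unmatched.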

Step 4: Segment transcription with whisper-timestamped

Whisper-timestamped is based on OpenAI Whisper. Unlike the standard Whisper models, it predicts word-level timestamps and estimates speech segment boundaries more accurately.

A confidence score is assigned to each word and each segment.

Use the following code snippets to assign each transcription segment to the right speaker.
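
As a hedged sketch of this step, the code below transcribes a file with whisper-timestamped and then gives each transcribed segment the speaker whose turn overlaps it the most. The overlap rule, the function names, and the `"unknown"` fallback are assumptions for illustration; the original tutorial code may match segments differently.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]


def transcribe(audio_path: str, model_name: str = "tiny", device: str = "cpu") -> List[Segment]:
    """Return whisper-timestamped segments (each has start, end, text, confidence, words)."""
    import whisper_timestamped as whisper  # imported lazily: heavy dependency

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_name, device=device)
    result = whisper.transcribe(model, audio)
    return result["segments"]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript: List[Segment], speaker_turns: List[Segment]) -> List[Segment]:
    """Return copies of the transcript segments with a 'speaker' key added."""
    labeled = []
    for segment in transcript:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in speaker_turns:
            shared = overlap(segment["start"], segment["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**segment, "speaker": best_speaker})
    return labeled
```

Here `speaker_turns` is a list of dictionaries with `start`, `end`, and `speaker` keys, such as the identified segments produced in Step 3.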

Step 5: Emotion analysis

We’ll use a pre-trained model, "SamLowe/roberta-base-go_emotions", from the Hugging Face Transformers library. This model is trained to recognize a wide range of emotions from text inputs.

The analyze_and_append_sentiments function goes through transcription data, analyzes the sentiment of each segment, and assigns the results to the corresponding segments in the JSON data:
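
One way such a function might look is sketched below; it is not the tutorial's original implementation. It assumes each segment is a dictionary with a `text` key, and the `classify` parameter, the `emotions` key, and the `top_n` cut-off are illustrative choices. The classifier is loaded with `top_k=None` so that the model returns a score for every emotion label.

```python
from typing import Any, Callable, Dict, List

Segment = Dict[str, Any]
Scores = List[Dict[str, Any]]   # [{"label": ..., "score": ...}, ...]


def load_emotion_classifier() -> Callable[[List[str]], List[Scores]]:
    """Load the go_emotions text classifier from Hugging Face Transformers."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "text-classification",
        model="SamLowe/roberta-base-go_emotions",
        top_k=None,   # return scores for all labels
    )


def analyze_and_append_sentiments(
    segments: List[Segment],
    classify: Callable[[List[str]], List[Scores]],
    top_n: int = 3,   # assumed: keep only the strongest emotions per segment
) -> List[Segment]:
    """Return copies of the segments with an 'emotions' list added to each."""
    texts = [segment["text"] for segment in segments]
    all_scores = classify(texts) if texts else []
    annotated = []
    for segment, scores in zip(segments, all_scores):
        ranked = sorted(scores, key=lambda item: item["score"], reverse=True)
        annotated.append({**segment, "emotions": ranked[:top_n]})
    return annotated
```

Passing the classifier in as an argument keeps the function easy to test with a stub and lets you swap in a different emotion model later.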

Here is the final result:

Wrap-up

The advanced speaker diarization and emotion analysis POC improves how online meetings are processed and analyzed, making them more accessible, organized, and efficient.

However, this model has its limitations. It currently processes transcribed text without considering rich audio cues such as tone, pitch, and energy, which can alter the emotion of a spoken sentence. The same sentence spoken in a cheerful tone versus an angry tone can convey entirely different emotions.

In our next tutorial, we will show you how to overcome this limitation. You’ll learn how to integrate models that can analyze emotions directly from audio data and leverage auditory cues to provide a more nuanced and accurate analysis. Stay tuned!

About Gladia

Gladia provides a speech-to-text and audio intelligence API for building virtual meeting and note-taking apps, call center platforms, and media products. It offers transcription, translation, and insights powered by best-in-class ASR, LLMs and GenAI models.

Follow us on X and LinkedIn.
