
Meeting bot speech recognition: how real-time transcription powers automated meeting assistants

Published on March 25, 2026
by Ani Ghazaryan

Meeting bot speech recognition requires sub-300ms STT latency, real-time diarization, and code-switching for reliable transcripts. Production meeting bots fail when transcription infrastructure cannot handle multi-speaker overlap, language switching, and speaker attribution in real time.

TL;DR: A meeting bot is only as reliable as its transcription layer. Real-time multi-speaker audio from Zoom or Teams typically targets sub-300 ms STT latency to keep captions and assistant responses feeling responsive. Managed STT APIs such as Gladia handle this infrastructure layer, including diarization, code-switching support, and enterprise compliance requirements. Generic WER benchmarks don't capture the speaker attribution errors that make bot-generated summaries untrustworthy. Self-hosting Whisper can add GPU and engineering overhead that compounds quickly.

For developers, the hard part of building a meeting bot isn't the LLM prompt that generates the summary. It's everything before it: capturing raw audio from conferencing platforms whose APIs were not originally designed for continuous data streaming pipelines, splitting that stream by speaker in real time, handling the moment someone switches from English to French mid-sentence, and doing all of it in under 300 milliseconds so the bot doesn't feel broken.


The architecture of automated meeting assistants

Before choosing an STT vendor, understand the full data path and where each stage can fail. A meeting bot's pipeline isn't a single API call. It's a sequence of components where failures compound and latency budgets are shared.

The pipeline moves through seven stages, each contributing to total end-to-end latency:

  1. Audio capture: The meeting platform makes raw audio available via SDK or streaming API.
  2. Stream ingestion: Your bot container receives the audio stream over WebSocket.
  3. VAD (voice activity detection): The pipeline filters silence before sending audio to the ASR engine.
  4. STT API call: Audio chunks are streamed to the transcription service.
  5. Transcript receipt: Structured JSON with speaker labels, language tags, and word-level timestamps comes back.
  6. LLM processing: The transcript feeds your dialogue manager or summarization model.
  7. Output: Notes, action items, or summaries surface in the UI.

Every stage carries a latency contribution, and the STT layer is usually the largest variable.
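The shared budget can be made explicit in code. The sketch below uses illustrative per-stage numbers, not measured values, to show how the contributions sum against the 800 ms ceiling discussed later:

```python
# Hypothetical per-stage latency budget for the seven-stage pipeline above.
# All numbers are illustrative placeholders, not measured values.
PIPELINE_BUDGET_MS = {
    "audio_capture": 20,
    "stream_ingestion": 30,
    "vad": 10,
    "stt": 300,       # the largest and most variable contributor
    "transcript_receipt": 20,
    "llm_processing": 350,
    "output": 20,
}

def total_latency_ms(budget: dict[str, int]) -> int:
    """Sum per-stage contributions to get end-to-end latency."""
    return sum(budget.values())

def within_budget(budget: dict[str, int], ceiling_ms: int = 800) -> bool:
    """Check the pipeline against an ~800 ms total response ceiling."""
    return total_latency_ms(budget) <= ceiling_ms

print(total_latency_ms(PIPELINE_BUDGET_MS))  # 750
print(within_budget(PIPELINE_BUDGET_MS))     # True
```

Writing the budget down this way makes it obvious that a 500 ms STT layer leaves almost nothing for the LLM and network stages.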

Audio stream ingestion and handling

Zoom's Real-Time Media Streams (RTMS) is a WebSocket-based API that gives your bot access to live audio, video, and transcript data as the meeting happens. The audio format in most headless bot implementations is PCM (Pulse-Code Modulation), delivered in raw chunks over the WebSocket connection.

When you run a headless bot in a Docker container using the Zoom Meeting Linux SDK, it produces both audio.pcm and video.yuv files once recording starts. Forwarding that PCM stream to your STT WebSocket without buffering failures requires careful connection management, particularly handling reconnects without dropping audio frames, because a transcript gap mid-meeting creates an unrecoverable hole in the output.
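One way to guard against dropped frames across reconnects is to keep a small replay buffer of the most recent PCM chunks and resend them after the WebSocket comes back. A minimal sketch; the buffer size and framing are illustrative assumptions, not values from any SDK:

```python
from collections import deque

class ReplayBuffer:
    """Retains the most recent PCM frames so a WebSocket reconnect can
    resend audio that may have been lost mid-flight. A sketch: real
    implementations would also track acknowledged frames."""

    def __init__(self, max_frames: int = 50):
        # deque with maxlen silently evicts the oldest frame on overflow
        self._frames: deque = deque(maxlen=max_frames)

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def replay(self) -> list:
        """Frames to resend immediately after reconnecting."""
        return list(self._frames)

buf = ReplayBuffer(max_frames=3)
for i in range(5):
    buf.push(bytes([i]))
print(len(buf.replay()))  # 3, only the newest frames are retained
```

The reconnect loop itself would pump `replay()` into the new connection before resuming the live stream, trading a small amount of duplicate audio for the guarantee of no transcript gap.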

Multi-channel audio (where each participant's audio arrives on a separate channel) separates participant audio at the capture layer, before your STT engine runs.

Real-time transcription pipeline components

VAD sits between stream ingestion and your STT call. Its job is to determine whether audio frames contain speech before passing them to the ASR engine, which reduces unnecessary API calls on silence-heavy meeting audio and lowers cost.

Once speech is detected, the STT engine runs in streaming mode. Streaming STT produces partial transcription hypotheses as audio arrives, letting downstream components begin processing before speakers finish their sentences, with partial results within 100-200ms of speech onset. The final transcript, with diarization labels and language tags attached, arrives in structured JSON with word-level timestamps you can pass directly to your LLM. Post-processing normalization (formatting numbers, punctuation, proper noun correction) should happen in the STT layer, not in a separate pipeline stage your team has to build and maintain.

Critical challenges in meeting transcription

Meeting audio is not dictation audio. A single speaker reading prepared text into a clean microphone is a solved problem. Multi-speaker, multi-accent, multi-language audio in a video call with variable network quality is not. The following three challenges account for most production failures in meeting bot deployments, and each one compounds the others.

The impact of latency on user experience

The 300ms threshold isn't arbitrary. Humans perceive conversational delays beyond 300ms as unnatural, and exceeding that threshold makes a system feel broken, regardless of how good the summary is at the end of the meeting.

The governing metric is Real-Time Factor (RTF): the ratio of processing time to audio duration. An RTF below 1.0 means the engine processes audio faster than real-time. For a live meeting bot, you need RTF well below 1.0 to leave headroom for LLM processing and network round-trips.
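The RTF calculation is a one-liner, shown here with illustrative numbers:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration. Below 1.0 means the
    engine transcribes faster than real time."""
    return processing_seconds / audio_seconds

# 12 s of compute for a 60 s clip: RTF 0.2, comfortable headroom
print(real_time_factor(12.0, 60.0))  # 0.2
```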

In practice, satisfaction drops measurably at 800ms total response time. For interactive bots that surface real-time captions, flag action items mid-meeting, or respond to direct questions, your STT latency budget needs to stay consistently below 300ms.

Handling multi-speaker overlap and diarization

Speaker diarization is the partitioning of audio by speaker identity. In meeting terms, it answers: who said what, and when.

The difficulty is the cocktail party problem. In a six-person meeting, speakers interrupt each other, talk over each other, and produce overlapping segments that don't resolve cleanly into discrete turns. When a system fails to segment these correctly, wrong speaker attribution errors propagate through every downstream output built on that transcript.

The business impact is direct. Diarization errors corrupt insights in ways that aren't obvious until a user acts on bad information. Consider a product meeting where the lead engineer says "We need to deploy to production by Friday." If diarization misattributes that to a designer, the automated action item system assigns a deployment task to someone without the necessary permissions. The bot hasn't just made an error. It's created a production risk.

The standard evaluation metric is Diarization Error Rate (DER), which measures total time with diarization errors, combining false alarm speech, missed speech, and speaker confusion. A DER of 15% means the system incorrectly labels 15% of total audio time across all error types, which makes meeting notes unreliable enough that users stop trusting them. Diarization must run in the stream, not in a post-processing step. Post-processing diarization means real-time captions carry no speaker labels, which eliminates the core use case for most meeting bot features.
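The DER definition above reduces to a simple ratio of error time to total time, illustrated with durations that reproduce the 15% example:

```python
def diarization_error_rate(false_alarm_s: float, missed_s: float,
                           confusion_s: float, total_s: float) -> float:
    """DER combines false alarm speech, missed speech, and speaker
    confusion, expressed as a fraction of total scored audio time."""
    return (false_alarm_s + missed_s + confusion_s) / total_s

# 60 s of audio: 2 s false alarm + 3 s missed + 4 s confusion -> DER 0.15
print(diarization_error_rate(2.0, 3.0, 4.0, 60.0))  # 0.15
```

Real scoring tools also apply a forgiveness collar around segment boundaries; this sketch omits that detail.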

Multilingual support and code-switching

Code-switching (alternating between two or more languages within a single conversation) is increasingly common in multilingual workplaces, and handling it requires models specifically trained for it, not just multilingual vocabulary.

Generic STT benchmarks rarely reflect what real meetings look like, so we built an open benchmark around real-world audio: multiple speakers, interruptions, accents, domain vocabulary, and background noise, the conditions that actually break speech systems. We also open-sourced the benchmarking methodology so you can reproduce the evaluation, run it against your own audio, and see how different systems perform under realistic conditions. Vendor-reported WER on clean read-speech datasets won't tell you how a system behaves on a 45-minute product review call with five people interrupting each other.

A bilingual support team produces sentences like "Can you send me the reporte by EOD?" A standard English ASR model either forces everything into English or hallucinates a substitution. Intra-sentential switching requires a unified multilingual model architecture to keep a single stream coherent when a speaker changes language mid-phrase.

The performance gap is significant. One study reports WER for code-switched speech ranging from 10% to 65% across different model architectures and language pairs. At the high end of that range, the transcript is unusable. The failure mode in production is typically invisible until it isn't: English-language accuracy looks acceptable in staging, non-English user segments churn quietly, and you discover the gap through support tickets rather than metrics.

Evaluating transcription quality: beyond standard WER

Word Error Rate (WER) is the standard metric: substitutions plus deletions plus insertions, divided by total reference words. The WER measurement guide covers evaluation methodology in detail. For meeting transcription, WER is necessary but insufficient.
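The formula amounts to a word-level edit distance against the reference. A minimal sketch, without the text normalization a real evaluation harness would apply first:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# "two" -> "to" counts the same as any other single substitution
print(word_error_rate("send two copies", "send to copies"))  # ~0.333
```

Note how the metric is blind to meaning: this one-third WER is identical whether the substituted word is harmless or reverses the sentence.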

The core problem is that WER treats all word errors equally. Misrecognizing "two" as "to" and misrecognizing "accept" as "except" both count as single-word errors, but the latter changes sentence meaning entirely. WER also doesn't capture speaker attribution errors at all.

Three additional metrics matter for meeting transcription quality:

  • Diarization Error Rate (DER): Measures speaker attribution accuracy independently of word accuracy.
  • Word-level Diarization Error Rate (WDER): A word-level diarization accuracy metric that governs action item attribution quality directly.
  • Entity-level accuracy: Proper nouns and technical terms need evaluation beyond standard WER. Jaro-Winkler for proper nouns allows partial credit for phonetically similar transcription errors, more accurately reflecting the semantic impact of name recognition failures.
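The Jaro-Winkler similarity mentioned for entity-level scoring can be sketched in a few lines. The classic MARTHA/MARHTA pair is the standard worked example (one transposition, shared three-letter prefix):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, discounted
    by half the transposition count."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    t, k = 0, 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler variant: boost scores for strings sharing a prefix
    (capped at 4 characters), which suits misheard proper nouns."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

A name transcribed with one swapped letter scores near 1.0 here, where exact-match entity scoring would count it as a total failure.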


Build vs. buy: self-hosted Whisper vs. managed APIs

Self-hosted Whisper looks cheap at first pass. The model is open-source and inference cost at low volume is genuinely low. The real cost appears when you add the infrastructure required to make it production-grade for real-time meeting audio.

| Dimension | Self-hosted Whisper | Gladia API |
| --- | --- | --- |
| GPU infrastructure | From $276/month for a dedicated instance | Included in per-second billing |
| Real-time diarization | Requires separate service (Pyannote or similar) | Built into streaming response |
| Code-switching | Not supported natively | 100+ languages with code-switching |
| Maintenance overhead | Ongoing (GPU scaling, model versioning, stability) | Managed by vendor |
| Compliance | DIY (SOC 2, GDPR require your own audit) | SOC 2 Type 2, GDPR, HIPAA certified |

Whisper's production limitations include hallucinations and limited real-time functionality that require substantial engineering work to address at scale. Adding a production-grade diarization service (Pyannote, for example) is a separate integration, a separate maintenance surface, and additional cost. Cloud infrastructure for self-managed setups runs $276 to $1,500 per month for typical deployments, with enterprise-scale workloads pushing costs to $1,500 to $5,000 per month or more, not counting the engineering time to build and maintain it.

The break-even calculation is more nuanced than it appears. At $0.006 per minute via managed API, 500 hours costs $180 per month versus approximately $276 per month for a minimum dedicated GPU instance. Self-hosting becomes rational only when your volume and infrastructure investment can sustain dedicated GPU capacity at significantly higher utilization than most teams achieve. For the majority of teams building meeting bots, the managed API wins on total cost of ownership until you're running very high volumes with a dedicated DevOps function.
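The arithmetic behind that break-even point, using the rates quoted above:

```python
MANAGED_RATE_PER_MIN = 0.006   # $/min managed API, from the comparison above
SELF_HOSTED_FLOOR = 276.0      # $/month, minimum dedicated GPU instance

def managed_cost(hours_per_month: float) -> float:
    """Monthly managed-API bill at the per-minute rate."""
    return hours_per_month * 60 * MANAGED_RATE_PER_MIN

# 500 h/month lands around $180, below the $276 self-hosted floor
print(managed_cost(500))  # ~180

# Volume where the managed bill matches the GPU floor
print(SELF_HOSTED_FLOOR / (60 * MANAGED_RATE_PER_MIN))  # ~766.7 hours
```

So even before counting engineering time, self-hosting only starts to pencil out above roughly 767 hours of audio per month.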

How Gladia powers reliable meeting bots

Meeting bots fail when transcription infrastructure can't deliver sub-300ms latency, in-stream diarization, and code-switching support across global teams. We built our real-time transcription API to solve exactly this infrastructure problem: streaming WebSocket connections with speaker-attributed transcripts, code-switching across a supported subset of 100+ languages, and all audio intelligence features at a single per-hour rate.


"Gladia delivers real-time, highly accurate transcription with minimal latency, even across multiple languages and accents. The API is straightforward and well documented, making integration into our internal tools quick and easy." - Faes W. on G2
"Excellent multilingual real-time transcription with smooth language switching. Superior accuracy on accented speech compared to competitors. Clean API, easy to integrate and deploy to production." - Yassine R. on G2

We position ourselves as your infrastructure partner: we handle the "hearing" layer so your team can focus on the "thinking" layer. Unlike vendors that release their own meeting assistant products on top of the same API infrastructure, our role is fixed as the transcription layer. Your product architecture, your meeting bot features, and your LLM choice sit on top of a transcription layer with no competing product interest.

Real-time diarization and speaker identification

Our diarization feature segments transcripts by speaker, with support for mono, stereo, and multi-channel files. Speaker labels are assigned at the word level in the final transcript, rather than in real time, so your LLM receives fully attributed text after processing.

For multi-channel audio where Zoom or Teams separates participant audio by channel, our diarization uses unique channel detection to identify and avoid transcript repetitions when the same speaker appears in multiple channels. The system automatically selects the appropriate diarization method for your audio and produces speaker-attributed transcripts. Our LiveKit integration supports multi-participant streams, reducing WebSocket connection management complexity for platforms built on LiveKit infrastructure.

"Gladia provides a highly accurate real-time speech-to-text solution for high volumes of support and service calls. Latency is low and accuracy high, even for numericals." - Verified User in Financial Services on G2

Enterprise security and data governance

Meeting audio contains sensitive information: unreleased product plans, financial projections, customer data, personnel discussions. Your STT provider's data handling posture will block enterprise sales cycles if you don't resolve it before legal review.

We maintain SOC 2 and HIPAA compliance, audited on security, availability, confidentiality, processing integrity, and privacy. As a European provider based in France, we are GDPR compliant by default, with other geographies available on request for specific data residency requirements.

The no-retraining default applies to all Pro and Enterprise plan customers, with no separate contractual clause or opt-out process required. On the Free plan, audio may be used for model training; at Pro and Enterprise tiers, we never use your audio to train our models. This is documented in our privacy notice and reviewable before contract signature.

Our audio intelligence features available in the streaming response include sentiment analysis, named entity recognition, and custom vocabulary, all at the base per-second rate with no add-on pricing. The Gladia Playground walkthrough lets you validate audio quality and diarization output before touching the API.

"It's an incredible fast model. We are using the speech recognition model and it's unbelievably good for single or multi-language detection. We've integrated many models into our platform at Line 21, but Gladia is definitely in the top." - Paul B. on G2

For teams processing meeting audio at scale across sales workflows, the Attention case study covers how that pipeline performs in production.

Get started with 10 free hours and have your integration running in less than a day. Explore the meeting bot architecture and pricing model at your target volume through hands-on testing.

Frequently asked questions

What latency should I target for a real-time meeting bot?

Target sub-300ms for the STT layer alone, leaving budget for LLM processing and network round-trips. Satisfaction drops at 800ms total response time, which is the practical upper bound for interactive bot features.

What audio format does Gladia's real-time API accept?

The API accepts WAV/PCM at 16 kHz sample rate, delivered as base64-encoded chunks over WebSocket. Our getting started documentation covers all supported encodings and sample rates.
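The chunk-wrapping step looks roughly like this. The JSON field names below are illustrative assumptions; consult the provider's API reference for the exact message schema:

```python
import base64
import json

def audio_chunk_message(pcm_frame: bytes) -> str:
    """Wrap a raw 16 kHz PCM frame as a base64-encoded JSON message
    for a streaming STT WebSocket. Field names are hypothetical."""
    return json.dumps({
        "type": "audio_chunk",
        "data": base64.b64encode(pcm_frame).decode("ascii"),
    })

msg = audio_chunk_message(b"\x00\x01\x02\x03")
print(json.loads(msg)["data"])  # AAECAw==
```

Base64 inflates payload size by about a third, which is worth remembering when sizing chunk duration against your latency budget.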

Does diarization work in real time or only in post-processing?

Diarization can run in the stream and attach speaker labels in the real-time transcript. Post-processing produces a higher-accuracy final transcript, but the real-time output is usable for live captioning and mid-meeting action item surfacing.

When does self-hosting Whisper become cost-effective vs. a managed API?

At $0.006/min via managed API, 500 hours costs about $180/month versus approximately $276/month minimum for a dedicated GPU instance. Self-hosting only wins at high volumes where dedicated GPU infrastructure is fully utilized and you have DevOps capacity to sustain it.

How does Gladia handle data privacy for enterprise meeting audio?

For Pro and Enterprise plan customers, audio is never used for model training by default. SOC 2 and HIPAA certifications cover security, availability, confidentiality, processing integrity, and privacy, and the DPA is reviewable before contract signature.

Key terminology

Word Error Rate (WER): The percentage of words in a transcript that differ from the reference, calculated as (substitutions + deletions + insertions) / total reference words. Measures transcription accuracy but doesn't capture speaker attribution errors or semantic impact. See our WER guide for evaluation methodology.

Diarization Error Rate (DER): The percentage of total audio time containing any diarization error, combining false alarm speech, missed speech, and speaker assignment errors.

Real-Time Factor (RTF): The ratio of processing time to audio duration. RTF below 1.0 means the engine processes audio faster than real-time. Meeting bots need RTF well below 1.0 to maintain sub-300ms latency after accounting for network and LLM overhead.

Voice Activity Detection (VAD): A filter that determines whether audio frames contain speech before passing them to the ASR engine. Reduces unnecessary API calls and cost on silence-heavy meeting audio.

WebSocket: A protocol enabling real-time bidirectional communication between client and server for consistent low-latency audio transmission. The standard transport for real-time STT APIs.

WDER (Word-level Diarization Error Rate): A metric that measures word-level diarization accuracy by evaluating speaker label correctness at the individual word level. More precise than segment-level DER for action item attribution use cases.
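A simplified WDER sketch, assuming the reference and hypothesis word sequences have already been aligned (real scoring restricts the count to correctly recognized words after alignment):

```python
def wder(ref_labels: list, hyp_labels: list) -> float:
    """Word-level diarization error rate over aligned words: the
    fraction whose hypothesis speaker label differs from the reference."""
    assert len(ref_labels) == len(hyp_labels), "sequences must be aligned"
    wrong = sum(r != h for r, h in zip(ref_labels, hyp_labels))
    return wrong / len(ref_labels)

ref = ["spk1", "spk1", "spk2", "spk2", "spk1"]
hyp = ["spk1", "spk2", "spk2", "spk2", "spk1"]
print(wder(ref, hyp))  # 0.2
```

One mislabeled word out of five gives a 20% WDER, which is exactly the kind of error that reassigns an action item to the wrong person.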
