TL;DR: The core decision for meeting assistants in 2026 is whether your STT layer can attribute speech to the right speaker during overlapping conversation, handle language switching mid-sentence, and deliver both at sub-300ms partial latency. Gladia's Solaria-1 hits 103ms optimal partial latency (per Gladia's self-reported benchmarks) with bundled diarization and native code-switching across 100+ languages, at a reported $0.55/hour with all features included. Deepgram Nova-3 includes diarization but has more limited multilingual coverage and separate audio intelligence feature pricing. AssemblyAI leads on async post-processing, but its real-time latency trails both. Self-hosted Whisper fails all three dimensions at production scale.
Building a meeting assistant requires balancing three competing constraints: latency low enough for live captions, diarization accurate enough for summaries, and costs predictable enough to model at scale. This guide compares Gladia, Deepgram, AssemblyAI, and OpenAI Whisper specifically for meeting bot architectures, filtering on the criteria that actually determine whether your product holds up in a live 12-person call with three accents and two languages.
Core requirements for meeting assistant infrastructure
Generic transcription APIs were built for single-speaker audio at controlled recording quality. Meeting audio is none of those things: overlapping speech, variable microphone distance, background noise, and participants who switch between English and their native language mid-sentence. When your STT layer fails to handle these conditions, every downstream feature built on top of it (summaries, action items, sentiment analysis) fails too.
Three capabilities separate a meeting-grade STT API from a generic one.
Speaker diarization: Knowing who said what is the foundational requirement for any useful meeting intelligence. Without it, a transcript becomes a wall of unlabeled text that no summary model can reliably attribute. Speaker diarization identifies and labels each individual speaker, turning raw audio into structured, actionable data where enterprises can track meeting actions by individual.
Fast transcript delivery: Meeting assistants often balance real-time responsiveness with post-processing accuracy. Live captions benefit from low-latency partial transcripts, while summaries and action items are typically generated from finalized transcripts produced shortly after the meeting ends. In practice, many products combine both approaches: partial transcripts for responsiveness during the meeting and more accurate finalized transcripts for downstream processing.
Code-switching detection: Global teams do not speak in one language per call. An API that requires manual language selection per session, or that degrades accuracy when the language shifts mid-utterance, delivers a broken experience for large portions of your user base.
Real-time latency budgets and partials
The latency budget for a meeting assistant has two distinct phases. Partials drive the feeling of responsiveness: a user watching text appear at 270ms after they start speaking experiences the product as immediately responsive. Finals drive the accuracy of downstream features, because your summarization model, action item extractor, and sentiment pipeline all run on corrected final transcripts.
A WebSocket streaming connection maintains a persistent channel that eliminates the 50-100ms per-request overhead of REST connections. Over a 30-minute meeting, that overhead compounds into cumulative seconds of delay with REST versus near-zero added latency with WebSocket. Gladia's live transcription parameters documentation covers the configuration options for optimizing partial versus final transcript behavior for meeting bot use cases.
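The compounding claim above can be made concrete with some back-of-envelope arithmetic. This sketch assumes one REST round trip per finalized utterance (roughly one every five seconds) and the 50-100ms per-request overhead cited above; both figures are illustrative assumptions, not measured values.

```python
# Sketch: cumulative REST connection overhead versus a persistent
# WebSocket over a 30-minute meeting. Assumes one REST round trip per
# finalized utterance (~every 5s) and 50-100ms of per-request setup
# overhead; both figures are illustrative assumptions.

MEETING_SECONDS = 30 * 60
UTTERANCE_INTERVAL_S = 5           # one finalized utterance every ~5s (assumed)
REST_OVERHEAD_S = (0.050, 0.100)   # 50-100ms per-request connection overhead

requests = MEETING_SECONDS // UTTERANCE_INTERVAL_S   # round trips per meeting

low = requests * REST_OVERHEAD_S[0]
high = requests * REST_OVERHEAD_S[1]

print(f"{requests} REST round trips over 30 minutes")
print(f"cumulative REST overhead: {low:.0f}-{high:.0f}s")
print("WebSocket: one handshake, near-zero added per-utterance overhead")
```

Under these assumptions, a 30-minute meeting accumulates tens of seconds of pure connection overhead on REST, all of which a single persistent WebSocket handshake avoids.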
Per Gladia's self-reported benchmarks, Solaria-1 achieves 103ms optimal partial latency, 270ms average, and 698ms for final transcripts. Those numbers sit inside the sub-800ms threshold that production voice AI applications target. These figures are published by Gladia and have not been independently verified at the time of writing.
Speaker diarization in multi-speaker environments
Overlapping speech creates the core technical challenge in meeting transcription: most current systems assign only one speaker per segment and fail when multiple speakers are simultaneously active, leading to missed speech errors and reduced accuracy. Real meetings produce high rates of exactly this pattern: spontaneous conversation, interruptions mid-sentence, and simultaneous responses.
Diarization failures create consequences that go far beyond cosmetic issues. A misattributed quote in a transcript, where the model assigns a statement to the wrong speaker, corrupts every feature built on top of it. If your meeting summary says the VP of Engineering raised a budget concern when it was actually the CFO, the transcript has actively misled its reader.
For engineering teams building meeting products, the practical question is whether the API maintains diarization quality under real meeting conditions, not clean two-speaker test recordings. That means handling:
Overlapping speech: Two speakers starting a word simultaneously
Speaker re-identification: Maintaining attribution after someone leaves and rejoins audio
Similar voice characteristics: Distinguishing speakers with similar accents or vocal qualities
Speaker scaling: Maintaining attribution accuracy as concurrent speaker count increases
Gladia's speaker diarization documentation covers the technical implementation. Gladia's diarization is powered by the industry-leading pyannoteAI Precision-2 model, which delivers strong performance on multi-speaker audio benchmarks. Diarization is included in the base rate rather than metered separately. Deepgram's diarization documentation and AssemblyAI's speaker diarization documentation cover each provider's implementation and any associated pricing.
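To see why per-speaker attribution is the foundation for downstream features, consider what a summarizer consumes. This sketch groups diarized utterances by speaker and computes per-speaker talk time; the response shape is a simplified, hypothetical illustration, not any provider's actual schema.

```python
# Sketch: turning diarized transcript output into per-speaker structure.
# The utterance shape below is a hypothetical simplification; consult
# your provider's diarization docs for the actual response schema.
from collections import defaultdict

utterances = [
    {"speaker": 0, "start": 0.4, "end": 3.1, "text": "Let's review the Q3 budget."},
    {"speaker": 1, "start": 3.0, "end": 5.8, "text": "I have a concern about headcount."},
    {"speaker": 0, "start": 5.9, "end": 7.2, "text": "Go ahead."},
]

def by_speaker(utterances):
    """Group utterance text by speaker label, preserving utterance order."""
    grouped = defaultdict(list)
    for u in utterances:
        grouped[u["speaker"]].append(u["text"])
    return dict(grouped)

def speaking_time(utterances):
    """Total seconds attributed to each speaker (overlap counts for both)."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u["speaker"]] += u["end"] - u["start"]
    return dict(totals)

print(by_speaker(utterances))
print(speaking_time(utterances))
```

Note the overlap between the first two utterances (3.0s-3.1s): a diarization layer that collapses this into one segment would misattribute one of the two statements, which is exactly the failure mode described above.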
"Gladia provides a highly accurate real-time speech-to-text solution for high volumes of support and service calls. Latency is low and accuracy high, even for numericals. We've appreciated the quality of support across pre-processing, post-processing, and model optimization." — Verified user on G2
Code-switching and multilingual support
In real-world meetings, speakers do not stay in one language. A bilingual team standup between London, Paris, and São Paulo will produce a call where participants shift languages when clarifying a technical term or responding to a colleague in their shared native language. Traditional STT systems require a declared language at session initiation and then hold to it, which breaks when the language changes mid-utterance.
Code-switching is the real-world pattern where speakers switch languages mid-call, and an API that detects and transcribes the active language dynamically handles this without user intervention. Gladia's automatic language detection annotates results with the detected language code and works across all 100 supported languages in real-time mode. Unlike many transcription systems that infer language from accent and may misclassify speech (for example labeling a French-accented English speaker as French), Gladia detects the spoken language directly and maintains transcription accuracy even with strong regional accents.
That coverage includes 42 languages that Gladia claims, per the company's self-reported figures, are not covered by any competing API. The claim has not been independently verified, but Gladia's published language list is available for review.
For meeting assistant products with global user bases, this is not an edge case. Your QA pipeline likely passed on English audio. The support tickets from Finnish, Bengali, or Tagalog-speaking users are a separate failure mode that only surfaces after launch.
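In practice, handling code-switching is a session-configuration decision plus a per-utterance language annotation to consume. The `language_behaviour` value below is Gladia's documented setting (see the FAQ later in this piece); the rest of the config and the response shape are simplified illustrations, not the actual API schema.

```python
# Sketch: enabling automatic multi-language detection at session start,
# then tallying which languages actually appeared in a call.
# "language_behaviour" is a documented Gladia setting; everything else
# here is a simplified, hypothetical illustration.
from collections import Counter

session_config = {
    "language_behaviour": "automatic multiple languages",  # documented setting
    # other session parameters omitted; see the live transcription
    # parameters documentation for the full set
}

# Hypothetical per-utterance results annotated with detected language codes
results = [
    {"language": "en", "text": "Let's sync on the deploy."},
    {"language": "fr", "text": "Je vérifie le ticket."},
    {"language": "en", "text": "Thanks, ship it."},
]

language_mix = Counter(r["language"] for r in results)
print(language_mix.most_common())  # which languages the call actually used
```

Tallying detected language codes like this is also a cheap production health check: a call that your pipeline expected to be monolingual but that reports three languages is worth inspecting.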
"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors." — Yassine R. on G2
Top real-time STT models compared
The table below compares Gladia, Deepgram, AssemblyAI, and OpenAI Whisper on the five dimensions that determine production fitness for a meeting assistant. Pricing reflects publicly available rates as of Q1 2026; verify current rates on each vendor's pricing page before modeling costs.
| Feature | Gladia (Solaria-1) | Deepgram (Nova-3) | AssemblyAI (Streaming) | OpenAI Whisper (Self-Hosted) |
|---|---|---|---|---|
| Partial latency | 103ms optimal, 270ms avg (self-reported) | 202ms interrupt latency (per Gladia comparison) | Not published | Not viable for real-time |
| Final latency | ~698ms (self-reported) | Sub-300ms | Not published | Batch only |
| Diarization | Included in base rate | Included in Nova | Priced separately | Requires separate pipeline |
| Code-switching (real-time) | 100+ languages, native detection | Limited; check language docs | 6 languages (streaming only) | Not real-time |
| Languages (real-time) | 100+ (42 claimed unique) | Limited multilingual | Limited multilingual | 99 languages |
| Pricing model | ~$0.55/hr (all features included) | Base rate + audio intelligence add-ons | Base rate + add-ons | GPU infra + DevOps overhead |
| Free tier | 10 hours/month | Yes | Yes | N/A |
| WebSocket support | Yes | Yes | Yes | Requires custom infrastructure |
Gladia
According to Gladia, Solaria-1 was trained on noisy, real-world, multi-speaker audio rather than clean studio recordings, a stated design priority intended to improve production performance on meeting audio.
The model’s primary differentiator is its multilingual coverage and language detection capabilities, enabling accurate transcription across a wide range of languages and accents with native code-switching support.
Gladia's benchmark results show Solaria-1 achieving a 94% word accuracy rate on English, tested on the Mozilla Common Voice and Google FLEURS datasets as well as anonymized enterprise contact center audio. In these evaluations, Solaria-1 reportedly outperformed Deepgram's Nova-3 in the multilingual test context. These results are based on benchmarks published by Gladia; additional third-party benchmark comparisons are available on sites such as Artificial Analysis.
For meeting assistant use cases specifically, three features drive the practical difference:
Bundled audio intelligence: According to Gladia's pricing page, speaker diarization, sentiment analysis, custom vocabulary, and named entity recognition are included at the base rate, not metered separately.
Native code-switching: Language detection happens dynamically mid-utterance without requiring a session restart or manual language declaration.
Partial transcript speed: Per Gladia's published benchmark, Solaria-1's 103ms optimal partial latency is faster than Deepgram's previously published interrupt latency.
Claap, a meeting and video workspace product, reported reaching 1-3% WER in production and transcribing one hour of video in under 60 seconds. Aircall reported reducing transcription time by 95% after switching from a self-hosted solution, according to a Gladia case study attributed to Aircall's engineering team.
"In less than a day of dev work we were able to release a state-of-the-art speech-to-text engine!... We have tested it across many many languages (we work with commentators in pro sports around the world) and have found great accuracy even with custom fields such as team names, player names, etc. We have never come across any sort of hallucination." — Xavier G. on G2
The audio intelligence features extend beyond transcription into sentiment analysis, named entity recognition, and custom vocabulary. For product teams building meeting intelligence on top of the transcription layer, all-inclusive pricing removes the cost compounding problem where each additional feature multiplies the hourly bill. For teams migrating from a competitor, the migration guide from Deepgram covers the WebSocket parameter mapping to reduce switchover time.
Deepgram
Deepgram's Nova-3 model includes speaker diarization and word-level timestamps in its base rate, which addresses the core attribution requirement. Its interrupt latency benchmarks at 202ms per Gladia's published comparison, though that figure comes from Gladia's own testing, not an independent source. Deepgram's own documentation and status page are the authoritative resources for evaluating production reliability.
Nova-3's multilingual coverage for real-time code-switching is more limited than Gladia's 100-language real-time support. Audio intelligence features beyond basic transcription and diarization, including sentiment analysis and named entity recognition, carry separate pricing per Deepgram's pricing page. That add-on structure makes unit economics harder to model at scale, particularly when processing high volumes of multi-feature meeting pipelines.
For products serving a primarily English-speaking market where diarization is the only required intelligence feature, Deepgram is fast and capable. For teams with accented speakers, mid-call language switching, and full audio intelligence requirements, the language coverage ceiling and add-on pricing are the relevant trade-offs to evaluate.
AssemblyAI
AssemblyAI's strongest position is in async post-processing. Its LeMUR framework processes up to 10 hours of audio in a single API call, enabling question answering, summaries, and action item extraction on long recordings. For teams building async meeting recap features on batch-processed audio, that is a differentiated capability.
Gladia also supports async post-processing through its audio-to-LLM pipeline, enabling similar workflows such as summarization, entity extraction, and other downstream analysis. However, LeMUR's depth of LLM integration, including custom prompt templating and multi-file context, is more developed as of this writing. Teams whose primary use case is async summarization should evaluate both against their specific prompt engineering requirements.
The trade-off is real-time performance. AssemblyAI does not publish specific latency figures for its streaming API; the comparison table reflects the absence of published numbers rather than a measured value. AssemblyAI's streaming documentation is the authoritative source for current performance specifications. AssemblyAI's streaming code-switching covers six languages (English, Spanish, Portuguese, French, German, and Italian), compared to Gladia's 100+ in real-time mode. For a meeting assistant serving European or North American markets in those specific languages, that may be sufficient. For any broader language coverage, it is a hard ceiling.
Speaker diarization and additional audio intelligence features carry separate pricing in AssemblyAI's model per their pricing page, which creates the same cost projection complexity at scale. The migration guide from AssemblyAI covers the parameter mapping if your team is evaluating a switch.
OpenAI Whisper (API vs. open source)
Whisper's accuracy on clean, controlled audio remains a strong baseline. The problem for meeting assistant applications is that the self-hosted Whisper model processes audio in 30-second internal chunks and was not designed for real-time streaming at production scale.
Without additional optimization layers, the self-hosted model does not batch requests on the GPU, which wastes compute and raises latency. For real-time diarization specifically, self-hosted Whisper requires a separate diarization pipeline layered on top of transcription output, which adds integration complexity and another latency variable. The OpenAI-hosted API has expanded streaming support: while the older whisper-1 model does not support streaming, newer models like gpt-4o-transcribe support streaming transcription with stream=True. However, for real-time voice applications requiring the lowest latency, OpenAI's Realtime API reached general availability in August 2025 and is the recommended path rather than the transcription API.
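The 30-second window constraint mentioned above is the root of the streaming problem: incoming audio must be buffered and split before inference. This sketch shows only the chunking arithmetic under Whisper's documented 16kHz mono input; model loading and inference are omitted.

```python
# Sketch: Whisper's fixed 30-second input window means streaming audio
# must be buffered and split into chunks before inference. This shows
# the chunking arithmetic only; model loading and inference are omitted.

SAMPLE_RATE = 16_000          # Whisper expects 16kHz mono PCM
CHUNK_SECONDS = 30            # fixed internal window size
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def chunk_audio(samples):
    """Split a flat sequence of PCM samples into 30-second windows.
    The final partial window is kept; it would be padded to 30s
    before inference, which is wasted compute for short remainders."""
    return [samples[i:i + CHUNK_SAMPLES]
            for i in range(0, len(samples), CHUNK_SAMPLES)]

# A 95-second recording produces three full windows plus a 5s remainder
samples = [0] * (SAMPLE_RATE * 95)
chunks = chunk_audio(samples)
print(len(chunks))                    # number of windows
print(len(chunks[-1]) / SAMPLE_RATE)  # remainder length in seconds
```

The remainder highlights the latency problem for live captions: a caption cannot finalize until its window closes, so without an optimization layer the floor on responsiveness is tied to the buffering strategy, not the model's inference speed.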
GPU infrastructure for self-hosting runs approximately $276/month minimum for a dedicated instance, plus $50-200/month in DevOps overhead, and the engineering team maintaining that infrastructure is not building the meeting intelligence features that differentiate your product. The optimization techniques for Whisper at scale required to approach production-viable latency represent weeks of engineering work that a managed API eliminates.
Integration challenges with Zoom, Teams, and Google Meet
Getting audio out of a meeting platform is a prerequisite to transcription, and the method you choose determines your latency floor before you ever call the STT API.
Meeting bots (virtual participants) join the meeting as a participant using the platform's SDK, capture audio per participant, and stream it to your transcription pipeline. Meeting bots provide per-participant audio streams rather than a mixed feed, which is critical for diarization accuracy since the model receives clean per-speaker input rather than having to separate speakers from a composite signal. This approach works consistently across Zoom, Microsoft Teams, and Google Meet, and the bot-to-pipeline latency is generally well within the range needed for real-time transcription, though exact figures depend on your infrastructure configuration.
RTMP live streaming presents significant limitations for meeting intelligence use cases. RTMP delivers a single mixed audio feed with no per-participant tracks, so you lose the diarization quality advantage of clean speaker-separated input and must rely entirely on model-side speaker separation. RTMP protocol latency typically adds 3-5 seconds of delay, which makes live captions challenging, and platform consistency is a problem: Google Meet has no API for RTMP configuration, and Microsoft Teams requires manual setup with no programmatic control. For meeting assistant products where live captions and speaker-attributed summaries are both requirements, the bot approach is the only architecture that delivers both reliably.
Gladia's recommended parameters by use case documentation covers the WebSocket configuration for bot-to-API streaming, including buffer and endpointing settings optimized for meeting audio.
Cost modeling for high-volume meeting bots
Pricing for meeting assistant infrastructure fails the same way most infrastructure pricing fails: the headline rate looks reasonable until you enable the features the product actually needs. Sentiment analysis, named entity recognition, and custom vocabulary are what make a meeting transcript useful beyond raw text. If each is a billing add-on, the actual cost per hour is a multiple of the base rate that is difficult to model without a detailed pricing call.
Here is what the math looks like at scale for an all-inclusive model versus an add-on model.
At 10,000 hours/month:
| Provider | Base Rate (Real-Time) | Diarization | Audio Intelligence Features | Estimated Total @ 10,000 hrs |
|---|---|---|---|---|
| Gladia | ~$0.55/hr (reported) | Included | Included | See pricing page for verified totals |
| Deepgram | See Deepgram pricing | Included in Nova | Priced as add-ons | Base rate + add-on cost per feature |
| AssemblyAI | See AssemblyAI pricing | Priced as add-on | Priced as add-ons | Base rate + add-on cost per feature |
| Whisper (Self-Hosted) | GPU (~$276/mo min) | Separate pipeline required | Not bundled | GPU cost + DevOps overhead |
Deepgram and AssemblyAI do not publish flat all-in rates equivalent to Gladia's model; their final per-hour cost for a full meeting intelligence pipeline depends on which features you enable and at what volume tier. Verify current rates directly on their pricing pages before building a cost model.
Gladia's pricing page publishes the full rate structure, including what is covered at each tier. Billing is reportedly per-second rather than per block, which matters when you process high-volume meeting audio with variable call lengths. Per-second billing means you pay for exactly the audio you process, with no rounding overhead accumulating across a high-volume pipeline.
The free tier provides 10 hours of transcription each month at no cost, so you can test the integration against your actual meeting audio before committing to a volume plan.
"Their transcription quality is the best for many languages. Their support is high quality; you can even contact their CTO, etc. It made a difference for our services. Their documentation is clear and easy to integrate and implement." — Verified user on G2
Choosing the right architecture
Three variables determine the right STT architecture for your meeting assistant: your latency requirement, your language coverage need, and your cost model at target volume.
Choose Deepgram Nova-3 if your product serves a primarily English-speaking market, Nova's included diarization covers your core attribution needs, and your audio intelligence requirements beyond diarization are minimal. Nova-3 delivers genuine speed and strong accuracy on English.
Choose AssemblyAI if your meeting assistant is async-first and the primary value proposition is post-processing quality, specifically LLM-powered summaries and action item extraction on completed recordings. If live captions are not a core feature, the real-time latency gap matters less, and LeMUR's batch processing capability is differentiated for that use case.
Avoid self-hosted Whisper for real-time meeting use cases unless you are processing well above 3,000 hours per month and have the engineering bandwidth to maintain GPU infrastructure, build a separate diarization pipeline, and debug the edge cases that surface in production meeting audio. The infrastructure overhead that looks like cost savings at project start tends to consume sprint capacity that should go to product work.
Choose Gladia if your meeting assistant serves a global user base with multiple languages and accents, diarization accuracy on overlapping speech is non-negotiable for your summary features, and you need to model costs predictably at scale without audio intelligence add-ons compounding against the base rate. Claap's reported result (1-3% WER in production, with one hour of video transcribed in under 60 seconds) is the most relevant production evidence for this use case because it comes from a team building a comparable product. That result is sourced from Gladia's published case study and attributed to Claap's engineering team.
The buyer's guide to speech-to-text APIs covers the evaluation framework beyond the meeting assistant use case if you are assessing Gladia for additional audio pipelines.
Gladia's free tier reportedly includes 10 hours of audio processing, which is sufficient for evaluating transcription accuracy and latency under realistic meeting conditions before committing to a paid plan. Diarization, code-switching, sentiment analysis, and named entity recognition are reportedly included at the base rate with no setup fees or add-ons. Teams requiring a guided walkthrough of the meeting assistant configuration can request a demo directly with the Gladia team.
Frequently asked questions
What partial transcript latency should I target for live meeting captions?
Industry guidance for voice AI applications targets sub-500ms initial response time for conversational use cases, while live captioning can tolerate 1-3 seconds of delay before users notice the lag. Optimizing partials toward sub-300ms serves both use cases and keeps the full pipeline well within natural conversation timing. Per Gladia's self-reported benchmarks, Solaria-1 achieves 103ms optimal and 270ms average for partials, with finals at ~698ms.
Does Gladia's diarization work on overlapping speech in real-time?
Yes. Gladia's speaker diarization handles overlapping speech in real-time mode, and it is included in the base rate rather than priced separately.
What is the difference between per-second and per-block billing for meeting transcription?
Per-second billing charges for exact audio duration processed. Per-block billing (typically 15-second increments) rounds up partial blocks, inflating costs across a high-volume pipeline with variable-length meetings where partial blocks accumulate at scale.
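The gap between the two models is easy to quantify. This sketch compares per-second billing against 15-second-block billing on a handful of variable-length calls; the hourly rate and call durations are illustrative assumptions, not any vendor's actual pricing.

```python
# Sketch: per-second vs 15-second-block billing on variable-length calls.
# The rate and call durations below are illustrative assumptions.
import math

RATE_PER_HOUR = 0.55      # illustrative hourly rate
BLOCK_SECONDS = 15        # block granularity to compare against

calls_seconds = [312, 47, 1805, 604, 92]  # variable-length meeting audio

def cost_per_second(durations):
    """Bill exact audio duration processed."""
    return sum(durations) / 3600 * RATE_PER_HOUR

def cost_per_block(durations):
    """Round each call up to the next full block before billing."""
    billed = sum(math.ceil(d / BLOCK_SECONDS) * BLOCK_SECONDS
                 for d in durations)
    return billed / 3600 * RATE_PER_HOUR

exact = cost_per_second(calls_seconds)
blocked = cost_per_block(calls_seconds)
print(f"per-second: ${exact:.4f}  per-block: ${blocked:.4f}  "
      f"overhead: {100 * (blocked / exact - 1):.1f}%")
```

The percentage overhead looks small on five calls, but it is pure rounding loss that scales linearly with call volume: at tens of thousands of short, variable-length calls per month, it becomes a visible line item.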
What is the cost difference between bot-based audio capture and RTMP for meeting transcription?
Bot capture provides per-participant audio streams with latency that is generally compatible with real-time transcription requirements. RTMP adds 3-5 seconds of protocol delay with a single mixed feed and no speaker attribution. The latency difference and the loss of per-participant streams make RTMP a poor fit for live caption requirements.
Can Gladia detect language changes mid-sentence without a session restart?
Yes. Setting language_behaviour to "automatic multiple languages" enables real-time code-switching detection across all 100 supported languages without interrupting the stream.
What is Gladia's free tier for testing meeting bot integrations?
10 hours of transcription per month at no cost, with all features including diarization and code-switching included at every tier.
How does Gladia's multilingual real-time coverage compare to AssemblyAI streaming?
Gladia supports 100+ languages with native code-switching in real-time mode. AssemblyAI's streaming code-switching covers six languages: English, Spanish, Portuguese, French, German, and Italian.
Key terminology
Word error rate (WER): The standard accuracy metric for transcription, calculated as the number of substitutions, deletions, and insertions divided by the total number of words in the reference transcript. Because insertions count against the score, WER can theoretically exceed 100%. The dataset and audio condition matter as much as the number itself.
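The definition above translates directly into a word-level edit distance. This is a minimal reference implementation of the standard WER formula, not any vendor's scoring code.

```python
# Sketch: word error rate via word-level edit distance. Substitutions,
# deletions, and insertions are counted against the reference word
# count, which is why WER can exceed 100% when the hypothesis inserts
# many extra words.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("send the budget summary to finance",
          "send a budget summary to finance"))  # 1 substitution / 6 words
```

When comparing vendor WER claims, run a function like this over the same audio and reference transcripts for every candidate: as the glossary notes, the dataset and audio condition matter as much as the number itself.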
Speaker diarization: The process of segmenting audio into sections attributed to individual speakers. Critical for meeting transcription where "who said what" determines the utility of summaries and action items.
Code-switching: The pattern where speakers shift between languages mid-conversation. Production-grade meeting APIs detect and transcribe language changes dynamically rather than requiring a fixed language declaration at session start.
WebSocket: A persistent bidirectional connection protocol used for real-time audio streaming. Unlike REST, which opens and closes a connection per request, WebSocket maintains an always-open channel that eliminates 50-100ms of per-request overhead across a full conversation session.
RTMP (Real-Time Messaging Protocol): A streaming protocol used by some meeting platforms for audio export. Adds 3-5 seconds of protocol latency and delivers a single mixed audio feed, which limits its fitness for live captions and speaker-attributed transcription.
Partial transcript: The in-progress, incomplete transcript returned while a speaker is still talking. Drives the perceived responsiveness of live caption UIs and is distinct from the final transcript, which is the corrected output after the utterance is complete.
Per-second billing: A billing granularity model where you pay for exactly the audio duration processed. Contrasts with per-block billing where partial blocks round up, inflating invoices on high-volume pipelines with variable-length calls.