TL;DR: Standard WER benchmarks don’t fully capture what can break meeting assistants in production. The main failure points include crosstalk during interruptions, diarization that struggles with overlapping speech, and multilingual teams code-switching mid-sentence. Handling audio across 100+ languages reliably is critical for accurate transcripts and actionable insights.
Building a meeting assistant isn't just a speech-to-text (STT) accuracy problem. It's an architecture problem that manifests as accuracy failures. The demo works because demos are controlled: one speaker, clear audio, stable internet, and the single language your team speaks. Production looks nothing like that. It's a distributed sales call with five people interrupting each other in English and Spanish, a flaky mobile hotspot, and a privacy lawyer who joined for the last ten minutes and stayed on the call.
The gap between "it works in staging" and "it holds up in production" is where most meeting assistant projects quietly fail. Here are the six mistakes that cause it, and how to build your way out of each one.
Mistake 1: Optimizing for clean benchmarks instead of "messy" production audio
Why traditional WER fails for voice agents
WER measures STT accuracy by comparing output to reference text, but the calculation methodology makes it a poor proxy for meeting transcription quality. As our WER analysis details, the metric assigns equal penalty weight to every error regardless of semantic impact. Substituting "the" with "a" counts the same as substituting "cancel" with "confirm." Two transcripts with identical WER scores can have completely different levels of usability for action-item extraction.
The problem compounds when you consider what WER doesn't measure at all. Punctuation, capitalization, and speaker attribution are invisible to WER scoring, but they're exactly what downstream summarization and action-item models depend on. A vendor's reported WER on a clean benchmark dataset like LibriSpeech tells you almost nothing about how their model handles a 45-minute all-hands with five speakers, a shared presentation in the background, and someone joining from a construction site.
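To make the equal-weighting problem concrete, here is a minimal WER implementation (standard word-level Levenshtein distance, not any vendor's scoring code) showing a harmless substitution and a catastrophic one scoring identically:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "please cancel the order"
# A harmless substitution ("the" -> "a") and a critical one ("cancel" -> "confirm")
# both score one substitution over four reference words:
print(wer(ref, "please cancel a order"))     # 0.25
print(wer(ref, "please confirm the order"))  # 0.25
```

Identical scores, completely different consequences for an action-item extractor.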
The "crosstalk" and interruption reality
When two people speak simultaneously, many STT systems struggle to separate the audio signals. The result can include merged speech segments, dropped words, or degraded transcripts that confuse diarization and downstream meeting summarization or action-item extraction. Research on speaker diarization shows that overlapping speech remains a challenge for some systems, often leading to errors in identifying who spoke when and propagating mistakes into analytics and automated meeting notes.
Testing your integration specifically against overlapping speech from the start prevents this failure mode. Consider building a QA audio set that includes back-to-back interruptions, crosstalk, and rapid speaker handoffs. If your STT layer can't handle that, diarization output becomes unreliable, and meeting summaries fail by construction.
Mistake 2: Underestimating the complexity of real-time speaker diarization
The difference between "what was said" and "who said it"
Speaker diarization partitions an audio stream into homogeneous segments according to speaker identity, answering the question "who spoke when?" without any prior knowledge of the speakers' personal identities. The output is labels like "Speaker 1" and "Speaker 2," not names.
That distinction matters architecturally. When diarization fails, the system assigns action items to the wrong person, scores sentiment against the wrong speaker, and the resulting errors in meeting summaries can be difficult to catch through automated validation.
Diarization is typically implemented as a separate inference process alongside or after transcription, and its accuracy degrades under exactly the conditions that make meeting audio hard: short speaker turns, overlapping speech, and variable microphone quality across participants.
The impact compounds in regulated industries. Analysis from Smith Law highlights the risks AI note-takers introduce in regulated environments, including proceedings where accurate speaker attribution directly affects how sensitive communications are handled. For production readiness, validate diarization specifically against:
- Meetings with four or more concurrent participants
- Audio with mixed microphone quality (headset vs. laptop speaker vs. mobile)
- Speaker turns shorter than two seconds
- Meetings where the same speaker returns after a long silence
Our speaker diarization documentation covers parameter configuration for these edge cases, including how to set speaker count bounds when you know the expected range.
Handling "ghost" speakers and hallucinations
Ghost speakers, where the model invents a new speaker label for background noise, a cough, or a filler word, inflate speaker counts and corrupt attribution downstream. STT models can also generate text that wasn't present in the audio during silence or low-level noise, a hallucination problem that's particularly damaging when output feeds an LLM summarization pipeline.
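A cheap automated check for ghost speakers is to flag any label with almost no total speaking time. A minimal sketch, where the segment tuple format and the two-second threshold are illustrative assumptions rather than any particular API's output schema:

```python
def flag_ghost_speakers(segments: list[tuple[str, float, float]],
                        min_total_s: float = 2.0) -> list[str]:
    """Return speaker labels whose total speaking time falls below a threshold,
    a common signature of 'ghost' speakers invented for noise or coughs."""
    totals: dict[str, float] = {}
    for speaker, start, end in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return [s for s, t in totals.items() if t < min_total_s]

segments = [("Speaker 1", 0.0, 12.5),
            ("Speaker 2", 12.5, 30.0),
            ("Speaker 3", 30.0, 30.4)]  # a 0.4s "speaker" is likely noise
print(flag_ghost_speakers(segments))  # ['Speaker 3']
```

Flagged labels can then be merged into an adjacent speaker or dropped before attribution runs.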
Testing specifically for hallucinations during silence is non-negotiable for meeting assistant reliability. Process audio files with deliberate silent gaps at 10, 30, and 60 seconds and verify that transcript output matches only what was said.
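Generating those silent fixtures takes only the Python standard library; `transcribe` below is a hypothetical stand-in for whatever STT client call your pipeline uses:

```python
import wave

def write_silence_wav(path: str, seconds: int, rate: int = 16000) -> None:
    """Write a mono 16-bit PCM WAV file containing only silence."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate * seconds)

# Generate the QA fixtures for the 10/30/60-second silence checks.
for gap in (10, 30, 60):
    write_silence_wav(f"silence_{gap}s.wav", gap)
    # transcript = transcribe(f"silence_{gap}s.wav")  # hypothetical STT call
    # assert transcript.strip() == "", f"hallucination during {gap}s silence"
```

Run these through the same pipeline as real audio so the check also covers your buffering and endpointing configuration, not just the model.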
Mistake 3: Ignoring the code-switching reality of global teams
When English isn't just English
Code-switching, the practice of alternating between two or more languages within a single conversation, appears constantly in global teams and bilingual support workflows. Sales calls between Madrid account executives and Latin American clients commonly switch between Spanish and English mid-sentence, and APAC engineering standups mix English technical vocabulary into Mandarin or Tagalog sentence structure.
Research published in MDPI identifies code-switching as one of the most difficult unsolved challenges in production ASR because these systems must anticipate that each audio sample may contain more than one language identification value. Standard monolingual models forced into a single language code respond to unexpected language input by producing phonetic approximations in the configured language: Spanish words rendered as garbled English phonemes, or English terms approximated into Mandarin phonetics.
NVIDIA's research on multilingual ASR confirms this: models must be trained with the understanding that each token may require a different language identification, and most state-of-the-art production systems still don't handle fully automated transcription of code-switched audio robustly.
The architectural requirement is clear: if your user base includes any multilingual speaker population, you need an STT API with token-level language identification built into the inference pipeline, not a wrapper that forces a language selection at the request level. Our code-switching capability handles language transitions mid-sentence across our 100+ supported languages in both real-time and async modes, which makes it well suited to BPO (Business Process Outsourcing) contact centers and global team meetings where the audio doesn't stay in one language.
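As an illustration, the difference between forcing a language and enabling multilingual detection might look like the following. The field names are modeled loosely on common STT request schemas and should be checked against the actual API reference rather than taken literally:

```python
# Illustrative request configurations for an STT API; the field names here
# are assumptions for the sketch, not any specific vendor's documented schema.
forced_config = {
    "language_config": {
        "languages": ["en"],     # hard-coded: Spanish audio becomes English phoneme soup
        "code_switching": False,
    }
}

multilingual_config = {
    "language_config": {
        "languages": [],         # empty: let the model detect languages per token
        "code_switching": True,  # allow the language to change mid-stream
    }
}
```

The forced configuration is the one that produces the "phonetic approximation" failure described above the moment a second language enters the call.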
Mistake 4: Treating WebSocket integration as a "set and forget" pipeline
Common connection failures (TLS, outbound ports)
Real-time STT over WebSocket is not a fire-and-forget configuration. Production WebSocket connections fail in specific, predictable ways that are invisible in local development but surface immediately in enterprise deployment environments.
The most common failure modes:
- Code 1006 and network blocks: OneUptime's analysis of WebSocket 1006 errors confirms that abnormal closure without a proper close frame typically results from proxy timeouts, firewall interruptions, or forcible termination by a network appliance. Enterprise configurations often block outbound connections on non-443 ports, causing silent failures at the firewall before any application-level error handling responds. This is the most frequent production failure pattern and it appears with no informative error message.
- HTTP 400 on upgrade: A 400 response during the WebSocket handshake typically means malformed request parameters, incorrect audio format specification, or a missing configuration field. These errors are easy to debug in staging but appear as intermittent production failures when audio encoding settings drift.
- HTTP 401/403 authentication failures: Response headers such as request-id, along with structured error fields, are critical for debugging failed upgrade attempts. Replicate that pattern in your own logging: always capture the server-side request ID on failed connections so you can correlate client-side failures with API-side error states.
The debugging approach for 1006 errors is specific: capture a network trace to determine whether speech service hostnames are being blocked, and whitelist the required endpoints at the proxy or org firewall level before assuming the issue is application code.
The danger of silent retries
The exponential backoff strategy from Hexshift highlights the core production risk: naive retry logic without backoff puts additional load on an already-strained service and produces cascading failures. More damaging for meeting assistants is silent retry behavior, where the connection drops, the client retries immediately with no user signal, and the transcript for the intervening period is simply lost.
The WebSocket keepalive specification recommends ping/pong heartbeats at intervals shorter than your proxy's idle timeout. A standard Nginx configuration times out at 60 seconds, which means a heartbeat every 45 seconds is sufficient to keep the connection alive through meeting silence. The production-ready architecture requires:
- Heartbeat pings every 30-45 seconds to prevent idle timeout disconnects
- Exponential backoff on reconnection attempts, not immediate retry loops
- Server-side request ID logging to correlate client-side failures with API error states
- Audio buffer management to handle the gap between disconnect and reconnect without dropping spoken content
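The backoff item on that list can be sketched in a few lines. This is the full-jitter variant; the base, cap, and heartbeat values are reasonable defaults under the 60-second proxy timeout assumption, not values prescribed by any spec:

```python
import random

# Heartbeat interval kept comfortably below a 60s proxy idle timeout.
HEARTBEAT_INTERVAL_S = 40

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: wait a random amount between 0 and
    min(cap, base * 2^attempt) seconds before reconnect attempt `attempt`."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# The ceiling grows from sub-second to the 30s cap instead of hammering
# an already-strained service with immediate retries.
for attempt in range(6):
    ceiling = min(30.0, 0.5 * (2 ** attempt))
    print(f"attempt {attempt}: wait up to {ceiling:.1f}s")
```

Jitter matters here: without it, every client that disconnected at the same proxy timeout reconnects at the same instant, recreating the load spike that backoff is supposed to prevent.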
Mistake 5: Overlooking legal, privacy, and governance risks
The "hot mic" problem: recording after the meeting ends
The most serious production risk that engineering teams consistently underestimate is recording audio after the participant consent window has closed. Meeting participants consent to being recorded during the meeting, not for the conversations that continue after the formal session has ended.
Faegre Drinker's meeting consent analysis details considerations for AI Meeting Assistants: recording a meeting may create legal exposure in jurisdictions that require all-party consent for recording, such as multiple U.S. states, especially if a recording bot continues capturing audio post-meeting.
Smith Law's analysis identifies scenarios where uninvited capture of sensitive communications could create liability exposure for organizations operating under professional or statutory confidentiality obligations. The architectural requirement is a deterministic meeting-end signal, not a silence-based timeout.
Data residency and compliance (SOC 2, GDPR)
GDPR-compliant meeting assistant guidance from MeetingNotes points out that GDPR mandates data protection by design, with encryption, data minimization, and documented retention windows as legal requirements. Maximum fines reach €20 million or 4% of annual global turnover, and routing EU audio to US servers creates transfer-compliance challenges, requiring Standard Contractual Clauses before you can ship to European customers.
Gladia offers a zero-retention option for paid plans, meaning audio is processed and discarded without model retraining. Our SOC 2 Type 2 certification and GDPR posture are documented on our compliance hub, which means your security review team has something concrete to evaluate before legal reviews the contract.
Mistake 6: Failing to model unit economics at scale (the "add-on" trap)
How per-feature pricing compounds
Standard pricing architectures stack features on top of a base transcription rate: diarization as an add-on, sentiment analysis as a separate add-on, PII redaction at a third tier, summarization at a fourth. Add-on pricing models compound quickly: each feature tier layered onto the base rate pushes the effective per-hour cost higher, and the bill at 10,000 hours looks nothing like the base rate on the pricing page. This pattern appears across enhanced tiers at multiple vendors.
Pricing comparison at 10,000 hours/month:
| Feature configuration | Add-on model (typical rates) | Gladia all-inclusive |
| --- | --- | --- |
| Base STT only | $1,500 | Included |
| + Diarization | $1,700 | Included |
| + Sentiment analysis | $1,900 | Included |
| + Entity extraction | $2,100+ | Included |
The effective cost delta becomes visible only when you model the full feature set your meeting assistant actually requires, not the base rate highlighted on the pricing page.
The predictability of per-second billing
Our all-inclusive model means diarization, sentiment analysis, entity extraction, and audio intelligence features are part of the base rate. Your cost model at 10,000 hours is the hourly rate multiplied by the hours, and that number matches the invoice.
Per-second billing removes a second source of cost variance. Block-based billing at 15-second minimums wastes partial-block time on every call, and that waste accumulates across thousands of short meeting segments. For teams currently on AssemblyAI or Deepgram and evaluating a switch, our migration guides from AssemblyAI and from Deepgram cover the specific API parameter changes required for real-time STT, including WebSocket connection setup and response format differences.
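The partial-block waste is easy to model. A sketch with assumed numbers (10,000 seven-second segments against a 15-second block minimum):

```python
import math

def billed_seconds(duration_s: float, block_s: int = 15) -> int:
    """Round a segment's duration up to the next billing block."""
    return math.ceil(duration_s / block_s) * block_s

# 10,000 short segments of ~7 seconds each, e.g. rapid speaker turns:
durations = [7] * 10_000
per_second = sum(durations)                            # 70,000 s billed
per_block = sum(billed_seconds(d) for d in durations)  # 150,000 s billed
print(per_block / per_second)  # more than 2x the audio actually sent
```

Run the same model against your own segment-length distribution before comparing vendor rate cards; the headline per-hour price tells you nothing about block rounding.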
How to architect for production readiness
Testing against the "unhappy path"
The QA checklist that catches demo-environment assumptions before they become production failures:
- Background noise: Test with ambient office noise, construction audio, and shared-speaker laptop microphones
- Accent diversity: Test with non-native English speakers across at least three distinct regional accent profiles
- Overlapping speech: Build test cases with interruptions and crosstalk
- Code-switching: If your user base includes any bilingual population, test with real mixed-language audio, not monolingual samples in each language separately
- Network degradation: Simulate packet loss at 1%, 2.5%, and 5% to validate your WebSocket reconnection logic under degraded real-world network conditions
- Silent periods: Verify the model produces no hallucinated output during silence segments of 10, 30, and 60 seconds
- Post-meeting audio: Confirm your bot has a deterministic termination mechanism that doesn't rely on silence detection
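For the network-degradation item, a simple way to exercise buffering and reconnection logic in unit tests is to drop audio chunks probabilistically before they reach your client. This is a simulation sketch, not a substitute for real network testing with tools like tc netem:

```python
import random

def drop_chunks(chunks: list[bytes], loss_rate: float, seed: int = 0) -> list[bytes]:
    """Simulate packet loss by dropping each audio chunk with probability
    loss_rate, preserving the order of the survivors."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    return [c for c in chunks if rng.random() >= loss_rate]

# 1000 chunks of 20ms audio (16kHz, 16-bit mono = 640 bytes per chunk),
# degraded at the checklist's 1%, 2.5%, and 5% loss levels:
stream = [b"\x00" * 640] * 1000
for rate in (0.01, 0.025, 0.05):
    degraded = drop_chunks(stream, rate)
    # feed `degraded` into your client and assert on transcript continuity
```

Seeding the generator means a regression that only appears at a specific loss pattern stays reproducible in CI.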
"Gladia provides a highly accurate real-time speech-to-text solution for high volumes of support and service calls. Latency is low and accuracy high, even for numericals." - Verified User in Financial Services on G2
Leveraging our unified audio intelligence API
The case for a unified API is an engineering team capacity argument. Claap's production case study benchmarks what that tradeoff looks like in practice: the case study reports 1-3% WER in production and one hour of video transcribed in under 60 seconds, built on our API. Aircall's implementation reduced transcription time by over 90% after switching from a self-hosted solution, which returned engineering sprint capacity to product work rather than infrastructure maintenance.
For a full evaluation against your own audio, including your specific language mix and noise conditions, our benchmark methodology covers the datasets and audio conditions we use for accuracy testing. The Playground walkthrough lets you test real-time transcription against your own audio files before writing a line of integration code.
Don't build for the demo. The meeting assistants that hold up in production are the ones tested against the worst audio your users will actually send: three people talking at once, a bilingual sales rep, and a WebSocket connection behind a corporate proxy. That's not an edge case. That's Tuesday. Build your QA environment accordingly, model your costs at 10x current volume before committing to a vendor, and resolve your data residency requirements before your first enterprise security review surfaces them as blockers.
Get started with 10 free hours and test your integration in production in less than a day. All features are included by default with no setup fees, add-ons, or hidden costs. Try the system on your own data to explore volume pricing and see how it performs in practice.
FAQs
Why does my real-time transcription drop connection?
The most common causes are enterprise firewalls blocking outbound connections on non-443 ports and load balancers silently timing out idle WebSocket connections after 30-120 seconds. Implement heartbeat pings every 30-45 seconds and whitelist your STT API's WebSocket endpoint on all relevant network appliances.
How do I handle multiple languages in one meeting?
Configure an STT API with robust code-switching capabilities rather than forcing a single-language code at the request level. Single-language forced configurations produce phonetic gibberish when the audio switches languages, because the model attempts to transcribe foreign words as phonemes in the configured language. Gladia’s Solaria-1 supports code-switching across 100+ languages, automatically detecting languages at the token level to maintain accurate transcription even in multilingual audio.
What is the difference between diarization and speaker identification?
Diarization answers "who spoke when" without prior knowledge of the speakers, producing labels like "Speaker 1" and "Speaker 2." Speaker identification maps those labels to specific named individuals using pre-enrolled voiceprints and requires enrollment data before the meeting.
Does Gladia use my meeting audio for training?
No. We offer a zero-retention option for paid plans, where audio is processed and discarded without being used to train models. On paid plans, no opt-out is required and there is no enterprise contract clause to hunt for. Users on the free tier should be aware that their audio may be used for model training.
Key terms glossary
Word Error Rate (WER): The standard metric for STT accuracy, calculated as the sum of substitutions, insertions, and deletions divided by the total number of reference words. WER assigns equal penalty weight to all error types regardless of semantic impact.
Code-switching: The practice of alternating between two or more languages or language varieties within a single conversation, often mid-sentence.
Diarization: The process of partitioning an audio stream into segments according to speaker identity, answering "who spoke when" without prior knowledge of the speakers' names.
WebSocket: A communication protocol providing full-duplex communication over a single TCP connection, required for real-time audio streaming from a meeting bot to an STT API.
Hallucination: In STT, the model generating text that was not present in the audio, most commonly during silence or low-level background noise.
Latency budget: The total time allocated across all pipeline components (audio capture, STT inference, LLM processing, response generation) for a real-time application to remain responsive.
Data residency: The requirement that data be stored and processed within a specific geographic jurisdiction, a compliance requirement under GDPR for EU customer audio.