
Multilingual meeting transcription: language coverage, accuracy, and code-switching challenges

Published on March 25, 2026
by Ani Ghazaryan

Multilingual meeting transcription requires testing code-switching, accented speech, and diarization on real audio before committing. Standard WER benchmarks degrade 2.8 to 5.7x in production, so evaluate APIs on your own noisy meeting recordings to avoid user churn from accuracy failures.

TL;DR: The number of supported languages tells you almost nothing about how a meeting assistant performs for a distributed global team. The failure points that drive user churn are code-switching (switching languages mid-sentence), accented speech on low-bandwidth microphones, and diarization errors in multi-speaker calls. Standard WER benchmarks on clean studio datasets routinely overstate real-world accuracy: production environments with accented speech, background noise, and overlapping speakers consistently produce materially higher error rates than published figures suggest. APIs that price diarization and language detection as add-ons can cost significantly more than their advertised rate at scale. Test on your own noisy, accented, code-switched meeting audio before you commit to any vendor.

Building a meeting assistant for a distributed workforce requires more than a long list of supported languages. It requires an audio pipeline capable of navigating code-switching, heavy accents, and overlapping speech in real-time, and most standard vendor evaluations won't tell you whether your chosen API can handle any of those conditions. This guide gives you the framework to test multilingual STT performance accurately, before your users deliver the verdict through support tickets.

Two baseline metrics anchor this evaluation. WER (word error rate) measures accuracy as the sum of insertions, deletions, and substitutions divided by the total reference word count. RTFx (real-time factor) measures throughput as audio duration divided by processing time, so an RTFx of 100 means 100 seconds of audio processed per second of compute time. Both matter here, but neither means much if you're measuring them on the wrong data.
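
To make the two definitions concrete, here is a minimal sketch of both metrics in Python. Production evaluations typically use a dedicated library (for example, jiwer) that also handles text normalization, but the core calculation is just a word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (insertions + deletions + substitutions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: audio duration / processing time."""
    return audio_seconds / processing_seconds

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25 (1 substitution / 4 words)
print(rtfx(3600, 36))                                     # 100.0
```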

Why standard WER benchmarks fail for global meeting assistants

The most commonly cited STT accuracy figures come from datasets like LibriSpeech, which is clean, read speech recorded from audiobooks. Models routinely hit 95%+ accuracy on LibriSpeech, and that number circulates freely in marketing materials. What those materials omit is the gap between that dataset and the audio your users actually produce.

Standard benchmarks exclude:

  • Overlapping speech from multiple speakers
  • Low-bandwidth microphones (typical Zoom or Teams quality after compression)
  • Background noise from offices, cafes, and street environments
  • Non-native accents and syllable-timed speech patterns

A systematic review of clinical ASR systems published in JAMIA found consistent performance degradation when moving from controlled benchmark conditions to real-world production environments. For meeting transcription specifically, the AMI Meeting Corpus is a far more honest benchmark because it captures multi-speaker dynamics and realistic room acoustics, while LibriSpeech and CallHome, two commonly cited ASR test sets, exhibit up to 10% WER difference for the same engine.

Speaker diarization compounds this problem. Diarization is the process of identifying who spoke when in multi-speaker audio, attributing each word or segment to the correct speaker label. When diarization errors stack on top of transcription errors in a six-person meeting, the output becomes unusable even at modest WER levels. A vendor's homepage WER figure measured on clean English audio tells you very little about accuracy on a Tuesday afternoon Zoom call with four non-native English speakers and someone dialling in from a train.

The technical challenges of multilingual transcription in production

Two technical challenges determine whether a multilingual STT system survives contact with real global meeting audio: accent handling and code-switching. Standard models trained primarily on native-speaker audio fail on both, often in ways that don't surface until users start submitting support tickets.

Handling accented speech and non-native speakers

Accent robustness is a baseline requirement for global meeting assistants, not an edge case. The acoustic properties of accented speech create specific failure modes that standard models aren't designed to handle.

One well-documented example: Indian English often exhibits a syllable-timed rhythm rather than the stress-timed rhythm common to most English accents, and may show less distinction between aspirated and non-aspirated sounds. Models not trained on these patterns misattribute syllable boundaries and produce substitution errors that compound across a 60-minute meeting transcript.

Mozilla Common Voice and Google FLEURS both provide diverse accent coverage and are publicly available for evaluation. A model that performs well on FLEURS across your target language regions gives you a more reliable signal than a clean-data WER score.

"Gladia delivers real-time, highly accurate transcription with minimal latency, even across multiple languages and accents. The API is straightforward and well documented, making integration into our internal tools quick and easy." - Faes W. on G2

Gladia has been recognized by users for its strong performance with accents; one Reddit user specifically highlighted how well it handles diverse pronunciations compared to standard models.

Code-switching and why most models fail

Code-switching is what happens when a speaker alternates between two languages within a conversation or within a single sentence. Linguists distinguish two main types: intrasentential switching occurs within a sentence ("Tengo que ir to the mall"), while intersentential switching occurs at sentence boundaries ("If I'm late, pues, ni modo"). Both types appear regularly in multilingual business meetings and both expose the architectural limits of most STT models.

When a model lacks native code-switching support, it fails predictably. The model locks onto the first detected language and attempts to transcribe the second language using the phonetic vocabulary of the first, producing phonetic gibberish. The second failure mode is unintended translation: the model silently converts the second language into the first rather than preserving the original.

This pattern appears consistently in Whisper user reports. GitHub discussions indicate that enabling the transcribe task has no effect when multiple languages are present, with output defaulting to translation regardless of intent. A separate thread reports the same issue: when French and English are both spoken, the model translates the entire audio into French. The Whisper repository notes that the model alternates between transcription and translation tasks depending on training, which makes code-switching behavior fundamentally unpredictable.

For a meeting assistant, this means a bilingual sales call with a Latin American prospect can produce a hallucinated English transcript with no indication that the source audio contained Spanish.

How to evaluate multilingual STT performance: A checklist for product leaders

Real-time vs. asynchronous processing trade-offs

Live meeting assistants require final transcript delivery in under 300ms for real-time note display and voice agent response. Asynchronous (batch) processing is more affordable and optimal for post-meeting summaries, action items, and other use cases where users don’t mind waiting, allowing higher accuracy since the model has access to more audio context.

Real-time models operate on short context windows to meet latency requirements, which reduces their ability to resolve ambiguous phonemes or speaker attribution. Async models process the full recording before returning results, which is why batch WER is typically lower than streaming WER for the same model and the same audio.

If you're building a live assistant, test time-to-first-byte (TTFB) separately from latency to final. Both affect the user experience in live contexts, and conflating them will produce misleading vendor comparisons.
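
A simple harness can log the two measurements separately from the same event stream. The sketch below assumes a generic streaming client that yields `("partial" | "final", text)` events; the fake stream stands in for whatever SDK or WebSocket connection your vendor provides:

```python
import time

def measure_latency(event_stream):
    """Log TTFB and latency-to-final separately for one utterance.

    `event_stream` is any iterable yielding ("partial" | "final", text)
    events from a streaming STT connection.
    """
    start = time.monotonic()
    ttfb = None
    final_latency = None
    for kind, _text in event_stream:
        elapsed = time.monotonic() - start
        if ttfb is None:
            ttfb = elapsed            # first partial token arrives
        if kind == "final":
            final_latency = elapsed   # complete, corrected segment arrives
    return ttfb, final_latency

# Simulated stream: two partial results, then the final transcript segment.
def fake_stream():
    for kind, delay in [("partial", 0.05), ("partial", 0.05), ("final", 0.1)]:
        time.sleep(delay)
        yield kind, "..."

ttfb, final = measure_latency(fake_stream())
print(f"TTFB: {ttfb*1000:.0f}ms, latency to final: {final*1000:.0f}ms")
```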

Testing methodology: Datasets and noise conditions

Vendor-provided sample audio only tells you how the API performs on vendor-selected audio. Test on your own production data, or data that closely matches it, to get the signal your users will actually experience.

A structured evaluation should cover five distinct audio conditions:

  1. Accented speech test: Source audio from speakers with the accents most common in your user base. Test WER per accent group separately using Mozilla Common Voice samples for your target languages.
  2. Code-switching test: Use clips where speakers switch languages mid-sentence, not just between sentences. Verify the transcript preserves the original language rather than translating.
  3. Multi-speaker crosstalk test: Use clips with 2-3 seconds of overlapping speech to check diarization accuracy and transcription quality during interruptions.
  4. Low-bandwidth audio test: Downsample clean audio to 8kHz (standard VoIP quality) and re-run transcription. This simulates typical Zoom or Teams audio after compression.
  5. Background noise test: Mix clean audio with office or cafe background noise at varied signal-to-noise ratios. The CHiME Challenge benchmarks are the standard reference for noisy real-world conditions.
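
Condition 5 above requires mixing noise at a controlled signal-to-noise ratio. A minimal sketch of the scaling math, using synthetic sample lists for self-containment (a real harness would read and write WAV files instead):

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then mix.

    `clean` and `noise` are equal-length lists of float samples.
    """
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    # SNR(dB) = 10 * log10(p_clean / (scale^2 * p_noise))  =>  solve for scale
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]  # 1s, 440Hz tone
noise = [random.uniform(-1, 1) for _ in range(16000)]                    # white noise

noisy_10db = mix_at_snr(clean, noise, snr_db=10)  # moderate office noise
noisy_0db = mix_at_snr(clean, noise, snr_db=0)    # noise as loud as the speech
```

Running each provider on the same clips at several SNR levels (e.g. 20, 10, and 0 dB) shows you where accuracy starts to collapse, which matters more than any single WER figure.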

Minimum test volume: A standard evaluation uses at least 10,000 words (roughly one hour of audio) per condition. Test each provider with identical audio through an identical harness, logging TTFB and latency to final separately. Any provider that won't give you a self-serve API key to run your own evaluation before a sales call is telling you something about how they expect their numbers to hold up.

Comparing top multilingual STT providers for meeting intelligence

The comparison below is a directional framework focused on the variables that most directly affect production performance for meeting assistants, not an exhaustive or independently verified ranking. Vendor capabilities evolve quickly, so treat it as a starting point for your own evaluation rather than a definitive assessment.

| Provider | Language coverage | Native code-switching | Pricing model | Real-time latency |
| --- | --- | --- | --- | --- |
| Gladia (Solaria-1) | 100+ languages, 42 exclusive | Yes, parameter-enabled | All-inclusive, per-second | ~270ms TTFB, ~698ms to final |
| OpenAI Whisper | ~100 languages | Multilingual transcription supported; code-switching behavior varies with configuration | API pricing varies; open-source model can be self-hosted | Not optimized for streaming; varies by hardware and chunk size |
| Google Cloud STT (Chirp) | 125+ languages | Yes, with optional language hints | Per-minute; features vary | Varies by model and region |
| Deepgram (Flux) | 10 languages (Flux model) | Not documented | Per-minute; add-ons vary | Sub-300ms |
| AssemblyAI | Async: ~99 languages (Universal); streaming: ~6 languages | Multilingual via automatic language detection (async); limited multilingual streaming | Pay-as-you-go, per-hour / per-minute | ~300ms median for streaming (multilingual) |

A few notes on this table that matter for your evaluation:

  • Pricing add-ons compound quickly. Platforms that meter diarization and language detection separately can add meaningful cost at scale. Build your cost model at 100, 1,000, and 10,000 hours, factoring in the base transcription rate, per-feature charges for diarization and language detection, and any other metered add-ons, before any vendor comparison is valid.
  • Google's language detection in Chirp reportedly supports automatic detection, but the documentation notes that conditioning on specific locales improves reliability. Dynamic multilingual meetings where the language list is unknown will produce less consistent results.
  • Deepgram Flux is a specialized conversational model separate from Nova, optimized for voice agent pipelines with ultra-low latency and currently supporting 10 languages, while Nova-3 supports a broader language range across Deepgram's platform.
  • Whisper is a strong batch transcription model with the deployment control that comes from open-source, but its code-switching behavior is reportedly unpredictable and it was not designed for real-time streaming. If your meeting assistant needs live transcript lines and users frequently switch languages, Whisper's latency profile and code-switching limitations work against you.
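The compounding-cost point above is easy to quantify. The rates in this sketch are illustrative placeholders, not real vendor pricing; substitute the numbers from each vendor's pricing page:

```python
def total_hourly_cost(base_rate, addons):
    """Effective per-hour cost: base transcription rate plus metered add-ons.

    All dollar figures used below are hypothetical placeholders.
    """
    return base_rate + sum(addons.values())

addons = {"diarization": 0.10, "language_detection": 0.05}  # $/hour, hypothetical
base = 0.25                                                 # $/hour, hypothetical

for hours in (100, 1_000, 10_000):
    print(f"{hours:>6} h: ${total_hourly_cost(base, addons) * hours:,.2f}")
```

At these placeholder rates the add-ons raise the effective price 60% above the advertised base rate, which is exactly the kind of surprise a cost model built only on the headline number misses.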

Solving the code-switching problem with Gladia's Solaria-1

Solaria-1 was designed to handle the code-switching problem at the architecture level. By enabling the enable_code_switching parameter, the model automatically detects language changes within a single audio stream and switches the transcript context mid-sentence without requiring any language list from the calling application. You don't pre-declare which languages might appear.
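
As a sketch, enabling this behavior might look like the following. Only the `enable_code_switching` parameter name comes from this article; the endpoint URL, auth header, and payload shape are assumptions for illustration, so consult the Gladia API reference for the actual request format:

```python
import json
import urllib.request

# Hypothetical session-initiation request. Endpoint and auth header are
# ASSUMPTIONS -- verify both against Gladia's API reference before using.
payload = {
    "enable_code_switching": True,  # detect mid-sentence language changes
    # note: no language list is declared -- detection is automatic
}
req = urllib.request.Request(
    "https://api.gladia.io/v2/live",           # assumed endpoint
    data=json.dumps(payload).encode("utf-8"),  # POST body
    headers={
        "x-gladia-key": "YOUR_API_KEY",        # assumed auth header
        "Content-Type": "application/json",
    },
)
# response = urllib.request.urlopen(req)  # not executed here: requires a real key
```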

We document our performance methodology on the STT benchmarks page, including separation of time-to-first-byte from latency-to-final, and validation across FLEURS and Common Voice datasets rather than clean studio recordings. Solaria-1 delivers a TTFB of approximately 270ms and latency to final of approximately 698ms on standard utterances, which fits within the latency budget for live meeting assistant use cases.

Key differentiators:

  • Language coverage: We support 100+ languages, including 42 with no coverage on any other API. Coverage of languages like Tatar and Basque matters for teams whose user base includes speakers from regions that major cloud providers consistently underserve.
  • All-inclusive audio intelligence: We include speaker diarization, automatic language detection, sentiment analysis, and named entity recognition at the base rate, so your cost model at 10,000 hours is simply the hourly rate multiplied by the hours, with no feature surcharge calculations.
  • Production validation: Claap, a video meeting platform, reportedly reached 1-3% WER in production and transcribes one hour of video in under 60 seconds on Solaria-1. Attention, which handles AI sales analytics workflows, built on our platform to process meeting audio at scale with consistent accuracy across English and non-English calls.

"First of all, their S2T engine works great. We have tested it across many many languages (we work with commentators in pro sports around the world) and have found great accuracy even with custom fields such as team names, player names, etc. We have never come across any sort of hallucination." - Xavier G. on G2

Building a defensible global voice product

Language count is a proxy metric vendors use because it’s easy to publish and hard to dispute. The metrics that actually predict whether your meeting assistant performs well with a real distributed team are WER on accented audio in your target languages, code-switching behavior on intra-sentential switches, diarization accuracy on overlapping speech, and total cost with all required features enabled. Gladia also provides best-in-class asynchronous diarization with speaker separation across all supported languages, a must-have feature for meeting assistants.

The evaluation methodology in this guide gives you a framework to measure all four before committing engineering time. If a vendor won’t provide self-serve API access to run your own audio through their model, treat that as a signal about how their numbers hold up under conditions they didn’t select themselves.

The STT vendor buyer’s guide and the best STT APIs comparison give you additional context for structuring a full evaluation. You can also refer to our AI note-takers guide for a deeper look at how these capabilities apply specifically to meeting assistants.

"Their transcription quality is the best for many languages. Their support is high quality; you can even contact their CTO, etc. It made a difference for our services." - Verified User in Higher Education on G2

Test Solaria-1 on your own multilingual meeting audio with 10 free hours. All features are included with no setup fees, no add-ons, and no hidden costs. Book a demo for a personalized walkthrough of code-switching detection and multilingual diarization.

Frequently asked questions

How accurate is multilingual STT for accented speech in production?

Accuracy typically degrades from clean-data benchmarks to real-world multi-speaker production environments, and models not trained on diverse accent data show substantially higher substitution error rates. Test on Mozilla Common Voice or Google FLEURS samples for your target accent regions to get a production-realistic signal before committing.

What is the typical latency for real-time multilingual transcription?

Competitive real-time APIs deliver TTFB between 200ms and 300ms on standard utterances, with latency to final ranging from 698ms (Solaria-1) to over 1,000ms for Whisper-based systems. Sub-300ms TTFB is the practical threshold for a usable live meeting assistant, as higher latency makes transcript lines appear visibly delayed.

How does code-switching affect transcription accuracy?

Code-switching degrades accuracy on models that require a pre-specified language, producing either phonetic gibberish or silent translation.

What are the key features to evaluate in a multilingual STT solution?

Prioritize native code-switching support without requiring a language list, WER benchmarks on noisy multi-speaker audio, diarization accuracy on overlapping speech, all-inclusive pricing, and real-time latency benchmarks measured as TTFB and latency to final separately. A provider's self-serve API access policy is a useful signal: restricted access before a sales call often correlates with benchmarks that don't hold up on customer-selected audio.

How much test audio do I need for a statistically valid STT evaluation?

Use a minimum of 10,000 words (approximately one hour of audio) per test condition to achieve statistically meaningful WER results. Across five test conditions, that means roughly five hours of test audio distributed across at least 2,000 individual files.

Key terms glossary

Word Error Rate (WER): The standard accuracy metric for speech-to-text systems, calculated as (insertions + deletions + substitutions) divided by total reference words. Lower WER indicates higher accuracy.

Speaker diarization: The process of automatically identifying who spoke when in multi-speaker audio, assigning each word or segment to the correct speaker label. Diarization errors compound transcription errors in meeting contexts, and some providers price it as an add-on.

Code-switching: The practice of alternating between two or more languages within a conversation, either within a single sentence (intrasentential) or at sentence boundaries (intersentential). Most STT models without native code-switching support will either produce gibberish or silently translate the second language.

Real-Time Factor (RTFx): A throughput metric calculated as audio duration divided by processing time. An RTFx of 100 means the system processes 100 seconds of audio per second of compute time. RTFx above 1.0 indicates faster-than-real-time processing.

Hallucination: In the context of STT, text the model generates that was not present in the source audio, often triggered by silence, background noise, or code-switching scenarios where the model encounters phoneme patterns outside its training distribution.

Latency to final: The time elapsed from the end of a spoken utterance to the delivery of the complete, corrected transcript segment. Distinct from TTFB (time-to-first-byte), which measures when the first partial token appears. Both affect live meeting assistant usability.
