
Multilingual meeting transcription: language coverage, accuracy, and code-switching challenges

Published on March 25, 2026
by Ani Ghazaryan

Multilingual meeting transcription requires testing code-switching, accented speech, and diarization on real audio before committing. Standard WER benchmarks degrade 2.8 to 5.7x in production, so evaluate APIs on your own noisy meeting recordings to avoid user churn from accuracy failures.

TL;DR: The number of supported languages tells you almost nothing about how a meeting assistant performs for a distributed global team. The failure points that drive user churn are code-switching (switching languages mid-sentence), accented speech on low-bandwidth microphones, and diarization errors in multi-speaker calls. Standard WER benchmarks on clean studio datasets routinely overstate real-world accuracy: production environments with accented speech, background noise, and overlapping speakers consistently produce materially higher error rates than published figures suggest. APIs that price diarization and language detection as add-ons can cost significantly more than their advertised rate at scale. Test on your own noisy, accented, code-switched meeting audio before you commit to any vendor.

Building a meeting assistant for a distributed workforce requires more than a long list of supported languages. It requires an audio pipeline capable of navigating code-switching, heavy accents, and overlapping speech in real-time, and most standard vendor evaluations won't tell you whether your chosen API can handle any of those conditions. This guide gives you the framework to test multilingual STT performance accurately, before your users deliver the verdict through support tickets.

Two baseline metrics anchor this evaluation. WER (word error rate) measures accuracy as the sum of insertions, deletions, and substitutions divided by the total reference word count. RTFx (real-time factor) measures throughput as audio duration divided by processing time, so an RTFx of 100 means 100 seconds of audio processed per second of compute time. Both matter here, but neither means much if you're measuring them on the wrong data.
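
To make the two definitions concrete, here is a minimal sketch of both metrics in Python. Production evaluations typically use a dedicated library (for example, jiwer) that also handles text normalization, but the core calculation is just a word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (insertions + deletions + substitutions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor: audio duration / processing time."""
    return audio_seconds / processing_seconds

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25 (1 substitution / 4 words)
print(rtfx(3600, 36))                                     # 100.0
```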

Why standard WER benchmarks fail for global meeting assistants

The most commonly cited STT accuracy figures come from datasets like LibriSpeech, which is clean, read speech recorded from audiobooks. Models routinely hit 95%+ accuracy on LibriSpeech, and that number circulates freely in marketing materials. What those materials omit is the gap between that dataset and the audio your users actually produce.

Standard benchmarks exclude:

  • Overlapping speech from multiple speakers
  • Low-bandwidth microphones (typical Zoom or Teams quality after compression)
  • Background noise from offices, cafes, and street environments
  • Non-native accents and syllable-timed speech patterns

A systematic review of clinical ASR systems published in JAMIA found consistent performance degradation when moving from controlled benchmark conditions to real-world production environments. For meeting transcription specifically, the AMI Meeting Corpus is a far more honest benchmark because it captures multi-speaker dynamics and realistic room acoustics, while LibriSpeech and CallHome, two commonly cited ASR test sets, exhibit up to 10% WER difference for the same engine.

Speaker diarization compounds this problem. Diarization is the process of identifying who spoke when in multi-speaker audio, attributing each word or segment to the correct speaker label. When diarization errors stack on top of transcription errors in a six-person meeting, the output becomes unusable even at modest WER levels. A vendor's homepage WER figure measured on clean English audio tells you very little about accuracy on a Tuesday afternoon Zoom call with four non-native English speakers and someone dialling in from a train.

The technical challenges of multilingual transcription in production

Two technical challenges determine whether a multilingual STT system survives contact with real global meeting audio: accent handling and code-switching. Standard models trained primarily on native-speaker audio fail on both, often in ways that don't surface until users start submitting support tickets.

Handling accented speech and non-native speakers

Accent robustness is a baseline requirement for global meeting assistants, not an edge case. The acoustic properties of accented speech create specific failure modes that standard models aren't designed to handle.

One well-documented example: Indian English often exhibits a syllable-timed rhythm rather than the stress-timed rhythm common to most English accents, and may show less distinction between aspirated and non-aspirated sounds. Models not trained on these patterns misattribute syllable boundaries and produce substitution errors that compound across a 60-minute meeting transcript.

Mozilla Common Voice and Google FLEURS both provide diverse accent coverage and are publicly available for evaluation. A model that performs well on FLEURS across your target language regions gives you a more reliable signal than a clean-data WER score.

"Gladia delivers real-time, highly accurate transcription with minimal latency, even across multiple languages and accents. The API is straightforward and well documented, making integration into our internal tools quick and easy." - Faes W. on G2

Gladia has been recognized by users for its strong performance with accents; one Reddit user specifically highlighted how well it handles diverse pronunciations compared to standard models.

Code-switching and why most models fail

Code-switching is what happens when a speaker alternates between two languages within a conversation or within a single sentence. Linguists distinguish two main types: intrasentential switching occurs within a sentence ("Tengo que ir to the mall"), while intersentential switching occurs at sentence boundaries ("If I'm late, pues, ni modo"). Both types appear regularly in multilingual business meetings and both expose the architectural limits of most STT models.

When a model lacks native code-switching support, it fails predictably. The model locks onto the first detected language and attempts to transcribe the second language using the phonetic vocabulary of the first, producing phonetic gibberish. The second failure mode is unintended translation: the model silently converts the second language into the first rather than preserving the original.

This pattern appears consistently in Whisper user reports. GitHub discussions indicate that enabling the transcribe task has no effect when multiple languages are present, with output defaulting to translation regardless of intent. A separate thread reports the same issue: when French and English are both spoken, the model translates the entire audio into French. The Whisper repository notes that the model alternates between transcription and translation tasks depending on training, which makes code-switching behavior fundamentally unpredictable.

For a meeting assistant, this means a bilingual sales call with a Latin American prospect can produce a hallucinated English transcript with no indication that the source audio contained Spanish.

How to evaluate multilingual STT performance: A checklist for product leaders

Real-time vs. asynchronous processing trade-offs

Live meeting assistants require final transcript delivery in under 300ms for real-time note display and voice agent response. Asynchronous (batch) processing is more affordable and optimal for post-meeting summaries, action items, and other use cases where users don’t mind waiting, allowing higher accuracy since the model has access to more audio context.

Real-time models operate on short context windows to meet latency requirements, which reduces their ability to resolve ambiguous phonemes or speaker attribution. Async models process the full recording before returning results, which is why batch WER is typically lower than streaming WER for the same model and the same audio.

If you're building a live assistant, test time-to-first-byte (TTFB) separately from latency to final. Both affect the user experience in live contexts, and conflating them will produce misleading vendor comparisons.
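
A simple harness can log the two measurements separately from the same event stream. The sketch below assumes a generic streaming client that yields `("partial" | "final", text)` events; the fake stream stands in for whatever SDK or WebSocket connection your vendor provides:

```python
import time

def measure_latency(event_stream):
    """Log TTFB and latency-to-final separately for one utterance.

    `event_stream` is any iterable yielding ("partial" | "final", text)
    events from a streaming STT connection.
    """
    start = time.monotonic()
    ttfb = None
    final_latency = None
    for kind, _text in event_stream:
        elapsed = time.monotonic() - start
        if ttfb is None:
            ttfb = elapsed            # first partial token arrives
        if kind == "final":
            final_latency = elapsed   # complete, corrected segment arrives
    return ttfb, final_latency

# Simulated stream: two partial results, then the final transcript segment.
def fake_stream():
    for kind, delay in [("partial", 0.05), ("partial", 0.05), ("final", 0.1)]:
        time.sleep(delay)
        yield kind, "..."

ttfb, final = measure_latency(fake_stream())
print(f"TTFB: {ttfb*1000:.0f}ms, latency to final: {final*1000:.0f}ms")
```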

Testing methodology: Datasets and noise conditions

Vendor-provided sample audio only tells you how the API performs on vendor-selected audio. Test on your own production data, or data that closely matches it, to get the signal your users will actually experience.

A structured evaluation should cover five distinct audio conditions:

  1. Accented speech test: Source audio from speakers with the accents most common in your user base. Test WER per accent group separately using Mozilla Common Voice samples for your target languages.
  2. Code-switching test: Use clips where speakers switch languages mid-sentence, not just between sentences. Verify the transcript preserves the original language rather than translating.
  3. Multi-speaker crosstalk test: Use clips with 2-3 seconds of overlapping speech to check diarization accuracy and transcription quality during interruptions.
  4. Low-bandwidth audio test: Downsample clean audio to 8kHz (standard VoIP quality) and re-run transcription. This simulates typical Zoom or Teams audio after compression.
  5. Background noise test: Mix clean audio with office or cafe background noise at varied signal-to-noise ratios. The CHiME Challenge benchmarks are the standard reference for noisy real-world conditions.
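
Condition 5 above requires mixing noise at a controlled signal-to-noise ratio. A minimal sketch of the scaling math, using synthetic sample lists for self-containment (a real harness would read and write WAV files instead):

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then mix.

    `clean` and `noise` are equal-length lists of float samples.
    """
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    # SNR(dB) = 10 * log10(p_clean / (scale^2 * p_noise))  =>  solve for scale
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]

random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]  # 1s, 440Hz tone
noise = [random.uniform(-1, 1) for _ in range(16000)]                    # white noise

noisy_10db = mix_at_snr(clean, noise, snr_db=10)  # moderate office noise
noisy_0db = mix_at_snr(clean, noise, snr_db=0)    # noise as loud as the speech
```

Running each provider on the same clips at several SNR levels (e.g. 20, 10, and 0 dB) shows you where accuracy starts to collapse, which matters more than any single WER figure.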

Minimum test volume: A standard evaluation uses at least 10,000 words (roughly one hour of audio) per condition. Test each provider with identical audio through an identical harness, logging TTFB and latency to final separately. Any provider that won't give you a self-serve API key to run your own evaluation before a sales call is telling you something about how they expect their numbers to hold up.

Comparing top multilingual STT providers for meeting intelligence

The comparison below is a directional framework focused on the variables that most directly affect production performance for meeting assistants, not an exhaustive or independently verified ranking. Vendor capabilities evolve quickly, so treat it as a starting point for your own evaluation rather than a definitive assessment.

| Provider | Language coverage | Native code-switching | Pricing model | Real-time latency |
| --- | --- | --- | --- | --- |
| Gladia (Solaria-1) | 100+ languages, 42 exclusive | Yes, parameter-enabled | All-inclusive, per-second | ~270ms TTFB, ~698ms to final |
| OpenAI Whisper | ~100 languages | Multilingual transcription supported; code-switching behavior varies with configuration | API pricing varies; open-source model can be self-hosted | Not optimized for streaming; varies by hardware and chunk size |
| Google Cloud STT (Chirp) | 125+ languages | Yes, with optional language hints | Per-minute; features vary | Varies by model and region |
| Deepgram (Flux) | 10 languages (Flux model) | Not documented | Per-minute; add-ons vary | Sub-300ms |
| AssemblyAI | Async: ~99 languages (Universal); streaming: ~6 languages | Multilingual via automatic language detection (async); limited multilingual streaming | Pay-as-you-go, per-hour / per-minute | ~300ms median for streaming (multilingual) |

A few notes on this table that matter for your evaluation:

  • Pricing add-ons compound quickly. Platforms that meter diarization and language detection separately can add meaningful cost at scale. Build your cost model at 100, 1,000, and 10,000 hours, factoring in the base transcription rate, per-feature charges for diarization and language detection, and any other metered add-ons, before any vendor comparison is valid.
  • Google's language detection in Chirp reportedly supports automatic detection, but the documentation notes that conditioning on specific locales improves reliability. Dynamic multilingual meetings where the language list is unknown will produce less consistent results.
  • Deepgram Flux is a specialized conversational model separate from Nova, optimized for voice agent pipelines with ultra-low latency and currently supporting 10 languages, while Nova-3 supports a broader language range across Deepgram's platform.
  • Whisper is a strong batch transcription model with the deployment control that comes from open-source, but its code-switching behavior is reportedly unpredictable and it was not designed for real-time streaming. If your meeting assistant needs live transcript lines and users frequently switch languages, Whisper's latency profile and code-switching limitations work against you.
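The compounding-cost point above is easy to quantify. The rates in this sketch are illustrative placeholders, not real vendor pricing; substitute the numbers from each vendor's pricing page:

```python
def total_hourly_cost(base_rate, addons):
    """Effective per-hour cost: base transcription rate plus metered add-ons.

    All dollar figures used below are hypothetical placeholders.
    """
    return base_rate + sum(addons.values())

addons = {"diarization": 0.10, "language_detection": 0.05}  # $/hour, hypothetical
base = 0.25                                                 # $/hour, hypothetical

for hours in (100, 1_000, 10_000):
    print(f"{hours:>6} h: ${total_hourly_cost(base, addons) * hours:,.2f}")
```

At these placeholder rates the add-ons raise the effective price 60% above the advertised base rate, which is exactly the kind of surprise a cost model built only on the headline number misses.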

Solving the code-switching problem with Gladia's Solaria-1

Solaria-1 was designed to handle the code-switching problem at the architecture level. By enabling the enable_code_switching parameter, the model automatically detects language changes within a single audio stream and switches the transcript context mid-sentence without requiring any language list from the calling application. You don't pre-declare which languages might appear.
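
As a sketch, enabling this behavior might look like the following. Only the `enable_code_switching` parameter name comes from this article; the endpoint URL, auth header, and payload shape are assumptions for illustration, so consult the Gladia API reference for the actual request format:

```python
import json
import urllib.request

# Hypothetical session-initiation request. Endpoint and auth header are
# ASSUMPTIONS -- verify both against Gladia's API reference before using.
payload = {
    "enable_code_switching": True,  # detect mid-sentence language changes
    # note: no language list is declared -- detection is automatic
}
req = urllib.request.Request(
    "https://api.gladia.io/v2/live",           # assumed endpoint
    data=json.dumps(payload).encode("utf-8"),  # POST body
    headers={
        "x-gladia-key": "YOUR_API_KEY",        # assumed auth header
        "Content-Type": "application/json",
    },
)
# response = urllib.request.urlopen(req)  # not executed here: requires a real key
```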

We document our performance methodology on the STT benchmarks page, including separation of time-to-first-byte from latency-to-final, and validation across FLEURS and Common Voice datasets rather than clean studio recordings. Solaria-1 delivers a TTFB of approximately 270ms and latency to final of approximately 698ms on standard utterances, which fits within the latency budget for live meeting assistant use cases.

Key differentiators:

  • Language coverage: We support 100+ languages, including 42 with no coverage on any other API. Coverage of languages like Tatar and Basque matters for teams whose user base includes speakers from regions that major cloud providers consistently underserve.
  • All-inclusive audio intelligence: We include speaker diarization, automatic language detection, sentiment analysis, and named entity recognition at the base rate, so your cost model at 10,000 hours is simply the hourly rate multiplied by the hours, with no feature surcharge calculations.
  • Production validation: Claap, a video meeting platform, reportedly reached 1-3% WER in production and transcribes one hour of video in under 60 seconds on Solaria-1. Attention, which handles AI sales analytics workflows, built on our platform to process meeting audio at scale with consistent accuracy across English and non-English calls.

"First of all, their S2T engine works great. We have tested it across many many languages (we work with commentators in pro sports around the world) and have found great accuracy even with custom fields such as team names, player names, etc. We have never come across any sort of hallucination." - Xavier G. on G2

Building a defensible global voice product

Language count is a proxy metric vendors use because it’s easy to publish and hard to dispute. The metrics that actually predict whether your meeting assistant performs well with a real distributed team are WER on accented audio in your target languages, code-switching behavior on intra-sentential switches, diarization accuracy on overlapping speech, and total cost with all required features enabled. Gladia also provides best-in-class asynchronous diarization with speaker separation across all supported languages, a must-have feature for meeting assistants.

The evaluation methodology in this guide gives you a framework to measure all four before committing engineering time. If a vendor won’t provide self-serve API access to run your own audio through their model, treat that as a signal about how their numbers hold up under conditions they didn’t select themselves.

The STT vendor buyer’s guide and the best STT APIs comparison give you additional context for structuring a full evaluation. You can also refer to our AI note-takers guide for a deeper look at how these capabilities apply specifically to meeting assistants.

"Their transcription quality is the best for many languages. Their support is high quality; you can even contact their CTO, etc. It made a difference for our services." - Verified User in Higher Education on G2

Test Solaria-1 on your own multilingual meeting audio with 10 free hours. All features are included with no setup fees, no add-ons, and no hidden costs. Book a demo for a personalized walkthrough of code-switching detection and multilingual diarization.

Frequently asked questions

How accurate is multilingual STT for accented speech in production?

Accuracy typically degrades from clean-data benchmarks to real-world multi-speaker production environments, and models not trained on diverse accent data show substantially higher substitution error rates. Test on Mozilla Common Voice or Google FLEURS samples for your target accent regions to get a production-realistic signal before committing.

What is the typical latency for real-time multilingual transcription?

Competitive real-time APIs deliver TTFB between 200ms and 300ms on standard utterances, with latency to final ranging from 698ms (Solaria-1) to over 1,000ms for Whisper-based systems. Sub-300ms TTFB is the practical threshold for a usable live meeting assistant, as higher latency makes transcript lines appear visibly delayed.

How does code-switching affect transcription accuracy?

Code-switching degrades accuracy on models that require a pre-specified language, producing either phonetic gibberish or silent translation.

What are the key features to evaluate in a multilingual STT solution?

Prioritize native code-switching support without requiring a language list, WER benchmarks on noisy multi-speaker audio, diarization accuracy on overlapping speech, all-inclusive pricing, and real-time latency benchmarks measured as TTFB and latency to final separately. A provider's self-serve API access policy is a useful signal: restricted access before a sales call often correlates with benchmarks that don't hold up on customer-selected audio.

How much test audio do I need for a statistically valid STT evaluation?

Use a minimum of 10,000 words (approximately one hour of audio) per test condition to achieve statistically meaningful WER results. Across five test conditions, that means roughly five hours of test audio distributed across at least 2,000 individual files.

Key terms glossary

Word Error Rate (WER): The standard accuracy metric for speech-to-text systems, calculated as (insertions + deletions + substitutions) divided by total reference words. Lower WER indicates higher accuracy.

Speaker diarization: The process of automatically identifying who spoke when in multi-speaker audio, assigning each word or segment to the correct speaker label. Diarization errors compound transcription errors in meeting contexts, and some providers price it as an add-on.

Code-switching: The practice of alternating between two or more languages within a conversation, either within a single sentence (intrasentential) or at sentence boundaries (intersentential). Most STT models without native code-switching support will either produce gibberish or silently translate the second language.

Real-Time Factor (RTFx): A throughput metric calculated as audio duration divided by processing time. An RTFx of 100 means the system processes 100 seconds of audio per second of compute time. RTFx above 1.0 indicates faster-than-real-time processing.

Hallucination: In the context of STT, text the model generates that was not present in the source audio, often triggered by silence, background noise, or code-switching scenarios where the model encounters phoneme patterns outside its training distribution.

Latency to final: The time elapsed from the end of a spoken utterance to the delivery of the complete, corrected transcript segment. Distinct from TTFB (time-to-first-byte), which measures when the first partial token appears. Both affect live meeting assistant usability.
