TL;DR:
- Most STT systems are benchmarked on clean English audio, but production users switch languages mid-sentence, speak with regional accents, and operate in noisy environments.
- Reliable code-switching requires a multilingual ASR architecture that detects language per utterance and maintains context across switches.
- For meeting assistants and CCaaS platforms, async (batch) transcription processes the full audio before output is finalized, which produces better accuracy on code-switched speech than real-time.
Most engineering teams evaluate speech-to-text models on clean English datasets, only to watch their word error rates climb when real users switch languages mid-sentence. The problem is not that multilingual support is missing from vendor documentation. A vendor claiming "100-language support" means nothing if the model drops context the moment a bilingual speaker switches languages in a single breath. When the transcript layer fails, every downstream system fails with it: CRM entries get corrupted, coaching scores become unreliable, and LLM summaries lose the decisions that mattered.
This guide breaks down the technical mechanics of code-switching, how to measure accuracy across languages, and how to build an STT pipeline that holds up in production.
Code-switching: the STT accuracy blocker
Code-switching is the linguistic phenomenon where speakers alternate between two or more languages within a conversation. For your STT pipeline, it is one of the most common causes of silent accuracy degradation because single-language models struggle to handle it.
How code-switching appears in live STT data
You will encounter code-switching in two main forms in real audio, and both break monolingual ASR pipelines in different ways.
Intra-sentential switching happens within a single sentence, where a speaker starts in one language and finishes in another while adhering to both grammars. The classic example: "La onda is to fight y jambar" (Spanish-English). This is typically the harder case for ASR because the model must handle phonemes that belong to different phonological systems within a continuous utterance.
Inter-sentential switching occurs at sentence boundaries, where a speaker completes a thought in one language and opens the next in another, like "Ani wideili. What happened?" (Assyrian-English). Single-language models treat the second utterance as noise or produce phonetic guesses mapped to the wrong vocabulary.
In production CCaaS environments, bilingual agents and callers switch between languages constantly, often within a single breath rather than at a clean sentence boundary. Research on bilingual speech corpora documents how this switching pattern can systematically degrade monolingual ASR accuracy.
Technical failure modes in single-language models
Monolingual ASR typically produces higher WER on code-switched data compared to clean monolingual audio, with failures clustering into three patterns.
Vocabulary breakdown occurs when characters or phonemes from the second language fall outside the model's token vocabulary, producing [UNK] (unknown) tokens or dropped words.
Hallucination occurs when phonemes from the second language get mapped onto the primary language's vocabulary, producing plausible-looking but wrong text.
Silent omission is a failure mode that can surface in support tickets from non-English user segments before your accuracy metrics catch it: the switched segment simply disappears from the transcript.
These patterns affect specific user segments consistently rather than appearing as random noise in aggregate WER.
Quantifying code-switching accuracy gaps
Two metrics matter most when evaluating multilingual STT for downstream LLM tasks.
Word Error Rate (WER) is the conventional metric, calculated as (S+D+I)/N, where S is substitutions, D is deletions, I is insertions, and N is the total reference word count. Our WER explainer covers the full calculation methodology. WER penalizes minor errors (a missing hyphen, an alternative spelling) at the same weight as a wrong entity name that corrupts a CRM record.
Match Error Rate (MER) and Word Information Lost (WIL) are complementary ASR evaluation metrics that normalize differently from WER. MER adjusts the denominator to account for insertions and gives a cleaner view of what proportion of all word matches are errors; WIL estimates how much word-level information is lost between reference and hypothesis. For teams feeding transcripts into LLM pipelines for summarization, entity extraction, or coaching scores, this distinction matters: a vendor who quotes 5% WER on a clean English benchmark may produce substantially higher MER on a real CCaaS recording with two bilingual agents, background noise, and domain-specific vocabulary.
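To make the normalization difference concrete, here is a toy calculation with assumed error counts (not drawn from any benchmark): a six-word reference scored against a hypothesis containing one substitution, one deletion, and one insertion.

```python
# Toy example with assumed counts: 6-word reference, hypothesis with
# 1 substitution, 1 deletion, and 1 insertion.
S, D, I, N = 1, 1, 1, 6        # substitutions, deletions, insertions, reference words
C = N - S - D                  # correctly matched words = 4

wer = (S + D + I) / N                # 3/6 = 0.50
mer = (S + D + I) / (S + D + I + C)  # 3/7 ~= 0.43; insertions also grow the denominator
print(f"WER = {wer:.1%}, MER = {mer:.1%}")  # WER = 50.0%, MER = 42.9%
```

Because MER counts insertions in its denominator, it stays bounded at 100% even on heavily hallucinated output, whereas WER can exceed 100%.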
How multilingual speech-to-text models handle code-switching
Your model's underlying architecture determines whether it handles code-switching gracefully or falls apart.
Per-utterance language identification
Accurate language identification must run at the utterance level, not just the file level. File-level detection identifies the dominant language, while utterance-level detection tells you what language each segment is in, which is the only granularity that matters when speakers switch every few sentences. Shorter segments can give the model less acoustic evidence, potentially increasing misclassification at boundaries. You configure automatic language detection via the API without specifying a language ahead of time.
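As a sketch of that configuration, an async request that simply omits the languages list falls back to automatic per-utterance detection. The language_config shape here mirrors the code-switching example in the next section; verify field names against the current API reference.

```json
{
  "audio_url": "https://your-audio-source.com/call.wav",
  "language_config": {
    "code_switching": true
  }
}
```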
How models detect in-sentence switches
Enabling code-switching in the Gladia API is a single parameter in your transcription request:
```json
{
  "audio_url": "https://your-audio-source.com/call.wav",
  "language_config": {
    "languages": ["en", "es", "fr"],
    "code_switching": true
  }
}
```
The response tags each utterance with its detected language and timestamps, so your downstream pipeline receives structured, language-labeled data rather than a flat, potentially garbled transcript. Constraining to a known language pair (for example, ["en", "fr"] for a European CCaaS platform) can narrow the model's hypothesis space, which may reduce boundary misclassification and improve accuracy on accented speech.
In async mode, the model processes the full recording context before producing output. That full-context pass helps resolve ambiguous code-switch boundaries and is also why diarization is available only in async mode. Our meeting assistant architecture guide covers the full pipeline design.
Ensuring STT reliability in real-world use
Evaluate accuracy on your proprietary data
Vendor benchmarks are a starting point, not a decision. The only test that matters for your pipeline is your own audio under your production conditions. Here is a reproducible five-step methodology:
1. Assemble a representative test set. Pull several hours of real audio including noisy calls, multi-speaker meetings, and bilingual conversations with manually annotated ground truth.
2. Simulate production conditions. Include background noise, overlapping speakers, and domain-specific vocabulary. Clean studio audio tells you nothing useful about real-world performance.
3. Run the set through each provider's production endpoint with default settings and the same normalization pipeline (such as a standardized text normalizer) so results are directly comparable.
4. Calculate WER and MER using a library like JiWER (a Python library for evaluating ASR systems). WER shows raw error volume; MER shows which errors affect meaning downstream. A sketch of steps 3 and 4 follows this list.
5. Stress-test on your lowest-resource language. The gap between providers on high-resource languages is small. The gap on Tagalog, Bengali, or Tamil is where the real evaluation happens.
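A minimal sketch of steps 3 and 4, assuming JiWER 3.x. The provider names and two-line transcripts are placeholders; substitute your annotated ground truth and each provider's raw output.

```python
import jiwer

# One shared normalization pipeline so every provider is scored identically.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.ReduceToListOfListOfWords(),  # JiWER expects word-level output
])

# Placeholder data: one reference utterance and two providers' hypotheses.
references = ["we need to finalize the contract terms"]
hypotheses_by_provider = {
    "provider_a": ["we need finalize the contract term"],
    "provider_b": ["we need to finalize the contract terms"],
}

for provider, hypotheses in hypotheses_by_provider.items():
    out = jiwer.process_words(
        references,
        hypotheses,
        reference_transform=normalize,
        hypothesis_transform=normalize,
    )
    print(f"{provider}: WER={out.wer:.3f} MER={out.mer:.3f} WIL={out.wil:.3f}")
```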
Opaque benchmarks are an industry-wide problem: most STT vendors publish selective accuracy claims without releasing test data, evaluation code, or reproducible methodology, leaving engineering teams to guess whether a quoted WER will hold up on their audio. Gladia built an open benchmark framework to solve this problem, evaluating Solaria-1 against 8 providers across 7 datasets and 74+ hours of audio with fully reproducible methodology. The entire framework is open-sourced, so any team can reproduce the published results, run the same evaluation against their own audio, or extend the methodology to test additional providers. Reproducibility is the only way to make accuracy claims verifiable rather than marketing.
Language coverage and dialect handling
Not all "100 language support" claims are equivalent. High-resource languages like English, Spanish, French, and German have large publicly available training corpora and typically achieve lower WER on standard benchmarks. Low-resource languages with limited transcribed training data can produce substantially higher WER on models that list them nominally, because the gap between a language being "supported" and actually trained well is significant.
Solaria-1 covers 100+ supported languages including 42 not supported by many other API-level STT providers: Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, Marathi, Haitian Creole, Maori, and Javanese. For CCaaS platforms with BPO operations in the Philippines, Bangladesh, or Indonesia, that breadth is the difference between a working transcript and a manual fallback.
Accent and dialect handling is a separate challenge from language support. A model that handles standard British English well may still produce elevated WER on strong Scottish or Indian English accents, because accent variation is a function of training data distribution, not the language label. Teams building for global markets report differences in production:
"Superior accuracy on accented speech compared to competitors... Clean API, easy to integrate and deploy to production." - Yassine R. on G2
Implementing multilingual STT: API integration patterns
Code-switching STT: real-time vs. async APIs
Choosing between real-time and async is not primarily about latency preference. It is about what your use case requires and what accuracy trade-off you can tolerate.
Meeting assistants, post-call analysis, and CCaaS analytics pipelines architecturally require async transcription. Solaria-1 processes 1 hour of audio in approximately 60 seconds in async mode, giving the model full recording context before output is finalized. This produces better code-switching accuracy, better diarization, and better entity recognition. Our multilingual meeting transcription guide covers this architecture in detail.
Real-time transcription (~300ms final transcript latency) suits use cases where immediate output is architecturally required: voice agents, live captions, and live-assist workflows. Code-switching detection in real-time mode operates on shorter audio windows, which reduces the acoustic evidence available for language boundary classification and increases misclassification rates compared to batch processing. Post-call analysis and post-meeting transcription workflows have no latency requirement that justifies this trade-off: the accuracy advantage of full-context async processing on multilingual and code-switched audio is consistent across providers, and the latency cost is irrelevant when transcripts and summaries are generated after the interaction ends.
Speaker ID in multilingual conversations
Multilingual audio makes diarization harder because speaker boundaries and language boundaries can coincide, and models that confuse the two produce garbled speaker labels. Gladia's speaker diarization is powered by pyannoteAI's industry-leading Precision-2 model and runs in async mode only, because accurate speaker attribution benefits from full audio context.
The async diarization output returns speaker-labeled utterances with word-level timestamps:
```json
{
  "utterances": [
    {
      "speaker": 0,
      "language": "en",
      "text": "We need to finalize the contract terms.",
      "start": 0.0,
      "end": 3.2
    },
    {
      "speaker": 1,
      "language": "fr",
      "text": "Oui, je suis d'accord avec les conditions.",
      "start": 3.5,
      "end": 6.8
    }
  ]
}
```
Audio-to-LLM pipeline for production deployment
Structured, diarized output with language labels per utterance is what makes multilingual lead enrichment and contact center analytics work reliably. When Aircall, a cloud-based phone and communication platform, integrated Gladia across their CCaaS platform, they cut transcription time by 95% (from 30 minutes to 1.5 minutes per call) and now process 1M+ calls per week, with STT powering search, AI summaries, sentiment analysis, agent coaching, and CRM webhooks from a single API integration.
The Audio-to-LLM pipeline takes the structured transcript output and passes it directly into an LLM of your choice for downstream tasks like summaries, action items, entity extraction, and sentiment scoring. The transcript is already speaker-labeled and language-tagged from the steps above. Sentiment analysis in Gladia works from the words in the transcript, not from tone of voice or audio signal, so the output reflects what was said rather than how someone sounded. You can connect your own LLM (GPT-4, Claude, or any other model) or use a built-in option, with no requirement to use a specific model and no lock-in on the LLM side.
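As a minimal sketch of that handoff (field names come from the diarization example above; the LLM call itself is omitted), you might flatten the structured output into speaker- and language-tagged prompt context like this:

```python
# Flatten diarized, language-tagged utterances into prompt context for an LLM.
# `response` mirrors the JSON structure shown in the diarization example above.
response = {
    "utterances": [
        {"speaker": 0, "language": "en",
         "text": "We need to finalize the contract terms.", "start": 0.0, "end": 3.2},
        {"speaker": 1, "language": "fr",
         "text": "Oui, je suis d'accord avec les conditions.", "start": 3.5, "end": 6.8},
    ]
}

lines = [
    f"[{u['start']:.1f}s] Speaker {u['speaker']} ({u['language']}): {u['text']}"
    for u in response["utterances"]
]
prompt_context = "\n".join(lines)
print(prompt_context)  # ready to prepend to a summarization or extraction prompt
```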
Build vs. buy: self-hosted multilingual models vs. managed API
Infrastructure overhead for multilingual STT
Teams often frame the build-vs-buy decision for multilingual STT as cost vs. control. The real trade-off is engineering time vs. vendor dependency, and the answer depends on whether multilingual transcription is a differentiator or a commodity for your product.
Self-hosting a multilingual model introduces a specific class of infrastructure toil that grows with language coverage: GPU provisioning for inference at scale, model version management across language updates, file size and duration limits, and the DevOps capacity to monitor and remediate instability. Claap, by contrast, achieved 1-3% WER in production using Gladia's managed API. The gap between self-hosted and managed setups is not primarily a model quality gap; it is a maintenance and optimization gap.
Build vs. buy cost analysis
Model your costs at representative volumes before committing to either path. Self-hosted estimates below include GPU compute plus DevOps overhead for version management, scaling incidents, and monitoring.
| Volume | Self-hosted reality | Gladia Growth plan (from $0.20/hr, floor rate) |
| --- | --- | --- |
| 1,000 hrs/month | Single GPU instance plus part-time DevOps to cover model updates, scaling, and incident response | $200 async, $250 real-time |
| 10,000 hrs/month | Multi-GPU fleet with autoscaling, dedicated DevOps ownership, and ongoing model version management | $2,000 async, $2,500 real-time |
The Gladia column reflects the Growth floor rate (from $0.20/hr), which is unlocked by committing to a minimum volume. The rate you pay scales down as your committed volume increases. At any volume, you also avoid the sprint capacity cost of maintaining inference infrastructure, and teams moving off self-hosted Whisper typically save 20%+ in DevOps effort.
Gladia vs. other managed STT APIs
The comparison below covers the four providers most likely to appear in a multilingual production evaluation.
| Provider | Languages and code-switching | Pricing model | Features included |
| --- | --- | --- | --- |
| Gladia Solaria-1 | 100+ languages, 42 unique, true code-switching | $0.61/hr Starter, from $0.20/hr Growth async | Diarization, NER, sentiment, translation, code-switching all at base rate |
| OpenAI Whisper API | 99+ languages, no code-switching support, English translation only | $0.006/min ($0.36/hr) | No diarization, 25MB file cap |
| Deepgram Nova-3 | 45+ languages, code-switching supported across 10 languages (EN, ES, FR, DE, HI, RU, PT, JA, IT, NL) | $0.258/hr base per Deepgram's public pricing, with add-ons billed separately | Smart formatting included; diarization billed separately as add-on |
| AssemblyAI Universal-3 Pro | Multilingual support; code-switching available in Universal-3 Pro Streaming (real-time), not Universal-2 | Per-second billing at hourly rate (~$0.21/hr) | Audio intelligence add-ons billed separately |
On the compliance side: Gladia holds SOC 2 Type II, ISO 27001, HIPAA, GDPR, and PCI certifications, with full documentation available at the Gladia compliance hub. On Growth and Enterprise plans, customer audio is never used to retrain models, with no opt-out action required. On the Starter plan, customer data can be used for model training by default. PII redaction is available as an optional feature and must be explicitly enabled in your API configuration. It is not active by default.
Clarifying multilingual STT: key considerations
Optimizing WER for code-switched audio
Three configuration choices consistently improve code-switching accuracy in production. First, pass the languages parameter with your expected language pair rather than running open-ended detection: narrowing the hypothesis space can reduce misclassification at code-switch boundaries and improve accuracy. Second, use async for all post-meeting and post-call workflows: full-context processing consistently improves accuracy, diarization, and code-switching resolution, and the latency cost does not affect user experience in post-call pipelines. Third, add custom vocabulary for domain-specific terms. Product names, technical jargon, and proper nouns that appear in your audio but not in general training data are a common source of elevated WER in specialized domains, and custom vocabulary is included in the Starter and Growth base rate.
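Putting the three together, a hypothetical async request might look like the sketch below. The language_config fields follow the earlier code-switching example; custom_vocabulary is an assumed field shape here, so check it against the current API reference before relying on it.

```json
{
  "audio_url": "https://your-audio-source.com/call.wav",
  "language_config": {
    "languages": ["en", "es"],
    "code_switching": true
  },
  "custom_vocabulary": ["Solaria", "CCaaS", "pyannoteAI"]
}
```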
System language detection limits
Current multilingual STT has honest limits that matter for production planning. Highly specialized jargon in low-resource languages is the hardest case: a model that handles everyday Tagalog well may still produce higher WER on Tagalog medical terminology or industry-specific acronyms, because that vocabulary appears rarely in training corpora. The practical mitigation is custom vocabulary configuration and domain-specific testing before launch. Test your lowest-resource language with your actual domain vocabulary, not generic audio.
Restricting detection to specific language pairs
Restricting code-switching detection to a known language pair is the right default for most production deployments. The languages parameter in both async and real-time configurations restricts code-switching detection to your specified set. A request with "languages": ["en", "es"] tells the model to switch between English and Spanish only, which improves accuracy and reduces the chance of misclassifying an accented English word as a third language. For async requests, pass code_switching: true alongside the language list to enable continuous per-utterance detection within your specified pair. The code-switching documentation shows the full parameter schema and example responses for both modes.
Start with 10 free hours and have your integration in production in less than a day. Test Solaria-1 on your own multilingual audio to see how it handles language detection, accent-heavy speech, and code-switching before committing to a vendor decision.
FAQs
What is the most accurate STT API for non-English languages?
Solaria-1 covers 100+ languages with average 29% lower WER on conversational speech than alternatives, benchmarked across 8 providers, 7 datasets, and 74+ hours of audio. For low-resource languages including Tagalog, Bengali, Tamil, and Javanese, Gladia covers 42 languages not supported by many other API-level STT providers.
How does code-switching affect transcription costs?
It depends on whether your provider charges for code-switching detection as a separate feature. On Gladia's Starter and Growth plans, code-switching detection is included in the base per-hour rate with no add-on fees, so at 10,000 hours/month on Growth, the all-in cost starts from $2,000/month at the Growth floor rate (subject to commitment tier) versus providers who bill diarization and enrichment features separately on top of the base transcription rate.
Does async or real-time transcription handle code-switching better?
Async (batch) produces consistently better code-switching accuracy because the model processes the full audio context before committing to output. Real-time mode at ~300ms latency supports code-switching but works on shorter audio windows, which increases language boundary misclassification. For meeting assistants and CCaaS post-call analysis, async is the right default.
What is Gladia's data policy for multilingual audio on paid plans?
On Growth and Enterprise plans, customer audio is never used to retrain Solaria-1, with no opt-out action required. On the Starter plan, customer data can be used for model training by default.
Key terms glossary
Word Error Rate (WER): The standard ASR evaluation metric, which counts substitutions, deletions, and insertions relative to the total reference word count. A lower WER indicates fewer transcription errors, but WER does not weight errors by their downstream impact on meaning.
Match Error Rate (MER): A complementary metric to WER that adjusts for insertions in the denominator. MER gives a cleaner view of what proportion of all word matches are errors and is more sensitive than WER to failures that affect downstream LLM tasks.
Code-switching: The practice of alternating between two or more languages within a conversation, either within a single sentence (intra-sentential) or at sentence boundaries (inter-sentential). For STT systems, code-switching is one of the primary causes of accuracy degradation in production.
Diarization: The process of segmenting audio by speaker identity, labeling who said what and when. In Gladia, diarization is powered by pyannoteAI Precision-2 and is available in async mode only.
Low-resource language: A language with limited publicly available transcribed training data, resulting in higher WER on ASR models compared to high-resource languages. Examples include Javanese, Haitian Creole, and Maori.
Async transcription: A batch transcription workflow where the full audio file is processed before output is finalized. Async mode enables full-context language detection, higher diarization accuracy, and better code-switching resolution compared to real-time streaming.