TL;DR: Broad language coverage in an ASR model does not mean it can handle mid-sentence language changes. Code-switching breaks standard monolingual models, causing significant WER degradation relative to single-language baselines on well-studied pairs like Hindi-English, with even higher failure rates on lower-resource combinations. We built Solaria-1 to natively detect and process code-switching across 100+ languages without requiring developers to pre-specify languages in the request. Pricing is public and usage-based, with audio intelligence features included on paid plans rather than sold as per-feature add-ons.
Your QA pipeline passed on English audio. The transcripts fall apart when your bilingual users mix languages mid-sentence, which is exactly how they communicate.
When your users switch from Hindi to English mid-sentence, or from French to English mid-meeting, the transcript often breaks. This article maps which language pairs support true code-switching, why low-resource pairs remain challenging, and how to evaluate ASR systems for real-world multilingual environments.
Defining code-switching for ASR systems
Code-switching: definition and use cases
Code-switching is the practice of alternating between two or more languages within a single conversation. It breaks down into two patterns that matter for ASR design.
Intrasentential code-switching is a language change within a single sentence or clause, following grammatical rules from both languages simultaneously. A contact center example: "Can we review the Q3 forecast, s'il vous plaît?" or "Yaar, can you just reschedule the meeting to 4 baje?" The speaker uses one language's syntax while inserting vocabulary from another.
Intersentential code-switching is a language change at sentence boundaries, where each complete utterance is in one language but the conversation alternates. A team meeting example: "The latest build failed. Kya hua?" Each sentence is internally consistent, but the language flips between them.
Both patterns are normal in bilingual communities and common in the product contexts where ASR matters most: contact centers serving bilingual populations, meeting assistants used by international teams, and voice interfaces embedded in global consumer apps. Gladia's code-switching documentation describes this specifically as scenarios where "the language is changed multiple times throughout the audio," such as a conversation where two people each speak a different language.
Monolingual ASR's WER challenge
Standard ASR models fail at code-switching for a straightforward reason: they were trained on monolingual corpora and optimized for clean, single-language audio. When a speaker switches languages mid-sentence, the model's acoustic and language models are simultaneously operating outside their training distribution.
The WER impact is measurable. For Hindi-English code-switching, research documents substantial WER degradation compared to monolingual baselines. Even for structurally related Romance language pairs, the degradation is severe: code-switching evaluations show baseline models often exceeding 50% WER, with significant error rates persisting even after fine-tuning on code-switched test sets. These are not edge-case failures on unusual audio. They reflect what happens to most deployed models when real bilingual speakers use your product.
Multilingual and code-switching scenarios push models to switch between languages mid-sentence, something rarely reflected in standard benchmarks. A model evaluated only on monolingual test sets can claim accurate multilingual support while failing completely in production.
Implementing code-switching ASR: requirements
Building an ASR system that handles code-switching reliably requires training on audio containing actual language mixing, not separate monolingual datasets that happen to cover the same language pair. The deeper constraint is annotated training corpora of real code-switched speech. Code-switched speech tends to be underrepresented in many ASR training pipelines, making it a challenging scenario for models to handle effectively.
A model that has never been trained on Hinglish will not learn to handle it just because it has seen a lot of Hindi and a lot of English. The mixing patterns, syntactic borrowings, and phonological adaptations that emerge when languages blend require the model to learn from examples of the blend itself.
Which language pairs have native code-switching support?
Spanish-English code-switching metrics
Spanish-English is one of the most studied code-switching pairs, driven by the scale of bilingual populations in the United States and Latin America. Research benchmarks such as LinCE include Spanish-English code-switching data across multiple tasks and corpora, reflecting how extensively this language pair has been studied. Even so, specific WER metrics in public ASR provider documentation remain limited, and production variance is high enough that only real-world testing on your specific audio reveals how well a given system handles your users' switching patterns.
Hindi-English ASR code-switching capabilities
Hindi-English code-switching (Hinglish) dominates contact center and Business Process Outsourcing (BPO) workloads at scale, a direct result of India's large bilingual population. Despite this scale, standard ASR models fail at Hinglish because of compounding gaps in acoustic modeling, vocabulary, language modeling, and training data. Gladia's BPO use case page addresses this context directly, and Solaria-1's coverage of South Asian languages including Bengali, Punjabi, Tamil, Urdu, Marathi, and Hindi matters for Contact Center as a Service (CCaaS) platforms serving these markets.
French-English ASR: production readiness
French-English code-switching occurs in various contexts, including Canadian markets, francophone African communities, and European contexts where English is the dominant business language but French is the primary community language. Research datasets include French-English combinations, but commercial ASR systems vary significantly in how well they handle regional phonological patterns versus textbook European French.
Evaluating Tagalog/Cantonese code-switching
Tagalog-English (Taglish) and Cantonese-English code-switching expose the largest gap between "supported language" marketing claims and actual production capability. Both appear constantly in contact centers across the Philippines, Hong Kong, and diaspora communities globally, yet both involve typological distance from English that makes end-to-end model training significantly harder.
Across the research literature, a substantial proportion of CS-ASR benchmarks concern Mandarin-English, Hindi-English, and Arabic-English, with combinations like Malay-English and Tagalog-English representing the frontier of current model capability. Any vendor claiming strong performance on Taglish or Cantonese-English should be asked specifically for WER benchmarks on code-switched test sets for those pairs, not aggregated multilingual accuracy figures.
ASR limitations for emerging language pairs
The gap between high-resource and low-resource code-switching pairs is structural, not a function of model size or compute. Dedicated CS datasets are scarce because most models rely on monolingual or mixed-language corpora that fail to reflect real-world code-switching patterns. You cannot close this gap by fine-tuning on more monolingual data for either language in the pair. For product teams building for these populations, the practical implication is direct: evaluate on real-world audio samples from your actual users, not synthetic test sets.
Why language coverage doesn't guarantee code-switching capability
Scarcity of code-switching training data
Code-switching happens in conversation, not in written text, which means it is underrepresented in virtually every large-scale training corpus that ASR models learn from. Building a competitive code-switching model requires curating, annotating, and training on actual recordings of bilingual speakers switching languages, at scale, across diverse audio conditions. This is expensive and slow, which is why most providers have not done it for more than a handful of pairs.
Why model architecture limits code-switching
The dominant architectural pattern for multilingual ASR uses a Language Identification (LID) model to detect the spoken language, then routes the audio to a separate monolingual ASR engine trained for that language. This design treats language detection as a preprocessing step, not as an integrated part of transcription.
The LID-routing model creates an inherent problem for code-switching because the language identification step operates on a window of audio, and mid-sentence language changes produce ambiguous or incorrect routing signals. The model either assigns the whole sentence to the wrong language engine or triggers a routing switch mid-utterance that disrupts transcription continuity. Gladia's automatic language detection takes a fundamentally different approach: continuous language detection within the transcription pipeline itself rather than an upfront routing decision.
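The routing failure mode is easy to simulate. The sketch below is a deliberately naive illustration, not any provider's actual pipeline: the per-window language labels are hypothetical, and the router uses a simple majority vote, which is one common way LID-routing systems collapse a mixed-language utterance onto a single engine.

```python
from collections import Counter

def route_by_majority_lid(window_labels):
    """Naive LID routing: classify each audio window, then send the
    whole utterance to the engine for the majority language."""
    counts = Counter(window_labels)
    winner, _ = counts.most_common(1)[0]
    return winner

# Hypothetical per-window LID labels for
# "Can we review the Q3 forecast, s'il vous plait?"
windows = ["en", "en", "en", "en", "fr", "fr"]

engine = route_by_majority_lid(windows)
# The French tail is routed to the English engine, so "s'il vous plait"
# is decoded against an English vocabulary and comes out garbled.
print(engine)  # prints: en
```

The alternative, per-window routing, avoids the majority-vote failure but introduces a different one: a routing switch mid-utterance that breaks transcription continuity, exactly as described above.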
ASR performance: monolingual vs. code-switched
| Scenario | Documented WER impact | Source basis |
| --- | --- | --- |
| Hindi-English code-switched | 30–50% relative WER increase over Hindi baseline | Research on Hinglish ASR |
| Catalan-Spanish code-switched | 51–63% WER on baseline models, 51.29% on fine-tuned | CS-ASR evaluation research |
| Low-resource pairs (e.g., Malay-English) | Highly variable, with substantial performance gaps | CS benchmark experiments documented in the literature |
The cost to your product goes beyond the transcript. Text-based sentiment inference and named entity recognition both operate on the transcript, not the audio signal, so their accuracy is ceiling-bounded by transcription quality. Higher WER on code-switched audio degrades the transcript and can reduce the reliability of downstream analysis built on top of it.
Selecting multilingual code-switching ASR for your product
Define your code-switching language scope
Before you evaluate any vendor, map which code-switching pairs your users actually produce. Pull support ticket language data, review session recordings, and segment user satisfaction scores by language and region. This is not market research. It is a precision requirement, because "multilingual support" varies by orders of magnitude across language pairs, and performance characteristics can differ significantly between different code-switching combinations.
Your evaluation matrix should list every language pair in priority order, weighted by user volume and the severity of a transcription failure in that context. The stakes for transcription accuracy will vary based on your specific use case and user base.
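One way to make that weighting concrete is a simple priority score per pair. The volumes, severity scale, and field names below are hypothetical placeholders, just to sketch the idea:

```python
# Hypothetical evaluation matrix: weight each code-switching pair by
# monthly audio volume and the severity of a transcription failure
# in that context (1 = cosmetic, 5 = compliance/revenue impact).
pairs = [
    {"pair": "hi-en", "monthly_hours": 12000, "failure_severity": 5},
    {"pair": "es-en", "monthly_hours": 8000,  "failure_severity": 3},
    {"pair": "fr-en", "monthly_hours": 1500,  "failure_severity": 4},
]

def priority_score(entry):
    # Volume times severity; refine the weighting to fit your product.
    return entry["monthly_hours"] * entry["failure_severity"]

ranked = sorted(pairs, key=priority_score, reverse=True)
for entry in ranked:
    print(entry["pair"], priority_score(entry))
```

The ranked list becomes the order in which you demand vendor evidence: the top pair gets the deepest benchmark scrutiny.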
Request WER on code-switched test sets
When you ask a vendor for accuracy benchmarks, ask specifically for WER on code-switched test sets, not aggregated multilingual accuracy figures. Ask for code-switched evaluation methodology, dataset coverage, and production-like test conditions, then compare those results with Gladia’s current benchmark framework covering 8 providers, 7 datasets, and 74+ hours of audio. A vendor that cannot provide WER evidence for your target language pairs under clearly defined test conditions has not tested them in a meaningful way.
Also ask whether the evaluation audio includes natural conversations with mixed languages or only clean studio recordings. The benchmark condition should match your production environment.
Assess production code-switching accuracy
Once integrated, measure WER separately for code-switched segments versus monolingual segments in your production traffic. Build language-specific accuracy monitoring into your QA pipeline from day one so you catch regressions before they reach support ticket volume. Track your code-switching distribution over time, as the language pairs in your production traffic may shift.
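Segment-level WER monitoring can be sketched in a few lines. This is a minimal from-scratch implementation for illustration; in production you would typically use an established library such as jiwer, and the segment schema here (`ref`, `hyp`, `code_switched`) is a hypothetical shape for your QA records:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by reference word count, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

def wer_by_segment_type(segments):
    """Average WER separately for code-switched vs monolingual segments.
    Each segment: {"ref": str, "hyp": str, "code_switched": bool}."""
    buckets = {True: [], False: []}
    for seg in segments:
        buckets[seg["code_switched"]].append(wer(seg["ref"], seg["hyp"]))
    return {
        "code_switched": sum(buckets[True]) / len(buckets[True]) if buckets[True] else None,
        "monolingual": sum(buckets[False]) / len(buckets[False]) if buckets[False] else None,
    }
```

Feeding this a daily sample of human-reviewed transcripts gives you the two numbers that matter: WER on code-switched segments and WER on monolingual segments, tracked separately so a regression in one does not hide inside an aggregate.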
Gladia's solution for multilingual code-switching
Current code-switching language pairs
We built Solaria-1 to cover 100+ languages and dialects with native code-switching support, including 42 languages you cannot get from other API-level speech-to-text (STT) providers. That coverage includes high-demand BPO and contact center languages: Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi, as well as languages like Haitian Creole, Maori, and Javanese, where most providers have no production-grade support at all.
When you enable code-switching in our API, the model continuously detects the spoken language and switches transcription accordingly, without requiring callers to announce which language they are using or developers to route audio through a separate language identification step.
Instead of stitching together separate vendors for transcription, diarization, enrichment, and downstream structured outputs, teams can handle the audio pipeline through one API.
How Gladia measures code-switching
We evaluate Solaria-1 against other providers across 7 datasets and over 74 hours of audio, covering diverse languages, accents, and audio conditions. Our benchmark methodology is open and reproducible. Within this evaluation scope, on multilingual datasets with accented speech and challenging acoustic conditions, Solaria-1 achieves up to 29% lower WER and up to 3x lower diarization error rate (DER) compared to leading alternatives. These results reflect the specific language pairs and conditions covered by these benchmarks and may vary across other language combinations. Our diarization is powered by pyannoteAI's Precision-2 model.
On paid plans, customer data is not used for model training by default. For teams handling sensitive audio data, Gladia provides deployment flexibility including cloud hosting options and enterprise deployment models.
Real-world code-switching API demo
Code-switching detection can be configured in the API request:
```json
{
  "audio_url": "<your-bilingual-audio-file-url>",
  "detect_language": true,
  "enable_code_switching": true,
  "diarization": true
}
```
The structured transcript response includes word-level timestamps, speaker labels, and language tags, making it easier to filter or route content by language:
```json
{
  "transcription": {
    "utterances": [
      {
        "speaker": "A",
        "text": "Can we review the Q3 forecast, s'il vous plaît?",
        "words": [
          {"word": "Can", "language": "en", "start": 0.0, "end": 0.3},
          {"word": "s'il", "language": "fr", "start": 4.1, "end": 4.4},
          {"word": "vous", "language": "fr", "start": 4.4, "end": 4.6},
          {"word": "plaît", "language": "fr", "start": 4.6, "end": 4.9}
        ]
      }
    ]
  }
}
```
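With word-level language tags, filtering or routing content by language is a simple traversal of the response. A minimal sketch, assuming a response shaped like the example above:

```python
def words_by_language(response, language):
    """Collect (speaker, word, start_time) tuples for every word tagged
    with the given language code in a transcript response."""
    matches = []
    for utterance in response["transcription"]["utterances"]:
        for word in utterance["words"]:
            if word["language"] == language:
                matches.append((utterance["speaker"], word["word"], word["start"]))
    return matches

response = {
    "transcription": {
        "utterances": [
            {
                "speaker": "A",
                "text": "Can we review the Q3 forecast, s'il vous plaît?",
                "words": [
                    {"word": "Can", "language": "en", "start": 0.0, "end": 0.3},
                    {"word": "s'il", "language": "fr", "start": 4.1, "end": 4.4},
                    {"word": "vous", "language": "fr", "start": 4.4, "end": 4.6},
                    {"word": "plaît", "language": "fr", "start": 4.6, "end": 4.9},
                ],
            }
        ]
    }
}

french = words_by_language(response, "fr")
print(french)  # [('A', "s'il", 4.1), ('A', 'vous', 4.4), ('A', 'plaît', 4.6)]
```

The same traversal can drive per-language routing, for example sending French spans to a French-speaking review queue.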
Evaluating multilingual ASR in production
Code-switching ASR: WER benchmarks
Production results from teams using Gladia demonstrate the system's real-world performance across multilingual scenarios.
The summarization quality, named entity recognition output, and sentiment signals your downstream AI pipeline depends on are all ceiling-bounded by transcript accuracy. A 1-3% WER in production is not just a transcription metric. It is a data quality metric for every NLP task built on top of that transcript.
For multilingual products, performance matters most on messy production audio with accents, multiple speakers, overlap, and background noise, not only on clean benchmark samples.
Supported code-switching language pairs
The table below compares code-switching support, language coverage, and pricing model, including whether diarization and NER are bundled. Pricing structure is based on published information as of April 2026.
| Provider | Code-switching support | Unique language coverage | Pricing model |
| --- | --- | --- | --- |
| Gladia (Solaria-1) | Native, automatic, across 100+ languages | 100 languages, 42 exclusive to Gladia | All-inclusive: diarization, NER, sentiment, and translation included at no additional cost |
| Deepgram (Nova-3 Multilingual) | Real-time code-switching across 10 languages (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, Dutch) | 45+ languages | Base rate varies by model; features metered separately as add-ons |
| AssemblyAI | Code-switching capability varies by model | Language coverage not independently verified | Base rate plus per-feature add-ons; total cost varies by feature selection |
| Google Cloud STT | Supports alternative language detection via alternativeLanguageCodes parameter | 125 languages | Base rate plus additional charges depending on model and feature selection |
Vendor add-on pricing compounds across high-volume workloads: when each feature is metered separately, the total bill becomes significantly harder to model. At 10,000 hours per month with diarization, NER, and sentiment enabled, the gap between all-inclusive and add-on-priced models is material. Our pricing page shows broad feature availability across paid plans.
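The compounding effect is simple arithmetic to model. The per-hour rates below are purely illustrative placeholders, not any vendor's published pricing; substitute real rate cards before drawing conclusions:

```python
def monthly_cost(hours, base_rate, addon_rates=()):
    """Total monthly bill: hours times (base rate plus any per-hour add-ons)."""
    return hours * (base_rate + sum(addon_rates))

HOURS = 10_000  # monthly audio volume

# Illustrative per-hour rates only.
all_inclusive = monthly_cost(HOURS, base_rate=0.60)
with_addons = monthly_cost(HOURS, base_rate=0.45,
                           addon_rates=(0.10, 0.08, 0.06))  # diarization, NER, sentiment

print(round(all_inclusive, 2), round(with_addons, 2))
```

Note that the add-on total also changes every time you toggle a feature, which is the forecasting burden the all-inclusive model removes.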
Handling code-switching in async and real-time transcription
For async workloads, post-call analysis, meeting transcription, and contact center audio, processing the full recording gives Solaria-1 complete utterance context before output is finalized.
Language detection runs across the full audio span rather than on short leading chunks, which reduces mid-transcript language switches caused by ambiguous acoustic segments at segment boundaries.
Diarization accuracy also benefits: speaker turn boundaries are resolved against the full recording rather than inferred from a rolling buffer, which reduces speaker confusion in overlapping or accented speech.
For teams that need live transcription, voice agents, live captions, or live assist workflows, Solaria-1 delivers 103ms partial transcript latency and 270ms final transcript latency when measured over WebSocket connections with 16kHz audio at 30-second chunk lengths under production network conditions.
Language detection runs continuously within the transcription pipeline rather than as a preprocessing step, which removes additional routing logic at the orchestration layer. Teams building on real-time voice frameworks such as Pipecat (open-source voice AI pipeline framework), LiveKit (open-source WebRTC media infrastructure), or Vapi (commercial voice AI orchestration service) can pass audio directly without a separate language identification step upstream.
Test your own multilingual audio to see how Gladia handles automatic language detection and code-switching in production. Review pricing and start building, with a typical integration running in less than a day, or review our full benchmark methodology to evaluate performance on your specific language pairs before committing.
FAQs
What is code-switching in ASR?
Code-switching in ASR is a model's ability to accurately transcribe audio when a speaker changes languages mid-conversation or mid-sentence, covering both intrasentential switches (within a single sentence) and intersentential switches (at sentence boundaries). Standard monolingual models fail at this because they were trained on single-language corpora and cannot handle the acoustic and linguistic patterns that emerge when languages blend.
How does Gladia handle code-switching?
Solaria-1 automatically detects and processes language changes across 100+ languages using a single end-to-end multilingual architecture rather than a language identification routing model. You enable it with a single parameter in the API request.
Do you charge extra for code-switching?
No. On Gladia’s paid plans, code-switching is available within the pricing model rather than as a separate per-feature add-on. Check the current pricing page for exact plan-level terms and enterprise packaging details.
Which language pairs have the most mature code-switching support in ASR?
Code-switching performance varies significantly across providers for all language pairs. Test with audio samples from your specific language combination to evaluate whether the system meets your accuracy requirements.
How do I evaluate whether an ASR vendor actually supports code-switching for my language pair?
Ask for WER benchmarks specifically on code-switched test sets rather than monolingual accuracy figures, request the benchmark methodology and dataset names, and test with real audio samples from your own users rather than clean studio recordings. Refer to benchmark methodology, dataset coverage, and code-switched evaluation conditions, then compare those results with Gladia’s benchmark framework covering 8 providers, 7 datasets, and 74+ hours of audio.
Key terms glossary
Intrasentential code-switching: Changing languages within a single sentence or clause, following the grammatical rules of both languages simultaneously.
Intersentential code-switching: Changing languages between sentence boundaries, where each complete utterance is internally consistent but the conversation alternates languages.
Word error rate (WER): The standard metric for ASR accuracy, calculated by adding substitutions, insertions, and deletions, then dividing by the total number of spoken words. Lower is better.
Language identification (LID) routing: An ASR architecture where a language detection model classifies audio first, then routes it to a monolingual ASR engine. This design creates accuracy and latency challenges for code-switching because mid-sentence language changes produce ambiguous routing signals.
Diarization error rate (DER): The standard metric for speaker diarization accuracy, measuring the proportion of audio incorrectly attributed to the wrong speaker. Lower is better.
Hinglish: A hybrid language variety combining Hindi and English, widely spoken across India and the South Asian diaspora. Hinglish speakers fluidly mix vocabulary, grammar, and phonology from both languages within conversations and often within individual sentences, making it a common and linguistically complex case study in automatic speech recognition and multilingual transcription systems.
Data Processing Agreement (DPA): A legally binding contract specifying how a vendor handles and protects customer data, required for GDPR compliance and typically reviewed during enterprise procurement.