What is code-switching in speech recognition?

Published on March 25, 2026
By Ani Ghazaryan

Code-switching in speech recognition is language alternation within utterances that breaks monolingual ASR models at switch points. End-to-end multilingual architectures handle intra-sentential switches natively without LID routing overhead, reducing WER by up to 55% at language boundaries.

TL;DR: Code-switching is the alternation between two or more languages within a single utterance or conversation. In ASR, it breaks monolingual models at language boundaries: tokenizers produce [UNK] tokens, WER spikes by 30-50%, and real-time streams stall at switch points. The root cause is architectural: LID-based cascade pipelines add compounding latency and cut words mid-switch. The fix is an end-to-end multilingual model that handles language fluidity natively, without pre-specified routing. Gladia's Solaria-1 does this with a single API call and a 270ms latency target.

Most ASR models treat code-switching as an error state, resulting in high WER and hallucinated transcripts. Solving this requires moving beyond simple Language Identification (LID) to end-to-end multilingual architectures that process language fluidity natively. This article breaks down what code-switching is at a technical level, why standard ASR pipelines fail to handle it, and what the state-of-the-art architectural approaches actually look like.

What is code-switching in speech recognition?

Code-switching (CS), also called code-mixing in ASR literature, is the use of elements from more than one language within the same utterance or discourse. A bilingual speaker answering a support call might begin in English, shift to Spanish when a concept feels more natural in their first language, and then return to English to close the sentence. That single exchange contains three language transitions, and your ASR model has to handle all of them correctly.

The distinction between code-switching and code-mixing matters only at the margins. Some linguists use code-mixing for the formal linguistic properties of language-contact phenomena and code-switching for actual spoken usage by multilingual speakers. In practice, the ASR community uses the two terms interchangeably, with both referring to the same multilingual speech phenomenon your pipeline needs to handle.

For global SaaS, CCaaS, and voice AI products, code-switching shows up in Spanglish (English–Spanish), Hinglish (Hindi–English), Franglais (French–English), Tagalog–English, and many other multilingual combinations. Your users aren’t switching languages to break your model; they’re just talking the way they naturally communicate.

Defining code-switching: intra-sentential vs. inter-sentential

The first distinction you need to make in production is whether a language switch happens within a sentence or between sentences. These two types have very different difficulty profiles for ASR.

| Type | Where the switch occurs | ASR difficulty | Example |
| --- | --- | --- | --- |
| Inter-sentential | At sentence boundaries | Moderate | "How are you? Estoy bien, gracias." |
| Intra-sentential | Mid-sentence, mid-phrase | High | "I know you from my amigo en la oficina." |

Inter-sentential code-switching is the switch between grammatically complete sentences, where each sentence is in a different language. The model receives a clean acoustic boundary and can reset its language assumption at the utterance level.

Intra-sentential code-switching is where the shift occurs in the middle of a sentence. Mid-sentence Mandarin-English switching, for example, forces the model to process phonemes from two different phonological systems simultaneously, with no clean boundary to work from. The acoustic variation between the two languages within a single utterance is significantly larger than across utterances, which is why intra-sentential switching is substantially harder for ASR. The model must hold two acoustic models in context at once with no advance notice of when the switch will happen, and data sparsity in phone modeling makes it extremely difficult to cover all context-dependent phone combinations.

The engineering challenge: why code-switching breaks monolingual ASR

Language confusion and accent bias

A monolingual model trained predominantly on English audio will bias toward English phoneme mappings when it encounters audio it doesn't recognize. When a Spanish speaker inserts "gracias" into an English sentence, a monolingual English ASR system typically doesn't return [UNKNOWN], but instead produces output based on phonetically similar English patterns. The acoustic features often get mapped to the closest available English phoneme matches, producing nonsense output or a phonetically adjacent English word.

Monolingual ASR exhibits 30-50% higher WER on code-switched data compared to clean monolingual audio. This is the failure mode that shows up in support tickets from your non-English user segments before your accuracy metrics catch it, and by the time you see it in your dashboards, you've already lost those users. Whisper research confirmed word-level language ID failures on code-switched speech, where the model produced text but mapped it phonetically to the wrong language.

The tokenizer bottleneck

The deeper architectural problem is the tokenizer. Most end-to-end ASR models use fixed vocabularies of characters, sub-words (BPE), or words. These work well for monolingual training but break down in multilingual contexts, where vocabulary size limits push rare characters from character-rich languages outside the model's token space entirely.

When foreign phonemes from Language B enter a tokenizer trained only on Language A, the tokenizer cannot map the incoming acoustic features to its known vocabulary and produces OOV tokens, [UNK] tokens, or character-level fallbacks. The downstream language model then receives a corrupted token sequence and either hallucinates a phonetically similar word in Language A or drops the segment. One IEEE-published study reported 177 OOV tokens per test set, which represents a structural vocabulary gap, not a rounding error in WER.
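The failure mode above can be sketched with a toy word-level tokenizer. This is an illustration, not a real ASR tokenizer: a vocabulary built only from English maps the Spanish word "amigo" to an [UNK] token, which is roughly what happens at switch points in monolingual models.

```python
# Toy sketch: an English-only vocabulary has no entry for Spanish words,
# so out-of-vocabulary (OOV) words collapse to [UNK] tokens.
ENGLISH_VOCAB = {"i", "know", "you", "from", "my", "office", "thanks"}

def tokenize(utterance: str) -> list[str]:
    """Map each word to itself if in-vocabulary, else to [UNK]."""
    return [w if w in ENGLISH_VOCAB else "[UNK]"
            for w in utterance.lower().split()]

print(tokenize("I know you from my amigo"))
# ['i', 'know', 'you', 'from', 'my', '[UNK]']
```

Real sub-word tokenizers fail less abruptly, falling back to character pieces, but the downstream effect is the same: the language model receives a token sequence it never saw in training.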

Data scarcity compounds the problem. The SwitchLingua dataset published in 2025, with 420K CS textual samples across 12 languages and over 80 hours of recordings spanning 63 ethnic groups, is one of the first attempts to address this at scale. Before datasets like SwitchLingua existed, most code-switching models were evaluated on bilingual mixing within homogeneous groups, which doesn't reflect the diversity of real production audio.

State-of-the-art technical approaches to code-switching

Frame-level language identification

The first generation of code-switching solutions used utterance-level LID: run a language classifier on the full utterance, assign one language label, and feed it to the appropriate monolingual model. This fails completely on intra-sentential switches because the utterance contains two languages and one label has to win.

Frame-level LID improves on this by identifying language at the granular level of individual acoustic frames (10-25ms slices) rather than per utterance. The LAL method in arXiv 2403.05887 aligns acoustic features to pseudo-language labels learned from the ASR decoder during training, enabling frame-level language identification without requiring frame-level annotations. This enables more precise detection of intra-sentential switches by monitoring language changes within a single sentence rather than across sentence boundaries.

Joint ASR and LID with RNN-T shows that running LID and ASR simultaneously in a cascade system can reduce latency, but the fundamental problem with cascade designs remains: every language you add requires another monolingual model, another routing rule, and another source of compounding latency. Supporting five languages means five ASR models, one LID model, and routing logic that breaks when a switch happens faster than your LID window can commit to a label.
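The frame-level idea can be illustrated with a minimal sketch. Assume a LID head has already emitted one language label per 10ms frame (real systems also smooth these labels; that step is omitted here for clarity):

```python
# Hypothetical sketch: locate switch points from per-frame language
# labels, as a frame-level LID head would emit them.
def find_switch_points(frame_labels: list[str], frame_ms: int = 10) -> list[int]:
    """Return the times (in ms) at which the predicted language changes."""
    switches = []
    for i in range(1, len(frame_labels)):
        if frame_labels[i] != frame_labels[i - 1]:
            switches.append(i * frame_ms)
    return switches

labels = ["en"] * 50 + ["es"] * 30 + ["en"] * 20  # 1 second of audio
print(find_switch_points(labels))
# [500, 800]
```

Utterance-level LID, by contrast, would collapse this entire second of audio to a single label and lose both transitions.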

Concatenated tokenizers and retraining-free approaches

The ACL 2023 Concatenated Tokenizer paper proposes reusing existing monolingual tokenizers and mapping them to mutually exclusive label spaces for each language within a single model. At inference time, the token ID range tells the model which language tokenizer to apply: tokens in range 0-1023 are English, tokens in 1024-2047 are Spanish, and so on. Language ID information is embedded directly in the token ranges, achieving 98%+ LID accuracy on the out-of-distribution FLEURS dataset for spoken language identification.
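The range-to-language mapping is simple enough to sketch directly. The range sizes below mirror the example in the text and are illustrative, not the paper's actual vocabulary sizes:

```python
# Sketch of the concatenated-tokenizer idea: each language's tokenizer
# occupies a disjoint token ID range, so the ID alone identifies the
# language with no separate LID step.
LANG_RANGES = {"en": range(0, 1024), "es": range(1024, 2048)}

def token_language(token_id: int) -> str:
    """Recover a token's language from the ID range it falls in."""
    for lang, id_range in LANG_RANGES.items():
        if token_id in id_range:
            return lang
    raise ValueError(f"token id {token_id} outside all language ranges")

print([token_language(t) for t in [17, 512, 1300, 1024]])
# ['en', 'en', 'es', 'es']
```

Because the ranges never overlap, per-token language labels fall out of decoding for free, which is what enables the high LID accuracy the paper reports.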

The Apple ML retraining-free approach pushes this further by combining existing monolingual acoustic and language models with an LSTM-based grapheme-to-phoneme model. Their system reportedly reduced WER from 34.4% to 15.3% on their intra-sentential code-switching test set, a 55.5% relative reduction, without degrading monolingual accuracy. The test conditions, language pairs, and audio characteristics used in their evaluation differ from typical production scenarios, so your own evaluation on real audio with your specific language combinations and audio quality remains the only reliable indicator for your system.

How Gladia Solaria-1 handles code-switching natively

Solaria-1 takes the architectural approach that cascade designs avoid: it is a single end-to-end multilingual model that handles language changes without a separate upstream language identification (LID) step. You don’t need to specify which languages to expect in your request; the model performs robust language detection automatically, even for accented speakers, correctly identifying cases like English spoken with a strong French accent. It detects language at the token level and outputs the correct script for whatever language is spoken, reducing common issues like language confusion and accent misclassification.

The Gladia code-switching documentation shows how to enable this with a single parameter:

{
  "audio_url": "YOUR_AUDIO_URL",
  "enable_code_switching": true,
  "code_switching_config": {
    "languages": ["en", "es", "fr"]
  }
}

If you want to narrow detection to specific language pairs, the code_switching_config.languages array limits the exploration space and improves accuracy for known bilingual combinations like Spanglish or Hinglish. Leave it unconstrained and Solaria detects across the full supported language list automatically.

The JSON response structure makes language attribution explicit at the utterance level:

{
  "utterances": [
    {
      "id": 1,
      "language": "en",
      "transcription": "Hello, how are you",
      "time_begin": 0.0,
      "time_end": 1.5
    },
    {
      "id": 2,
      "language": "es",
      "transcription": "estoy bien, gracias",
      "time_begin": 1.5,
      "time_end": 3.0
    }
  ]
}

Each utterance carries a language key, so your downstream pipeline can route text to the right NLP model, translation layer, or display component without building a separate classification step. For live transcription via WebSocket, each result object includes the type, transcription, language, and words fields in real-time, so the language label arrives with the transcript, not after it.
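A sketch of that routing step, using the per-utterance language key from the response shape above (the bucket-by-language strategy is one illustrative choice; you might equally dispatch to per-language handlers):

```python
# Group transcriptions by language so each bucket can go to a
# language-specific NLP model, translation layer, or display component.
response = {
    "utterances": [
        {"id": 1, "language": "en", "transcription": "Hello, how are you"},
        {"id": 2, "language": "es", "transcription": "estoy bien, gracias"},
    ]
}

def route_by_language(utterances: list[dict]) -> dict[str, list[str]]:
    buckets: dict[str, list[str]] = {}
    for u in utterances:
        buckets.setdefault(u["language"], []).append(u["transcription"])
    return buckets

print(route_by_language(response["utterances"]))
# {'en': ['Hello, how are you'], 'es': ['estoy bien, gracias']}
```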

Production users have reported results with Gladia's multilingual support across complex audio conditions:

Solaria-1 benchmark results show a 94% Word Accuracy Rate (6% WER) average across English, Spanish, French, and other common languages. This level of accuracy is especially critical for asynchronous transcription workflows, such as meeting assistants, call recordings, and analytics pipelines, where reliable multilingual understanding directly impacts downstream insights. While low latency (around 270ms) supports real-time use cases, code-switching accuracy is equally important in async processing, where recordings are transcribed after the fact and still require precise handling of mixed-language speech. The single-model approach also removes the LID-plus-routing overhead that can introduce errors or complexity in both real-time and asynchronous pipelines.

"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors... Clean API, easy to integrate and deploy to production." - Yassine R. on G2
"Gladia's transcriptions cater well to multilingual requirements, thus significantly aiding our customer support in a complex multilingual setup... Our decision to switch to Gladia was influenced by how our developers found it to be a better choice for our needs compared to other speech-to-text platforms like Eleven Labs." - Pratik S. on G2
"It's an incredible fast model. We are using the speech recognition model and it's unbelievably good for single or multi-language detection. We've integrated many models into our platform at Line 21, but Gladia is definitely in the top." - Paul B. on G2

You can verify Solaria-1’s behavior against your own audio using automatic language detection alongside code-switching mode, which is designed for multilingual environments where speakers naturally switch languages, accents vary, and language boundaries are fluid. This is particularly useful when your audio distribution includes a wide range of languages and accents, even beyond explicitly tested pairs, reinforcing Solaria-1’s strong language coverage. The How ASR Handles Languages post explores the model design decisions behind this in more depth.

Measuring success: WER, latency, and hallucination rates

Standard WER on your full test set will mislead you when evaluating code-switching performance. A model that transcribes English at 3% WER but produces 45% WER on Spanish utterances will show a blended score of 8-12%, which looks acceptable until you check your Spanish user churn rate.

The measurement approach that actually reflects production performance:

  • Switch-point WER: Calculate WER specifically for the 2-3 word window immediately before, during, and after each language transition. This is where monolingual models fail most severely and where users notice degradation first.
  • Per-language WER breakdown: Report WER separately for each language in your audio distribution, because blended scores mask language-specific regressions.
  • Hallucination rate at transitions: Track the rate of phonetically plausible but semantically incorrect substitutions at switch points specifically, since these are harder to catch with WER alone because the word exists, it's just wrong.
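The per-language breakdown from the list above can be sketched as follows. The `wer()` function is a standard Levenshtein-based word error rate; the `segments` structure, pairing a reference and hypothesis with a language tag, is an assumed shape your eval harness might emit:

```python
# Per-language WER breakdown: report WER separately for each language
# so a blended score cannot mask a language-specific regression.
def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def per_language_wer(segments: list[dict]) -> dict[str, float]:
    """Average segment-level WER separately for each language tag."""
    by_lang: dict[str, list[float]] = {}
    for seg in segments:
        by_lang.setdefault(seg["language"], []).append(wer(seg["ref"], seg["hyp"]))
    return {lang: sum(v) / len(v) for lang, v in by_lang.items()}

segments = [
    {"language": "en", "ref": "hello how are you", "hyp": "hello how are you"},
    {"language": "es", "ref": "estoy bien gracias", "hyp": "estoy bien grass"},
]
print(per_language_wer(segments))
# {'en': 0.0, 'es': 0.3333333333333333}
```

Switch-point WER follows the same pattern: restrict `segments` to the 2-3 word windows around each transition before computing.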

On the latency side, the compounding effect matters most for real-time voice pipelines. Each model hop in a cascade architecture adds to your LLM's response budget. If your LLM needs 200ms and your ASR pipeline needs 400ms due to LID overhead, your voice agent is already at 600ms before network latency. A single-model path with sub-300ms final transcript latency gives you the budget to invest in LLM quality rather than architecture workarounds.

Build your multilingual ASR evaluation now

Code-switching is significantly easier to handle if you choose the right architecture. Cascade systems built on LID routing handle inter-sentential switching adequately but can struggle with intra-sentential switches, compound latency at every model hop, and growing operational complexity as new languages are added. Modern end-to-end multilingual models that handle language changes natively at the token level address many of these challenges, reducing both configuration complexity and potential performance bottlenecks.

The STT vendor evaluation guide covers the full vendor evaluation checklist, including what to look for in multilingual accuracy claims and how to pressure-test WER benchmarks against your own audio distribution. If you're comparing providers specifically for multilingual support, the best STT APIs comparison covers the architectural trade-offs across the current provider landscape.

Start with 10 free hours and test Solaria-1 against your own bilingual audio by getting started with the Gladia API. Read the language_behaviour parameter documentation to configure detection for your specific language pairs, and run the switch-point WER measurement approach above against your actual production audio samples.

FAQs

What's the difference between code-switching and code-mixing in ASR?

In practice, ASR researchers use both terms to describe the same phenomenon: alternating between two or more languages within a single utterance. For engineering purposes they describe the same pipeline challenge, and the terms are interchangeable in technical documentation.

Does enabling code-switching hurt WER on monolingual audio?

No. When implemented correctly using concatenated tokenizer architectures, enabling code-switching does not degrade monolingual accuracy, as demonstrated in the Apple ML Research paper. The non-overlapping token mappings allow language-specific weights to be separated and modified independently without retraining the full model.

What WER improvement can I expect moving from a monolingual model to a code-switching model?

Apple's research showed a reduction from 34.4% WER to 15.3% WER on intra-sentential tasks using a retraining-free approach across Mandarin-English and Hindi-English pairs, a 55.5% relative reduction. Your actual improvement depends on your specific language pairs, audio conditions, and the proportion of switch-point segments in your data.

Do I need to specify which languages to detect when using Gladia's code-switching mode?

No. You can leave the code_switching_config.languages array unconstrained and Solaria will detect across all supported languages automatically. Constraining it to specific pairs narrows the detection space and can improve accuracy for known bilingual combinations like Spanglish or Hinglish.

What's the latency impact of code-switching detection in real-time streams?

With an E2E model like Solaria-1, code-switching detection adds no additional latency because there's no separate LID step, and the 270ms latency figure covers mixed-language audio the same way it covers monolingual audio. Cascade architectures, by contrast, add a separate LID overhead of 70-200ms on top of ASR latency before routing begins.

How should I test code-switching quality for my specific language pairs?

Test WER at switch points specifically rather than overall WER, measure per-language WER separately, and track hallucination rates at transitions. The SwitchLingua benchmark provides a standardized evaluation dataset across 12 languages if you want a comparable external reference point.

Key terms glossary

Code-switching (CS): The alternation between two or more languages within a single utterance or conversation, used interchangeably with code-mixing in ASR contexts.

Intra-sentential code-switching: A language switch that occurs within a single sentence, exposing the tokenizer and acoustic model to two phonological systems simultaneously.

Inter-sentential code-switching: A language switch that occurs between grammatically complete sentences, providing a clean acoustic boundary for the model to reset its language assumption.

Language Identification (LID): A classifier that assigns a language label to an audio segment. Utterance-level LID assigns one label per utterance. Frame-level LID assigns labels per 10-25ms audio frame, enabling detection of intra-sentential switches.

Concatenated Tokenizer: An architecture that maps multiple monolingual tokenizers to mutually exclusive token ID ranges within a single model, enabling per-token language identification without a separate LID step.

Word Error Rate (WER): The standard ASR accuracy metric, calculated as the ratio of insertion, deletion, and substitution errors to total reference words. For code-switched audio, switch-point WER is more informative than overall WER.

End-to-end (E2E) multilingual model: An ASR architecture that processes mixed-language audio through a single model without routing to separate language-specific models, eliminating LID latency overhead and cascade complexity.

Hallucination: An ASR output where the model produces a phonetically plausible word in the dominant training language instead of correctly transcribing a foreign-language phoneme, most common at intra-sentential switch points in monolingual models.
