TL;DR: Code-switching breaks traditional ASR systems that assign a single language at the session level. The common workaround, a separate Language Identification (LID) model layered before transcription, adds latency and complexity at each stage, and errors from the LID step propagate into the transcript. End-to-end multilingual models that detect language changes natively remove the routing layer entirely, maintaining accuracy across 100+ languages without manual language configuration, while detect-then-transcribe pipelines accumulate maintenance overhead in proportion to the number of languages they support.
Your QA pipeline passed on English audio. The support tickets from bilingual users switching between Spanish and English mid-sentence are telling a different story. That gap between test conditions and production reality is where most multilingual transcription failures live, and the architecture of your detection pipeline determines how wide that gap becomes.
Most teams encountering code-switching failures are running async workflows such as meeting assistants, support call analytics, and compliance review pipelines, where recordings are processed post-session. Accuracy in those pipelines determines the quality of every downstream output: summaries, sentiment scores, entity extraction, compliance flags. Real-time voice agents face the same underlying detection problem, but they represent a narrower slice of where code-switching errors actually surface in production.
The more common failure mode is a meeting assistant that produces a coherent English summary of a conversation that was 40% Spanish, or a compliance pipeline that misattributes speaker intent because the model silently failed to switch language context mid-call. These failures don't generate immediate errors, but they generate quietly wrong outputs that QA catches weeks later, if at all. Understanding how detection architecture affects accuracy in those async contexts is where the diagnostic work starts.
What is code-switching in speech recognition?
Code-switching is the practice of alternating between two or more languages within a single conversation. It's not a signal of linguistic confusion; it reflects how fluent speakers communicate when they share multiple languages.
For ASR systems, code-switching creates a structural problem: the model needs to know which language it's transcribing before it can apply the correct acoustic model, phoneme set, and language model probabilities. When the language changes unexpectedly, those assumptions break.
Intra-utterance vs. inter-utterance switching
The distinction between these two forms determines how difficult detection actually is.
Inter-utterance switching happens at sentence or clause boundaries. A speaker finishes a complete thought in one language and begins the next in another. Example (illustrative): "Ani wideili [Assyrian]. What happened? [English]" This pattern gives a LID model a natural pause to recalibrate before the next segment begins, so detection has a chance to catch up with the speaker.
Intra-utterance switching happens within a single sentence, often at the word or morpheme level. Example (illustrative): "La onda is to fight y jambar" ("The latest fad is to fight and steal") shows how a speaker moves between languages inside a single syntactic structure. By the time an external LID model detects the switch and routes the audio segment, the phoneme context may already be incorrect, and the words that triggered the switch have been decoded against the wrong language model. This is why intra-utterance switching is commonly the failure mode teams discover through customer complaints rather than internal testing.
Why code-switching breaks traditional ASR systems
Traditional ASR models typically work with a single language assignment that persists for the duration of the audio. The model learns English, or it learns Hindi, but it doesn't learn the space between them, and that's where production failures accumulate.
The language identification bottleneck
A common workaround is a separate LID model that runs on audio before or alongside the transcription model, detects which language is present, and routes audio to the correct ASR endpoint.
This pipeline has a fundamental structural problem: the LID model typically needs to identify the language before transcription can begin, creating challenges when languages switch mid-utterance.
Each language has distinct phoneme inventories and sound patterns. When a speaker says "Estoy checking my email," the ASR attempts to force sounds from both languages into a single language's phoneme lattice. That phoneme mismatch can increase error likelihood on mixed-language input.
The pipeline also introduces stitching complexity: segmented audio from multiple language-specific ASR endpoints must be reassembled in the correct order with consistent timestamps, and each step adds latency and creates new failure surfaces for your engineering team to maintain.
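The stitching layer can be sketched in a few lines; the function and segment fields below are hypothetical, a minimal sketch of the reassembly step a detect-then-transcribe pipeline has to maintain:

```python
def stitch(segments):
    """Reassemble per-language ASR outputs into one transcript.

    segments: list of dicts with 'start' (seconds) and 'text', one per
    language-specific ASR endpoint. Real stitching also has to reconcile
    overlapping or out-of-order timestamps across endpoints, which is
    where new failure surfaces appear.
    """
    ordered = sorted(segments, key=lambda s: s["start"])
    return " ".join(s["text"] for s in ordered)
```

Even this trivial version assumes every endpoint returns consistent timestamps; in practice, clock drift and segment overlap turn this into a maintained component rather than a one-liner.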
Impact on word error rate and downstream NLP
Single-language decoding can produce errors on code-switched audio when the model attempts to map one language's phoneme patterns to audio containing a different language, potentially generating plausible-sounding but incorrect text.
WER on code-switched segments may increase compared to monolingual baselines when the detection layer fails to catch a language change in time, and these transcription errors can affect downstream NLP tasks that depend on that segment, from sentiment analysis to named entity recognition to compliance workflows where extracted entities feed CRM records.
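WER itself is straightforward to compute as a word-level edit distance, (S + D + I) / N over the reference words. A minimal sketch, without the normalization, punctuation handling, or tokenization rules that real WER tooling adds:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / N,
    computed via word-level edit distance (Levenshtein over words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

On code-switched audio, the reference transcript itself must preserve the original language mix; scoring against a monolingual "cleaned" reference hides exactly the errors this section describes.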
Code-switching detection algorithms and techniques
Several algorithmic approaches address the detection problem, each with different accuracy and latency trade-offs.
Phonetic and acoustic feature analysis
Some acoustic-phonetic approaches reportedly analyze the audio's spectral characteristics to detect the phoneme inventory shifts that occur when a speaker changes languages.
Phonotactic information may describe permissible phoneme sequences in a specific language, potentially adding a higher-level signal that pure spectral features can miss.
These approaches may work reasonably well on clean audio with clear phonetic contrasts between the two languages involved. They can degrade on accented speech, where a bilingual speaker's phonetic realizations blend characteristics of both languages, a pattern reportedly common in call center and meeting audio where speakers use their native language's intonation while speaking a second language.
Gladia’s benchmark methodology evaluates performance across multiple providers and multilingual datasets covering diverse accents and audio conditions, as outlined in its published benchmark.
Language model confidence scoring and BERT-LID
ASR models generate word-level confidence scores between 0 and 1, representing the model's certainty about each decoded token. Some approaches use confidence drops below a set threshold on consecutive tokens to indicate that the primary language model may have lost traction on the audio, which can signal a language switch. These systems may then trigger a secondary language check to re-decode the flagged segment.
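A minimal sketch of this confidence-drop heuristic, assuming word-level (token, confidence) pairs from the decoder; the function name and thresholds are illustrative, not any specific vendor's API:

```python
def flag_possible_switches(words, threshold=0.5, min_run=3):
    """Flag runs of consecutive low-confidence tokens as possible
    language switches.

    words: list of (token, confidence) pairs, confidence in [0, 1].
    Returns (start, end) index ranges where at least min_run
    consecutive tokens fall below threshold; these segments would
    then be handed to a secondary language check for re-decoding.
    """
    flagged, run_start = [], None
    for i, (_, conf) in enumerate(words):
        if conf < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                flagged.append((run_start, i))
            run_start = None
    if run_start is not None and len(words) - run_start >= min_run:
        flagged.append((run_start, len(words)))
    return flagged
```

The min_run parameter is the usual trade-off knob: a short run catches switches early but also fires on ordinary noise-induced confidence dips.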
BERT-based Language Identification (BERT-LID) extends this by applying contextual language models to the partially decoded text rather than the raw audio.
BERT's bidirectional attention reportedly identifies language-specific lexical and syntactic patterns even when phonetic features are ambiguous. One reported limitation is that BERT-LID operates on text output, so it can only detect a switch after some text has already been decoded, which may lead to higher error rates on the initial words of a code-switched segment.
Token-level language tagging and boundary detection
Modern approaches reportedly integrate language identification directly into the decoding process, with some models designed to predict a language tag at each token position alongside the text token itself. Rather than running a separate detection pass, such models are designed to output both the word and the language it belongs to within a single inference step.
This architecture is designed to enable native code-switching handling, and the Gladia documentation on automatic language detection describes how this works in practice: the model identifies language boundaries without requiring manual language configuration or pre-set language lists per session.
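Conceptually, once each decoded token carries a language tag, boundary detection reduces to scanning for tag changes. A sketch under that assumption, with (word, lang_tag) pairs as a stand-in for whatever structure a given model emits:

```python
def language_boundaries(tagged_tokens):
    """Return the token indices where the predicted language changes.

    tagged_tokens: list of (word, lang_tag) pairs, as produced by a
    model that predicts a language tag alongside each text token.
    """
    return [i for i in range(1, len(tagged_tokens))
            if tagged_tokens[i][1] != tagged_tokens[i - 1][1]]
```

Because the tags come out of the same inference pass as the words, there is no window where a stale language assignment is applied to already-decoded tokens.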
How to detect code-switching in audio: implementation approaches
The architectural decision between building a detection-then-transcribe pipeline versus using a model with native handling directly affects your latency budget, WER, and engineering overhead. Two patterns dominate current implementations.
Detection-then-transcribe (post-hoc):
- LID model segments audio by detected language before transcription begins
- Routing layer sends each segment to a language-specific ASR endpoint
- Stitching layer reassembles results with aligned timestamps
- Latency penalty: LID inference, segmentation, routing, and stitching add cumulative overhead before any transcript is available
- Failure mode: Intra-utterance switches occur faster than LID inference cycles can complete, so the first tokens of a switched segment are decoded against the wrong language model
Native end-to-end multilingual:
- Single model handles detection and transcription in one inference pass
- Token-level language tags predict language alongside text at each position
- No segmentation or stitching: The model maintains language context across the full utterance
- Latency advantage: Removing the separate LID stage eliminates the pre-transcription overhead entirely
- Accuracy improvement: Removes the failure window where LID holds the wrong language assignment for several tokens before correcting
For async workflows processing meeting recordings, support call archives, or compliance audio, a post-hoc LID pipeline introduces an additional inference stage that can reduce accuracy on intra-utterance switches; the failure window where the wrong language assignment persists for several tokens is most damaging when transcripts feed downstream sentiment analysis or entity extraction. For voice agent pipelines targeting very low end-to-end latency (STT + LLM + TTS), post-hoc LID also adds overhead before transcription begins, potentially constraining the overall latency budget.
Evaluating commercial speech-to-text APIs for mixed-language audio
When you evaluate commercial APIs for a mixed-language production workload, headline English WER numbers tell you very little about what you'll see in production. Important considerations include which languages the model was actually trained on, how it handles intra-utterance switches, and what the true cost looks like once you add diarization and NER to the transcription base rate. Native code-switching matters not only because it reduces latency, but because it removes the separate LID vendor or model from the stack entirely, which eliminates an extra routing layer, reduces stitching errors, and simplifies the architecture for mixed-language production workloads.
| API | Code-switching support | Pricing model | Real-time | Add-on fees |
| --- | --- | --- | --- | --- |
| Gladia (Solaria-1) | Native, 100+ languages, no config required | Starter: $0.61/hr (async) / $0.75/hr (real-time); Growth: from $0.20/hr (async) / $0.25/hr (real-time) | Yes, low-latency real-time transcription for live workflows | Diarization, NER, sentiment, translation, summarization, and code-switching detection available across plans |
| OpenAI Whisper | Limited; reported hallucination risk on mixed-language audio | $0.36/hr | Yes | — |
| AssemblyAI | Not documented for intra-utterance switching | $0.15/hr base | Yes | Separate add-ons |
| Deepgram | Not documented for intra-utterance switching | ~$0.258/hr base (Nova-3) | Yes | Separate add-ons |
Capability assessments are based on publicly available documentation at time of writing and are subject to change.
Pricing structures vary significantly across providers. Some bundle features in the base rate while others charge separately for capabilities like speaker labels, entity detection, and summarization.
The Gladia rates shown above represent Starter and Growth tiers respectively; see the Gladia pricing page for current details and additional plan options. Check current pricing documentation for each provider against your specific feature requirements.
OpenAI Whisper and NVIDIA NeMo
Whisper has reported limitations on mixed-language input, including hallucination-like behavior on low-signal audio at language boundaries, based on community and research observations, particularly where the model encounters phoneme patterns from multiple languages in the same utterance. Those reports suggest the behavior can affect WER on mixed-language transcripts, which is worth evaluating against your specific workload if code-switching is common. While OpenAI's Realtime API provides real-time transcription capabilities, the underlying model's handling of language boundaries remains an open evaluation question for multilingual production use cases.
On silence and low-energy audio specifically, Whisper's reported hallucination behavior is most pronounced at language boundaries where one speaker pauses and another begins in a different language: the model tends to fill low-signal frames with plausible-sounding tokens from its dominant training distribution rather than producing an empty or uncertain output, which compounds WER on mixed-language transcripts.
NVIDIA NeMo offers multilingual modeling approaches that reportedly support code-switching functionality, though detailed configuration and implementation specifics vary by use case.
AssemblyAI and Deepgram
Both AssemblyAI and Deepgram offer multilingual transcription with real-time capability and are capable APIs for English-primary workloads. The technical steps to migrate from Deepgram and from AssemblyAI are documented based on customer integration patterns. Where the two providers differ most from Gladia is in how add-on features are priced at scale.
AssemblyAI's $0.15/hr base rate and Deepgram's ~$0.258/hr base rate both exclude diarization, NER, and sentiment analysis. At 1,000 hours per month with those features enabled, the effective hourly rate rises materially beyond the headline figure for both providers, a factor worth modeling before committing to a base rate that excludes the features your pipeline requires.
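Modeling the effective rate is simple arithmetic; the add-on rates below are deliberately left as caller-supplied placeholders, since they vary by provider and contract:

```python
def monthly_cost(hours, base_rate, addon_rates=()):
    """Monthly spend once per-feature add-ons stack on the base rate.

    base_rate and each entry of addon_rates are in $/hr; addon_rates
    stands in for whatever diarization, NER, or sentiment features a
    provider meters separately (placeholder values, not vendor prices).
    """
    return hours * (base_rate + sum(addon_rates))
```

With no add-ons, `monthly_cost(1000, 0.15)` reproduces the $150 base figure above; each metered feature shifts the total upward from that floor.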
How Gladia handles code-switching natively
Solaria-1 architecture and language boundary detection
Solaria-1 detects language switches automatically across all 100+ supported languages without requiring you to specify a language list or configure per-session language parameters. The model identifies language boundaries at the word level and tags each segment with the detected language as an ISO 639-1 code in the structured API response, which you can pass directly to downstream NLP tasks like sentiment analysis or NER without a separate pre-processing step.
The response format includes ISO 639-1 language codes at both the utterance and word level:
```json
{
  "utterances": [
    {
      "text": "Estoy checking my email",
      "start": 0.0,
      "end": 2.1,
      "language": "es",
      "words": [
        {"word": "Estoy", "start": 0.0, "end": 0.4, "language": "es"},
        {"word": "checking", "start": 0.5, "end": 0.9, "language": "en"},
        {"word": "my", "start": 1.0, "end": 1.2, "language": "en"},
        {"word": "email", "start": 1.3, "end": 1.7, "language": "en"}
      ]
    }
  ]
}
```
This is an illustrative example of Spanish-English code-switching and how the response structure represents it. Refer to the Gladia language detection documentation for the full schema and configuration options.
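Downstream code can consume the word-level tags directly. A sketch that groups one utterance's words into contiguous same-language runs, with field names taken from the illustrative response above (check the documentation for the authoritative schema):

```python
def language_runs(utterance):
    """Group an utterance's word list into contiguous same-language
    runs, e.g. to route each run to per-language NLP downstream.

    utterance: dict with a 'words' list, each word carrying 'word',
    'start', 'end', and 'language' fields.
    """
    runs = []
    for w in utterance["words"]:
        if runs and runs[-1]["language"] == w["language"]:
            runs[-1]["words"].append(w["word"])
            runs[-1]["end"] = w["end"]
        else:
            runs.append({"language": w["language"],
                         "words": [w["word"]],
                         "start": w["start"],
                         "end": w["end"]})
    return runs
```

For the example utterance this yields one Spanish run ("Estoy") followed by one English run ("checking my email"), each with its own time span.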
The 100+ language coverage includes languages commonly associated with code-switching with English: Tagalog, Bengali, Punjabi, Tamil, Urdu, and Marathi, among others. For meeting assistants and analytics pipelines, this coverage means post-session transcripts accurately reflect language switches without manual correction before downstream sentiment analysis or entity extraction runs.
Predictable unit economics for multilingual pipelines
The pricing model removes a variable that makes multilingual pipelines hard to model at scale. Gladia uses usage-based hourly pricing, with features including diarization, translation, sentiment analysis, named entity recognition, and code-switching detection included across paid plans. See the pricing page for the complete feature breakdown by tier.
Diarization runs alongside the async transcription pipeline and is powered by pyannoteAI's Precision-2 model. Speaker attribution does not require a separate API call or add-on configuration at supported tiers; check the pricing page for the full feature breakdown per plan.
At 1,000 hours per month, the cost model looks like this:
| Provider | Base transcription | Diarization | NER + Sentiment | 1,000-hr total |
| --- | --- | --- | --- | --- |
| Gladia (Starter, async) | $0.61/hr | Included | Included | ~$610 (ceiling) |
| Gladia (Growth, async) | from $0.20/hr | Included | Included | from ~$200 (ceiling) |
| AssemblyAI | $0.15/hr | Add-on | Add-on | $150 base + add-ons |
| Deepgram (Nova-3) | ~$0.258/hr | Included | Add-on | ~$258 base + add-ons |
Based on the pricing examples above, feature-metered providers can end up materially above their headline base rate once add-ons are included. For Gladia's Starter tier, $610 is the ceiling at 1,000 hours; for the Growth tier, the comparable ceiling starts around $200 for the same workload, depending on volume. For feature-metered competitors, the base rate is the floor.
On paid plans, customer data is not used for model training by default. Enterprise deployments can enable zero data retention and stricter data residency configurations where required. These controls are relevant to compliance reviews against GDPR, SOC 2 Type 2, HIPAA, and ISO 27001.
Evaluating any STT provider against your own production audio, covering the language pairs, accent distributions, and code-switching patterns your users actually produce, is the most reliable basis for a deployment decision.
FAQs
What is the latency impact of code-switching detection using a post-hoc LID pipeline?
The primary cost is accuracy degradation: segmentation errors and stitching artifacts accumulate across the LID inference, routing, and stitching stages before a final transcript is produced. Latency becomes a secondary compounding factor in real-time cases. Solaria-1 handles detection and transcription in a single inference pass, which avoids both the added latency and the additional failure surface introduced by a separate LID stage.
How many languages does Gladia support for real-time code-switching?
Solaria-1 supports code-switching across 100+ languages in both real-time and async modes, with no manual language configuration required.
Does Gladia's pricing change when code-switching detection is enabled?
No. Code-switching is included within Gladia’s pricing model rather than charged as a separate per-feature add-on.
What datasets does Gladia use to evaluate multilingual WER?
Solaria-1 is evaluated in Gladia’s published benchmark methodology across 8 providers, 7 datasets, and 74+ hours of multilingual audio. The evaluation includes Common Voice 24, VoxPopuli Cleaned AA, Earnings22 Full, Earnings22 Cleaned AA, Multilingual LibriSpeech, Switchboard, and the Pipecat STT Benchmark. The full benchmark methodology is publicly available.
Key terminology
Language Identification (LID): A model or component that classifies which language is present in an audio segment or text string. In traditional ASR pipelines, LID runs as a separate pre-processing step before transcription begins, which introduces latency and creates failure surfaces at every language boundary.
Intra-utterance code-switching: A language change that occurs within a single sentence or utterance, often at the word or morpheme level. This is significantly harder for post-hoc detection approaches to handle than inter-utterance switching, which occurs at sentence boundaries where a LID model has a natural pause to recalibrate.
Word Error Rate (WER): The standard metric for ASR accuracy, calculated as (Substitutions + Deletions + Insertions) / Total Reference Words. WER is most meaningful when reported with a specific language, audio condition, and benchmark dataset, because clean-audio WER numbers don't predict performance on accented, noisy, or mixed-language production audio.
Token-level language tagging: An architecture where the ASR model is designed to predict a language identifier alongside each decoded text token, which can enable language boundary detection within a single inference pass rather than through a separate detection stage. This approach aims to support native code-switching handling without post-hoc LID routing overhead.