TL;DR: Code-switching, where speakers shift between two or more languages mid-conversation, is standard behavior in global contact centers, not an edge case. Traditional ASR models are architected for one language at a time, so when agents and customers mix Spanish into English or French into Arabic, accuracy drops sharply and downstream AI fails silently: the transcript looks complete but is missing the most emotionally charged parts of the call. Native multilingual models like Gladia's Solaria-1 handle language transitions inside a single model path with 120–180ms latency, eliminating the routing complexity that degrades accuracy and inflates AHT.
Your sentiment analysis tool rates a furious customer "Neutral" because the insults were in Spanish. Your compliance scan misses a threat because it was spoken in French. A QA score lands on your desk for a call where 30% of the audio was never transcribed accurately. These aren't hypothetical failures. They are the downstream cost of transcription architecture built for monolingual conversations, not for how bilingual speakers actually talk.
Defining code-switching in the modern contact center
Code-switching is the process of shifting from one linguistic code to another depending on social context or the conversational setting. In contact centers, it happens constantly and automatically, especially in markets like the US, Southeast Asia, Latin America, and North Africa where bilingual populations are the norm.
Three distinct patterns appear in call recordings:
- Intersentential switching: A complete switch at sentence boundaries. "The package is delayed. Lo siento, no tengo más información." English sentence, then Spanish sentence.
- Intrasentential switching: A mid-sentence switch that still follows grammatical rules. "I can help you with that, pero necesito tu número de cuenta before we proceed."
- Tag switching: A word or tag phrase from one language inserted into another. "Your appointment is confirmed for tomorrow at 2 PM, ¿entiendes?"
All three appear in real support calls. Intrasentential switching is the hardest for ASR to handle because the language boundary falls inside a grammatical unit rather than between units.
The chameleon effect: why agents switch languages without thinking
The chameleon effect describes the nonconscious behavioral mimicry of an interaction partner's postures, mannerisms, and speech patterns. In contact centers, this means agents naturally mirror a customer's language choice the moment they detect a shift. Research on the chameleon effect confirms that this mirroring builds rapport and increases satisfaction. It's not a policy violation. It's good customer service executing the way the human brain is wired.
Why traditional transcription engines fail on mixed audio
The core architectural problem is simple: monolingual ASR models train on monolingual datasets and operate under a hard constraint where each inference pass assumes a single active language. When a speaker switches, the model doesn't detect the switch. It continues applying the phoneme probabilities of the original language to sounds that belong to a different one.
The result is hallucination, where the model forces foreign phonemes into the most probable words in its training language, or drops the audio segment entirely. The model is operating correctly within its design. The design just doesn't match the audio.
Research on code-switched ASR confirms this is an unresolved problem for legacy architectures: accuracy declines with code-switching due to pronunciation variation that falls outside the monolingual model's acoustic space. In production, according to Hamming AI's ASR benchmarking analysis, recognition accuracy drops from 95% to 72% when code-switching is encountered, triggering incorrect intent routing and a collapse in task completion rates. That 23-percentage-point drop on a single call type scales quickly across a contact center handling thousands of bilingual interactions daily.
The latency penalty of Language ID routing
Some teams try to solve this with a "Language Identification + Routing" pipeline: detect the language first and then route to the appropriate monolingual model. Research on cascade multilingual architectures shows this introduces a strict blocking dependency where ASR cannot begin decoding until the LID module confirms which model to load.
Latency benchmarks for speech pipelines place typical cascade architectures at 380 to 450ms end-to-end, compared to 120 to 180ms for unified multilingual systems handling similar workloads. That extra routing overhead can matter in real-time voice workflows, but for many transcription and meeting-analysis use cases, especially asynchronous ones, the more important factor is whether the system maintains accuracy when languages shift mid-call. In these scenarios, reliable code-switching handling has a greater impact on overall performance than marginal latency differences.
The LID + routing approach also fails on intrasentential switching. By the time the LID module identifies a language, the sentence is already mid-flight, and the model has already begun decoding with the wrong acoustic priors.
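The blocking dependency and the mid-sentence failure can be sketched in a few lines. This is an illustrative toy, not any vendor's API: `Chunk`, `identify_language`, and `cascade_transcribe` are invented names standing in for the LID classifier and the routed monolingual decoder.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    audio: str       # placeholder for raw samples
    true_lang: str   # ground-truth language of this chunk

def identify_language(chunk: Chunk) -> str:
    # Stand-in LID classifier. In a real cascade this call must finish
    # before any decoding can start: the blocking dependency.
    return chunk.true_lang

def cascade_transcribe(chunks: list[Chunk]) -> list[str]:
    # The utterance is routed once, on the language of the first chunk.
    # A mid-utterance switch is therefore decoded by the wrong model.
    routed_lang = identify_language(chunks[0])  # blocking step
    return [f"decoded[{routed_lang}]:{c.audio}" for c in chunks]

utterance = [
    Chunk("I can help you with that", "en"),
    Chunk("pero necesito tu número de cuenta", "es"),  # intrasentential switch
]
print(cascade_transcribe(utterance))
# The Spanish chunk comes back tagged decoded[en]: it was decoded
# with English acoustic priors, which is where hallucination starts.
```

Routing per-chunk instead of per-call reduces the damage but adds a classifier call on the critical path of every chunk, which is where the cascade latency figures above come from.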
The operational cost of unaddressed code-switching
When transcription systems lack multilingual robustness and accurate language detection, you're left with invisible gaps that compound across four specific cost centers:
- Inflated AHT from manual rework: When a transcript fails on mixed-language audio, agents spend additional time correcting or recreating call records by hand. The summarization step, which AI is supposed to automate, reverts to a manual task. These failures are often driven by poor handling of code-switching and misidentified languages, increasing workload and reducing efficiency at scale.
- Compliance blind spots from “dark data”: When code-switched segments aren’t transcribed accurately due to weak multilingual handling or incorrect language detection, compliance systems scanning for flagged phrases can miss critical content. A regulatory disclosure spoken in French may not register in your audit trail, leaving gaps caused by unreliable multilingual transcription rather than visible system errors.
- Missed sentiment and insights from incomplete transcripts: When transcription systems struggle with multilingual audio, sentiment analysis and entity extraction break down, leading to incomplete or misleading insights. This directly impacts decision-making in analytics and conversation intelligence workflows.
- Agent burnout accelerated by manual correction work: Asking bilingual agents to fix transcription errors caused by poor multilingual performance adds to workload complexity. Turnover impacts AHT and CSAT, and replacing agents is costly, but the root cause often traces back to systems that cannot reliably handle real-world language switching.
Labor typically represents 70%+ of contact center operating costs. Adding manual rework to each mixed-language call multiplies that cost without any visible line item, making multilingual robustness and accurate language detection critical for controlling operational expenses at scale.
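The rework multiplier is easy to estimate for your own operation. Every number below is a hypothetical placeholder, not a benchmark; substitute your own volumes and loaded labor rates.

```python
# All inputs are hypothetical; replace with your own contact center figures.
calls_per_day = 5000
bilingual_share = 0.30            # share of calls with code-switching
rework_minutes_per_call = 3.0     # manual transcript correction time
loaded_cost_per_agent_hour = 28.0
working_days_per_year = 260

daily_rework_hours = calls_per_day * bilingual_share * rework_minutes_per_call / 60
annual_rework_cost = daily_rework_hours * loaded_cost_per_agent_hour * working_days_per_year

print(f"{daily_rework_hours:.0f} hours/day, ~${annual_rework_cost:,.0f}/year")
# → 75 hours/day, ~$546,000/year
```

Even at three minutes of rework per affected call, the invisible line item reaches six figures annually under these assumptions.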
A system that transcribes code-switched audio reliably reverses these failure modes: PII redaction and compliance scans can flag sensitive phrases regardless of which language they were spoken in, and QA teams can score calls they currently have to skip.
Solving the problem: architecture for code-switching
The comparison below shows the practical difference between the two approaches:
| Criteria | LID + routing (legacy) | Native multilingual model |
|---|---|---|
| Architecture | LID model, router, monolingual ASR | Single model, all languages |
| Latency | 380–450ms (cascade overhead) | 120–180ms (unified path) |
| Intrasentential accuracy | Fails (mid-sentence switches) | Handles natively |
| Configuration required | Language pre-selection or LID tuning | Single parameter |
| Code-switching support | Limited to sentence boundaries | Full intra- and inter-sentential |
How Solaria-1 addresses this
Gladia's Solaria-1 model handles code-switching natively across 100+ languages, without a separate language detection step. According to Gladia's STT benchmarks, Solaria-1 delivers a median time-to-final of 698ms, a time to first byte (TTFB) of approximately 270ms, and partial transcripts in under 103ms.
Enabling code-switching requires a single parameter (`code_switching: true`) in the session configuration, as detailed in Gladia's code-switching documentation. You don't tell the model which languages to expect. Once enabled, it detects and transcribes language transitions as they happen, including mid-sentence switches where LID-based pipelines fail.
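A minimal sketch of what that session configuration might look like. Only `code_switching` comes from the source above; the surrounding field names (`encoding`, `sample_rate`, `language_config`) are illustrative assumptions, so consult Gladia's code-switching documentation for the exact payload shape and endpoint.

```python
import json

# Illustrative session configuration; field names other than
# code_switching are assumptions for this sketch.
session_config = {
    "encoding": "wav/pcm",
    "sample_rate": 16000,
    "language_config": {
        "code_switching": True,
        # No language list required: the model detects transitions itself.
    },
}

# Serialize as it would be sent in a session-initiation request body.
print(json.dumps(session_config, indent=2))
```

The notable absence is any list of expected languages: the configuration opts into code-switching rather than pre-selecting a language pair.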
For automatic language detection, the same architecture applies: language identification is embedded in the model rather than treated as a blocking pre-processing step.
The downstream impact is direct. With accurate mixed-language transcripts, sentiment and emotion analysis can score the full emotional content of a call, not just the English portions.
"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors." - Yassine R. on G2
"It's an incredible fast model... it's unbelievably good for single or multi-language detection." - Paul B. on G2
Our blog post on Solaria-1 covers the model's design philosophy around accent robustness and real-time language switching. For teams experiencing ASR language bias in their current stack, this is the architectural explanation for why bias accumulates in monolingual systems.
The real-time transcription walkthrough on our YouTube channel demonstrates code-switching performance in the playground without writing any code.
A framework for evaluating code-switching accuracy
Before you run a vendor evaluation, establish exactly what you're measuring. Most WER benchmarks are run on clean, monolingual audio. Code-switching accuracy requires a different test setup.
- Build a representative test set. Pull 50–100 real calls from your highest-volume bilingual markets and manually transcribe a sample to create a reference. Don't use synthetic audio. As Hamming AI's testing guide notes, models trained on standard conditions perform dramatically worse on regional accents and background noise, so your test audio must match your production conditions.
- Measure WER at transition points specifically. Overall WER on a bilingual call can look acceptable even when the transitions are broken. Calculate WER on the two words before and after each language switch separately. That's where models fail.
- Test intrasentential and intersentential switching separately. If a vendor only handles sentence-boundary switches, they'll pass an intersentential test and fail yours in production.
- Test downstream quality and latency together. Run sentiment analysis and entity extraction on transcripts from each vendor, and measure end-to-end latency on calls with frequent language switches. If non-English segments are hallucinated, sentiment scores will look plausible but be wrong. If latency spikes during switches, there's a routing step somewhere in the pipeline.
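The transition-point measurement in the second step can be sketched as follows. The `wer` helper is a standard Levenshtein-based word error rate, and `transition_windows` assumes a reference transcript with per-word language tags; the `(word, lang)` format is an assumption for illustration, and aligning each hypothesis window to the reference (for example via the same edit-distance alignment) is left out for brevity.

```python
def wer(ref: list[str], hyp: list[str]) -> float:
    # Levenshtein word distance (substitutions, deletions, insertions)
    # divided by reference length.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

def transition_windows(tagged_ref, width=2):
    # tagged_ref: list of (word, lang) pairs. Yields the words
    # surrounding each point where the language tag changes.
    for i in range(1, len(tagged_ref)):
        if tagged_ref[i][1] != tagged_ref[i - 1][1]:
            lo, hi = max(0, i - width), min(len(tagged_ref), i + width)
            yield [w for w, _ in tagged_ref[lo:hi]]

ref = [("i", "en"), ("can", "en"), ("help", "en"),
       ("pero", "es"), ("necesito", "es"), ("tu", "es"), ("cuenta", "es")]
for window in transition_windows(ref):
    print(window)  # score wer(window, aligned_hyp_window) per switch
```

Scoring only these windows surfaces the failures that an overall WER on the full call averages away.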
The cumulative risk of leaving this unaddressed
Code-switching isn't a low-frequency failure mode you can deprioritize. In any contact center serving bilingual markets, it's present on a significant share of calls from certain customer segments. The data loss accumulates quietly because transcripts appear complete: there's no error message when the model hallucinates a word for a foreign phoneme. The transcript moves on as if nothing happened.
You can measure the unit economics of compliance exposure, degraded QA scores, inflated AHT, and agent burnout. What these costs share is that none of them appear clearly in your current vendor bill or your engineering sprint board. They show up in customer satisfaction drops, in audit findings, and in turnover numbers.
Testing your existing stack against a representative set of bilingual calls is the fastest way to quantify the gap. Use your own audio, measure WER at transition points, and run sentiment analysis on the output. The score tells you exactly what you're working with.
Test Gladia on your own multilingual audio to see how it handles automatic language detection, accent-heavy speech, and code-switching in practice. This is the fastest way to evaluate whether your current stack is missing parts of the conversation.
Frequently asked questions
What is the difference between code-switching and code-mixing?
The terms overlap in academic usage, though "code-mixing" typically refers to intrasentential switching (within a single sentence) while "code-switching" covers all three types including sentence-boundary and tag switches. In ASR evaluation contexts, treat them as the same failure mode requiring the same architectural fix.
How does code-switching affect sentiment analysis?
When a transcript misses or mistranscribes code-switched segments, sentiment analysis can only evaluate the portion of the call that was transcribed. In an otherwise English call, frustration expressed in a second language may be missed or underrepresented in the transcript, which can reduce confidence in the sentiment output. For teams using sentiment thresholds for escalation, that creates a risk of incomplete signal.
Can standard ASR models handle Spanglish?
Not reliably. Monolingual models struggle with code-switched speech because they apply single-language phoneme probabilities to audio that contains sounds from a different phoneme inventory. The model doesn't know a switch has occurred, so it keeps generating plausible words in its training language rather than transcribing what was actually said. Mid-sentence switches are a known failure point for single-language architectures, which are not designed for the abrupt phoneme and vocabulary shifts that occur when a speaker transitions between languages within a single utterance. Gladia's Solaria-1 handles these transitions natively with a single multilingual model and automatic language detection, so transcription stays accurate even when speakers switch languages mid-sentence.
Key terminology
Code-switching: The process of shifting between two or more languages or dialects within a conversation, either between sentences (intersentential), within a sentence (intrasentential), or through inserted tags (tag-switching). In contact centers, it's a natural rapport-building behavior, not an error.
Chameleon effect: The nonconscious behavioral mimicry of an interaction partner's speech patterns and mannerisms. In customer service, agents automatically mirror a customer's language choice because doing so increases rapport and satisfaction.
Language Identification (LID): A pre-processing model that classifies which language is present in an audio segment before routing to a transcription model. In cascade architectures, LID introduces blocking latency and fails on intrasentential switches.
Word Error Rate (WER): The primary accuracy metric for ASR, calculated as the number of substitutions, deletions, and insertions divided by the number of reference words. For code-switched audio, measure WER at language transition points specifically, not just overall, to detect where failures concentrate.
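A worked instance of the formula, using a hand-counted alignment on an invented reference/hypothesis pair (both sentences are illustrative, as is the error tally):

```python
# Hand-counted errors for a hallucinated-English hypothesis against a
# Spanish reference; the alignment below is worked out by inspection.
reference  = "lo siento no tengo más información".split()   # 6 words
hypothesis = "low see and no tango información".split()

substitutions = 3   # lo→low, siento→see, tengo→tango
insertions    = 1   # "and" added
deletions     = 1   # "más" dropped

wer = (substitutions + deletions + insertions) / len(reference)
print(f"WER = {wer:.2f}")   # → WER = 0.83
```

A WER above 0.8 on the code-switched span, inside a call whose English portions score well, is exactly the concentrated failure that overall WER hides.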