TL;DR: Code-switching, where speakers shift between two or more languages mid-conversation, is standard behavior in global contact centers, not an edge case. Traditional ASR models are architected for one language at a time, so when agents and customers mix Spanish into English or French into Arabic, accuracy drops sharply and downstream AI fails silently: the transcript looks complete but is missing the most emotionally charged parts of the call. Native multilingual models like Gladia's Solaria-1 handle language transitions inside a single model path with 120–180ms latency, eliminating the routing complexity that degrades accuracy and inflates AHT.
Your sentiment analysis tool rates a furious customer "Neutral" because the insults were in Spanish. Your compliance scan misses a threat because it was spoken in French. A QA score lands on your desk for a call where 30% of the audio was never transcribed accurately. These aren't hypothetical failures. They are the downstream cost of transcription architecture built for monolingual conversations, not for how bilingual speakers actually talk.
Defining code-switching in the modern contact center
Code-switching is the process of shifting from one linguistic code to another depending on social context or the conversational setting. In contact centers, it happens constantly and automatically, especially in markets like the US, Southeast Asia, Latin America, and North Africa where bilingual populations are the norm.
Three distinct patterns appear in call recordings:
- Intersentential switching: A complete switch at sentence boundaries. "The package is delayed. Lo siento, no tengo más información." English sentence, then Spanish sentence.
- Intrasentential switching: A mid-sentence switch that still follows grammatical rules. "I can help you with that, pero necesito tu número de cuenta before we proceed."
- Tag switching: A word or tag phrase from one language inserted into another. "Your appointment is confirmed for tomorrow at 2 PM, ¿entiendes?"
All three appear in real support calls. Intrasentential switching is the hardest for ASR to handle because the language boundary falls inside a grammatical unit rather than between units.
The chameleon effect: why agents switch languages without thinking
The chameleon effect describes the nonconscious behavioral mimicry of an interaction partner's postures, mannerisms, and speech patterns. In contact centers, this means agents naturally mirror a customer's language choice the moment they detect a shift. Research on the chameleon effect confirms that this mirroring builds rapport and increases satisfaction. It's not a policy violation. It's good customer service executing the way the human brain is wired.
Why traditional transcription engines fail on mixed audio
The core architectural problem is simple: monolingual ASR models train on monolingual datasets and operate under a hard constraint where each inference pass assumes a single active language. When a speaker switches, the model doesn't detect the switch. It continues applying the phoneme probabilities of the original language to sounds that belong to a different one.
The result is hallucination, where the model forces foreign phonemes into the most probable words in its training language, or drops the audio segment entirely. The model is operating correctly within its design. The design just doesn't match the audio.
Research on code-switched ASR confirms this is an unresolved problem for legacy architectures: accuracy declines with code-switching due to pronunciation variation that falls outside the monolingual model's acoustic space. In production, according to Hamming AI's ASR benchmarking analysis, recognition accuracy drops from 95% to 72% when code-switching is encountered, triggering incorrect intent routing and a collapse in task completion rates. That 23-percentage-point drop on a single call type scales quickly across a contact center handling thousands of bilingual interactions daily.
The latency penalty of Language ID routing
Some teams try to solve this with a "Language Identification + Routing" pipeline: detect the language first and then route to the appropriate monolingual model. Research on cascade multilingual architectures shows this introduces a strict blocking dependency where ASR cannot begin decoding until the LID module confirms which model to load.
Latency benchmarks for speech pipelines place typical cascade architectures at 380 to 450ms end-to-end, compared to 120 to 180ms for unified multilingual systems handling similar workloads. That extra routing overhead can matter in real-time voice workflows, but for many transcription and meeting-analysis use cases, especially asynchronous ones, the more important factor is whether the system maintains accuracy when languages shift mid-call. In these scenarios, reliable code-switching handling has a greater impact on overall performance than marginal latency differences.
The LID + routing approach also fails on intrasentential switching. By the time the LID module identifies a language, the sentence is already mid-flight, and the model has already begun decoding with the wrong acoustic priors.
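The blocking dependency and the mid-sentence failure can be sketched in a few lines. This is an illustrative toy, not any vendor's API: `Chunk`, `identify_language`, and `cascade_transcribe` are invented names standing in for the LID classifier and the routed monolingual decoder.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    audio: str       # placeholder for raw samples
    true_lang: str   # ground-truth language of this chunk

def identify_language(chunk: Chunk) -> str:
    # Stand-in LID classifier. In a real cascade this call must finish
    # before any decoding can start: the blocking dependency.
    return chunk.true_lang

def cascade_transcribe(chunks: list[Chunk]) -> list[str]:
    # The utterance is routed once, on the language of the first chunk.
    # A mid-utterance switch is therefore decoded by the wrong model.
    routed_lang = identify_language(chunks[0])  # blocking step
    return [f"decoded[{routed_lang}]:{c.audio}" for c in chunks]

utterance = [
    Chunk("I can help you with that", "en"),
    Chunk("pero necesito tu número de cuenta", "es"),  # intrasentential switch
]
print(cascade_transcribe(utterance))
# The Spanish chunk comes back tagged decoded[en]: it was decoded
# with English acoustic priors, which is where hallucination starts.
```

Routing per-chunk instead of per-call reduces the damage but adds a classifier call on the critical path of every chunk, which is where the cascade latency figures above come from.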
The operational cost of unaddressed code-switching
When transcription systems lack multilingual robustness and accurate language detection, you're left with invisible gaps that compound across four specific cost centers:
- Inflated AHT from manual rework: When a transcript fails on mixed-language audio, agents spend additional time correcting or recreating call records by hand. The summarization step, which AI is supposed to automate, reverts to a manual task. These failures are often driven by poor handling of code-switching and misidentified languages, increasing workload and reducing efficiency at scale.
- Compliance blind spots from “dark data”: When code-switched segments aren’t transcribed accurately due to weak multilingual handling or incorrect language detection, compliance systems scanning for flagged phrases can miss critical content. A regulatory disclosure spoken in French may not register in your audit trail, leaving gaps caused by unreliable multilingual transcription rather than visible system errors.
- Missed sentiment and insights from incomplete transcripts: When transcription systems struggle with multilingual audio, sentiment analysis and entity extraction break down, leading to incomplete or misleading insights. This directly impacts decision-making in analytics and conversation intelligence workflows.
- Agent burnout accelerated by manual correction work: Asking bilingual agents to fix transcription errors caused by poor multilingual performance adds to workload complexity. Turnover impacts AHT and CSAT, and replacing agents is costly, but the root cause often traces back to systems that cannot reliably handle real-world language switching.
Labor typically represents 70%+ of contact center operating costs. Adding manual rework to each mixed-language call multiplies that cost without any visible line item, making multilingual robustness and accurate language detection critical for controlling operational expenses at scale.
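The rework multiplier is easy to estimate for your own operation. Every number below is a hypothetical placeholder, not a benchmark; substitute your own volumes and loaded labor rates.

```python
# All inputs are hypothetical; replace with your own contact center figures.
calls_per_day = 5000
bilingual_share = 0.30            # share of calls with code-switching
rework_minutes_per_call = 3.0     # manual transcript correction time
loaded_cost_per_agent_hour = 28.0
working_days_per_year = 260

daily_rework_hours = calls_per_day * bilingual_share * rework_minutes_per_call / 60
annual_rework_cost = daily_rework_hours * loaded_cost_per_agent_hour * working_days_per_year

print(f"{daily_rework_hours:.0f} hours/day, ~${annual_rework_cost:,.0f}/year")
# → 75 hours/day, ~$546,000/year
```

Even at three minutes of rework per affected call, the invisible line item reaches six figures annually under these assumptions.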
A system that transcribes code-switched audio reliably reverses these failure modes: PII redaction and compliance scans can flag sensitive phrases regardless of which language they were spoken in, and QA teams can score calls they currently have to skip.
Solving the problem: architecture for code-switching
The comparison below shows the practical difference between the two approaches:
| Criteria | LID + routing (legacy) | Native multilingual model |
|---|---|---|
| Architecture | LID model, router, monolingual ASR | Single model, all languages |
| Latency | 380–450ms (cascade overhead) | 120–180ms (unified path) |
| Intrasentential accuracy | Fails (mid-sentence switches) | Handles natively |
| Configuration required | Language pre-selection or LID tuning | Single parameter |
| Code-switching support | Limited to sentence boundaries | Full intra- and inter-sentential |
How Solaria-1 addresses this
Gladia's Solaria-1 model handles code-switching natively across 100+ languages, without a separate language detection step. According to Gladia's STT benchmarks, Solaria-1 delivers a median time-to-final of 698ms, a time to first byte (TTFB) of approximately 270ms, and partial transcripts in under 103ms.
Enabling code-switching requires a single parameter (`code_switching: true`) in the session configuration, as detailed in Gladia's code-switching documentation. You don't tell the model which languages to expect. Once enabled, it detects and transcribes language transitions as they happen, including mid-sentence switches where LID-based pipelines fail.
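A minimal sketch of what that session configuration might look like. Only `code_switching` comes from the source above; the surrounding field names (`encoding`, `sample_rate`, `language_config`) are illustrative assumptions, so consult Gladia's code-switching documentation for the exact payload shape and endpoint.

```python
import json

# Illustrative session configuration; field names other than
# code_switching are assumptions for this sketch.
session_config = {
    "encoding": "wav/pcm",
    "sample_rate": 16000,
    "language_config": {
        "code_switching": True,
        # No language list required: the model detects transitions itself.
    },
}

# Serialize as it would be sent in a session-initiation request body.
print(json.dumps(session_config, indent=2))
```

The notable absence is any list of expected languages: the configuration opts into code-switching rather than pre-selecting a language pair.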
For automatic language detection, the same architecture applies: language identification is embedded in the model rather than treated as a blocking pre-processing step.
The downstream impact is direct. With accurate mixed-language transcripts, sentiment and emotion analysis can score the full emotional content of a call, not just the English portions.
"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors." - Yassine R. on G2
"It's an incredible fast model... it's unbelievably good for single or multi-language detection." - Paul B. on G2
Our blog post on Solaria-1 covers the model's design philosophy around accent robustness and real-time language switching. For teams experiencing ASR language bias in their current stack, this is the architectural explanation for why bias accumulates in monolingual systems.
The real-time transcription walkthrough on our YouTube channel demonstrates code-switching performance in the playground without writing any code.
A framework for evaluating code-switching accuracy
Before you run a vendor evaluation, establish exactly what you're measuring. Most WER benchmarks are run on clean, monolingual audio. Code-switching accuracy requires a different test setup.
- Build a representative test set. Pull 50–100 real calls from your highest-volume bilingual markets and manually transcribe a sample to create a reference. Don't use synthetic audio. As Hamming AI's testing guide notes, models trained on standard conditions perform dramatically worse on regional accents and background noise, so your test audio must match your production conditions.
- Measure WER at transition points specifically. Overall WER on a bilingual call can look acceptable even when the transitions are broken. Calculate WER on the two words before and after each language switch separately. That's where models fail.
- Test intrasentential and intersentential switching separately. If a vendor only handles sentence-boundary switches, they'll pass an intersentential test and fail yours in production.
- Test downstream quality and latency together. Run sentiment analysis and entity extraction on transcripts from each vendor, and measure end-to-end latency on calls with frequent language switches. If non-English segments are hallucinated, sentiment scores will look plausible but be wrong. If latency spikes during switches, there's a routing step somewhere in the pipeline.
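The transition-point measurement in the second step can be sketched as follows. The `wer` helper is a standard Levenshtein-based word error rate, and `transition_windows` assumes a reference transcript with per-word language tags; the `(word, lang)` format is an assumption for illustration, and aligning each hypothesis window to the reference (for example via the same edit-distance alignment) is left out for brevity.

```python
def wer(ref: list[str], hyp: list[str]) -> float:
    # Levenshtein word distance (substitutions, deletions, insertions)
    # divided by reference length.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

def transition_windows(tagged_ref, width=2):
    # tagged_ref: list of (word, lang) pairs. Yields the words
    # surrounding each point where the language tag changes.
    for i in range(1, len(tagged_ref)):
        if tagged_ref[i][1] != tagged_ref[i - 1][1]:
            lo, hi = max(0, i - width), min(len(tagged_ref), i + width)
            yield [w for w, _ in tagged_ref[lo:hi]]

ref = [("i", "en"), ("can", "en"), ("help", "en"),
       ("pero", "es"), ("necesito", "es"), ("tu", "es"), ("cuenta", "es")]
for window in transition_windows(ref):
    print(window)  # score wer(window, aligned_hyp_window) per switch
```

Scoring only these windows surfaces the failures that an overall WER on the full call averages away.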
The cumulative risk of leaving this unaddressed
Code-switching isn't a low-frequency failure mode you can deprioritize. In any contact center serving bilingual markets, it's present on a significant share of calls from certain customer segments. The data loss accumulates quietly because transcripts appear complete: there's no error message when the model hallucinates a word for a foreign phoneme. The transcript moves on as if nothing happened.
You can measure the unit economics of compliance exposure, degraded QA scores, inflated AHT, and agent burnout. What these costs share is that none of them appear clearly in your current vendor bill or your engineering sprint board. They show up in customer satisfaction drops, in audit findings, and in turnover numbers.
Testing your existing stack against a representative set of bilingual calls is the fastest way to quantify the gap. Use your own audio, measure WER at transition points, and run sentiment analysis on the output. The score tells you exactly what you're working with.
Test Gladia on your own multilingual audio to see how it handles automatic language detection, accent-heavy speech, and code-switching in practice. This is the fastest way to evaluate whether your current stack is missing parts of the conversation.
Frequently asked questions
What is the difference between code-switching and code-mixing?
The terms overlap in academic usage, though "code-mixing" typically refers to intrasentential switching (within a single sentence) while "code-switching" covers all three types including sentence-boundary and tag switches. In ASR evaluation contexts, treat them as the same failure mode requiring the same architectural fix.
How does code-switching affect sentiment analysis?
When a transcript misses or mistranscribes code-switched segments, sentiment analysis can only evaluate the portion of the call that was transcribed. In an otherwise English call, frustration expressed in a second language may be missed or underrepresented in the transcript, which can reduce confidence in the sentiment output. For teams using sentiment thresholds for escalation, that creates a risk of incomplete signal.
Can standard ASR models handle Spanglish?
Not reliably. Monolingual models struggle with code-switched speech because they apply single-language phoneme probabilities to audio that contains sounds from a different phoneme inventory. The model doesn't know a switch has occurred, so it keeps generating plausible words in its training language rather than transcribing what was actually said. Mid-sentence switches are a known failure point for single-language architectures, which are not designed for the abrupt phoneme and vocabulary shifts that occur when a speaker transitions between languages within a single utterance. Gladia's Solaria-1 handles these transitions natively with a single multilingual model and automatic language detection, so transcription stays accurate even when speakers switch languages mid-sentence.
Key terminology
Code-switching: The process of shifting between two or more languages or dialects within a conversation, either between sentences (intersentential), within a sentence (intrasentential), or through inserted tags (tag-switching). In contact centers, it's a natural rapport-building behavior, not an error.
Chameleon effect: The nonconscious behavioral mimicry of an interaction partner's speech patterns and mannerisms. In customer service, agents automatically mirror a customer's language choice because doing so increases rapport and satisfaction.
Language Identification (LID): A pre-processing model that classifies which language is present in an audio segment before routing to a transcription model. In cascade architectures, LID introduces blocking latency and fails on intrasentential switches.
Word Error Rate (WER): The primary accuracy metric for ASR, calculated as the number of substitutions, deletions, and insertions divided by the number of reference words. For code-switched audio, measure WER at language transition points specifically, not just overall, to detect where failures concentrate.
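A worked instance of the formula, using a hand-counted alignment on an invented reference/hypothesis pair (both sentences are illustrative, as is the error tally):

```python
# Hand-counted errors for a hallucinated-English hypothesis against a
# Spanish reference; the alignment below is worked out by inspection.
reference  = "lo siento no tengo más información".split()   # 6 words
hypothesis = "low see and no tango información".split()

substitutions = 3   # lo→low, siento→see, tengo→tango
insertions    = 1   # "and" added
deletions     = 1   # "más" dropped

wer = (substitutions + deletions + insertions) / len(reference)
print(f"WER = {wer:.2f}")   # → WER = 0.83
```

A WER above 0.8 on the code-switched span, inside a call whose English portions score well, is exactly the concentrated failure that overall WER hides.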