
Code-switching vs. language identification: what's the difference?

Published on April 1, 2026
Ani Ghazaryan

Code-switching detection transcribes multilingual speech accurately; language identification routes audio to a single model and fails on mid-sentence switches.

TL;DR: Language identification (LID) detects the primary language in an audio file and routes it to a monolingual model. Code-switching detection transcribes accurately when a speaker changes languages mid-sentence, which LID alone can't handle. Standard ASR systems fail on code-switched audio, producing WER regressions that degrade sentiment analysis, NER, and diarization. Gladia's Solaria-1 handles native code-switching across 100+ languages, with strong performance on multilingual and accented speech, without requiring routing pipelines or additional feature configuration.

Your QA pipeline passed on English audio. The support tickets from bilingual users switching between English and Tagalog mid-conversation tell a different story, and the WER regressions don't just affect the transcript. They cascade into every downstream feature your product depends on: sentiment scores go wrong, named entities get dropped, and diarization breaks on the speaker turn that triggered the language switch.

The root cause is an architectural mismatch. Most vendors sell "multilingual support" that is actually language identification, routing audio to separate monolingual models. That architecture works until a speaker changes languages mid-utterance, at which point the pipeline receives input it was never trained to handle. Understanding this distinction before you commit to a vendor is the difference between a production system that holds up across language pairs and one that generates support tickets from everyone outside your primary market.

Defining the core concepts

What is language identification (LID)?

Language identification is a pre-processing step that identifies the spoken language in audio by comparing acoustic features against a set of supported language profiles. Once LID identifies the dominant language, a monolingual ASR back-end takes over.

The cascade multilingual ASR architecture runs LID as a front-end alongside multiple monolingual back-ends, passing the transcript from whichever model the LID output selects. This works reliably when a speaker uses one language throughout a session, but it has a structural limitation: it classifies the dominant language of an utterance, not every language token within it. When the language changes mid-sentence, the LID module either misses the re-route timing or fails to re-route at all.
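The structural limitation above can be sketched as a toy routing pipeline. Everything here is illustrative: the word-level language tags, the `detect_dominant_language` heuristic, and the per-language model stubs stand in for real acoustic LID and monolingual ASR components.

```python
from collections import Counter

def make_model(vocab):
    """A toy monolingual 'model': words outside its vocabulary come back <unk>."""
    return lambda words: [w if w in vocab else "<unk>" for w in words]

# Stand-ins for monolingual ASR back-ends, each limited to one language.
MONOLINGUAL_MODELS = {
    "en": make_model({"good", "morning", "how", "are", "can", "i", "help", "you"}),
    "fil": make_model({"okay", "pero", "hindi", "ko", "maintindihan",
                       "ang", "inyong", "sinabi"}),
}

# Toy word-level language tags, standing in for acoustic LID features.
WORD_LANG = {"good": "en", "morning": "en", "how": "en", "are": "en", "you": "en",
             "okay": "fil", "pero": "fil", "hindi": "fil", "ko": "fil"}

def detect_dominant_language(words):
    """LID front-end: classify the utterance by its majority language."""
    counts = Counter(WORD_LANG.get(w, "en") for w in words)
    return counts.most_common(1)[0][0]

def cascaded_transcribe(words):
    """Cascade architecture: route the whole utterance to one back-end."""
    lang = detect_dominant_language(words)
    return MONOLINGUAL_MODELS[lang](words)

# A code-mixed utterance routes to the Filipino model (the majority language),
# which cannot cover the English tail -- those words come back as <unk>.
print(cascaded_transcribe(["okay", "pero", "hindi", "ko", "how", "are", "you"]))
# -> ['okay', 'pero', 'hindi', 'ko', '<unk>', '<unk>', '<unk>']
```

The failure mode is structural, not a tuning problem: a single routing decision per utterance cannot serve an utterance containing two languages.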

A concrete example: a call center agent in Manila opens with "Good morning, how can I help you?" and the customer responds with "Okay, pero hindi ko maintindihan ang inyong sinabi." LID identifies the dominant language as Filipino, but the model now needs to handle phonological transitions that a pure Filipino monolingual model wasn't trained on, and accuracy drops immediately.

What is code-switching?

Code-switching is the linguistic phenomenon where a multilingual speaker alternates between two or more languages during the same conversation. It's intentional and rule-governed, not random.

Sociolinguistics research identifies two structural types:

  • Intersentential switching: The switch happens at a sentence boundary. Example in Assyrian-English: "Ani wideili. What happened?"
  • Intrasentential switching: The switch happens within a single sentence. Example in Spanish-English: "La onda is to fight y jambar."

Both types occur naturally in customer support calls, international meetings, and any product serving users across language boundaries. A model that only identifies the primary language of an utterance cannot handle either type correctly, because multilingual ASR models must anticipate that each sample may contain more than one language and train accordingly.

Code-switching vs. code-mixing

Teams often use these terms interchangeably, but the distinction matters for ASR evaluation. Code-switching alternates languages across sentence boundaries, while code-mixing embeds material from one language inside a sentence of another, inserting a lexical item from one language into the grammatical structure of the other.

| Concept | Definition | Linguistic example | ASR challenge |
| --- | --- | --- | --- |
| Code-switching | Alternating languages across sentence boundaries | "I was going to the store. Pero no tenía dinero." | LID may re-route between sentences but adds latency |
| Code-mixing | Blending languages within the same sentence | "I was going to the tienda but I forgot my cartera." | Cascaded LID routing faces challenges without sentence boundaries |

Code-mixing is harder for any cascaded LID system because no sentence boundary triggers a model switch. The system receives a single utterance containing phonemes from two languages and must transcribe both correctly without a routing signal.

Why code-switching breaks standard ASR systems

Linguistic and acoustic challenges

Monolingual ASR models train on phoneme inventories specific to one language. When a speaker switches languages mid-utterance, the model encounters phonemes that fall outside its trained inventory, producing hallucinated words that approximately match the acoustic signal or dropped segments. Amazon Science research on ASR and LID documents why cascaded pipelines fail here: running multiple monolingual ASR systems in parallel with a standalone LID module is "neither cost-effective for more than two languages, nor suitable for on-device scenarios where compute resources and memory are limited."

For contact center audio, 8kHz telephony compression removes the frequency information that helps distinguish similar phonemes across language pairs, causing accuracy to degrade significantly on compressed code-switched call recordings compared to studio-quality monolingual audio.

Data scarcity in low-resource languages

Training a code-switching model requires annotated audio with actual language alternations, and that data is scarce for most language pairs, making development costly and hard to scale. The MDPI research on CS ASR identifies severe data imbalances for non-English-dominant languages as a persistent challenge. Language pairs like English-Tagalog or French-Arabic have a fraction of the annotated data available for English-Spanish, which is itself sparse compared to monolingual English datasets, so vendors without access to diverse multilingual training data often struggle on production audio from BPO and contact center environments.

Evaluating ASR performance: WER and DER

Two metrics matter most when evaluating code-switching capability in production.

Word Error Rate (WER) measures ASR accuracy using the formula: (Substitutions + Deletions + Insertions) / Total Reference Words. A 5% WER on clean English audio can inflate to 20-30% or higher on code-switched audio with the same model. Microsoft Research found a 6.3% relative WER reduction from training specifically on code-switched data, confirming that monolingual training data is structurally insufficient for bilingual audio.
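The WER formula above can be computed with a standard edit-distance alignment. This is a minimal sketch over whitespace-tokenized words; production evaluation would also normalize casing and punctuation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference
    words, computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution / match
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words -> 20% WER.
print(wer("la onda is to fight", "la onda is to find"))  # 0.2
```

A single hallucinated word per five-word utterance already produces the 20% figure cited above, which is why per-language WER on your own code-switched audio is worth measuring before committing to a vendor.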

Diarization Error Rate (DER) measures the percentage of total recording time incorrectly attributed to a speaker or to non-speech. In code-switched audio, DER can rise when speaker turn detection systems trained on monolingual audio register a language switch as a speaker change. The pyannote metrics reference documents DER as the de facto standard for evaluating diarization, and you should request it specifically for multilingual audio when evaluating vendors.
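DER reduces to simple arithmetic over its three error components. In practice you would use a tool like pyannote.metrics against reference annotations rather than this toy calculation, but the sketch shows what the metric actually aggregates.

```python
def der(false_alarm: float, missed: float, confusion: float, total: float) -> float:
    """Diarization Error Rate: fraction of total recording time that is
    false-alarm speech, missed speech, or speaker confusion."""
    return (false_alarm + missed + confusion) / total

# 60-minute call: 1 min of false-alarm speech, 2 min missed, and 3 min where
# a language switch was misread as a speaker change (confusion) -> 10% DER.
print(der(false_alarm=1.0, missed=2.0, confusion=3.0, total=60.0))  # 0.1
```

Note that the language-switch failure mode shows up in the confusion term: a few minutes of misattributed turns per hour is enough to breach the 10% production target discussed below.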

Typical production targets include WER below 5% for your primary language on realistic audio and DER below 10% for multi-speaker calls. Gladia’s latest benchmarks evaluate 8 STT providers across 7 datasets and 74 hours of audio, including Mozilla Common Voice and Google FLEURS, providing a clearer reference point for validating vendor claims.

What this costs at scale

The pricing architecture matters as much as the technical architecture when you're modeling unit economics at 10,000 hours per month. Stacked add-on pricing makes costs harder to forecast at volume because each separately metered feature adds a fixed per-minute or per-token charge on top of the others, and those incremental line items sum across your total usage.

The table below models total cost at three volumes with a standard feature set (diarization, NER, sentiment analysis, summarization) enabled, using the Gladia pricing page, the AssemblyAI pricing breakdown, and the Deepgram pricing breakdown:

| Monthly volume | Gladia Growth async (all features included) | AssemblyAI (base + 4 add-ons) | Deepgram (base rate only, add-ons extra) |
| --- | --- | --- | --- |
| 100 hours | $20.00 | ~$43.00 | ~$25.80 base + variable add-ons |
| 1,000 hours | $200.00 | ~$430.00 | ~$258.00 base + variable add-ons |
| 10,000 hours | $2,000.00 | ~$4,300.00 | ~$2,580.00 base + variable add-ons |

AssemblyAI notes: Base rate is $0.15/hr. Adding sentiment analysis ($0.02/hr), summarization ($0.03/hr), entity detection ($0.08/hr), and topic detection ($0.15/hr) brings the effective all-feature rate to approximately $0.43/hr, which is what the figures above reflect. Pricing verified as of Q1 2024.

Deepgram notes: Base rate is approximately $0.258/hr ($4.30/1,000 min). Audio intelligence features (sentiment analysis, topic detection, summarization) bill per token rather than per minute, so the true all-feature cost is both higher than the figures above and harder to forecast before you process the audio.

At 10,000 hours with Deepgram, your actual bill depends on the token count of your transcripts for intelligence features, which you can't reliably project in advance.

Gladia notes: Gladia's Growth plan at $0.20/hr async is the lowest all-inclusive rate in the comparison. At 10,000 hours, the total is $2,000 with diarization, sentiment analysis, NER, summarization, code-switching, and translation included. No feature adds a separate line item to the bill. The Starter plan at $0.61/hr offers the same feature set on a pay-as-you-go basis with no volume commitment, making it the entry point for teams validating accuracy before committing to volume.

At every volume tier, Gladia's all-inclusive model produces a lower total cost than competitors once diarization and audio intelligence features are counted in.
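The comparison above can be reproduced with a simple per-hour cost model using the rates cited in the notes (Gladia $0.20/hr all-inclusive, AssemblyAI $0.15/hr base plus four add-ons, Deepgram ~$0.258/hr base). Deepgram's per-token intelligence features are deliberately left out, since they cannot be priced per hour in advance.

```python
# AssemblyAI per-hour add-on rates from the pricing notes above.
ASSEMBLYAI_ADDONS = {"sentiment": 0.02, "summarization": 0.03,
                     "entity_detection": 0.08, "topic_detection": 0.15}

def monthly_cost(hours: float, base_rate: float, addon_rates=()) -> float:
    """Total monthly bill under per-hour metered pricing: base rate plus
    any separately metered add-ons, each multiplied by usage hours."""
    return round(hours * (base_rate + sum(addon_rates)), 2)

for hours in (100, 1_000, 10_000):
    gladia = monthly_cost(hours, base_rate=0.20)                      # all-inclusive
    assembly = monthly_cost(hours, 0.15, ASSEMBLYAI_ADDONS.values())  # base + 4 add-ons
    deepgram = monthly_cost(hours, 0.258)                             # base only
    print(f"{hours:>6} hrs  Gladia ${gladia}  AssemblyAI ${assembly}  Deepgram ${deepgram}+")
```

Running the loop reproduces the table: each separately metered add-on is a fixed per-hour increment, so stacked pricing scales linearly with volume and the gap widens in absolute dollars as usage grows.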

Approaches to handling multilingual audio

The unified framework approach

The alternative to cascaded LID-plus-monolingual-ASR is a unified end-to-end multilingual model that processes multiple languages simultaneously without a routing step. The tokenizer vocabulary covers all supported languages, so the model handles language transitions within a single inference pass rather than handing off between components.

The practical outcome: no latency gap between LID classification and ASR inference, and no routing error when the dominant language classification is wrong. When evaluating vendors, ask directly whether their multilingual model uses a unified end-to-end architecture or a cascaded LID-plus-routing system, because the latency, accuracy, and cost implications differ significantly.

How Gladia handles code-switching in production

Native code-switching detection

Solaria-1, launched in January 2026, handles code-switching natively across 100+ supported languages, including 42 not available on competing APIs, among them Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi. These languages are spoken at high volume in BPO contact centers across Southeast Asia and South Asia, where English-to-local-language code-switching is a daily operational reality. Gladia's BPO use case page documents how teams in those environments deploy the API.

You enable code-switching through the code-switching API parameter, and automatic language detection handles the rest within the same inference pass. Each utterance segment in the API response carries a language tag and word-level timestamps, which downstream features consume directly without additional processing. Speaker diarization (powered by pyannoteAI's Precision-2 model via the Gladia diarization pipeline) runs alongside transcription, and sentiment analysis, NER, and summarization from the audio intelligence suite apply to the full multilingual transcript with no additional configuration. Partial transcripts arrive in under 103ms and final transcripts at 270ms latency for real-time pipelines, where latency budget is a constraint.
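A request enabling code-switching and diarization in one pass might look like the sketch below. Treat the endpoint URL, header name, and parameter names (`detect_language`, `enable_code_switching`, `diarization`) as assumptions to verify against Gladia's API reference; the point is that code-switching is a single flag on the request, not a separate routing pipeline.

```python
import json
import urllib.request

# Assumption: endpoint and header names should be checked against Gladia's docs.
API_URL = "https://api.gladia.io/v2/pre-recorded"

def build_request(audio_url: str, api_key: str) -> urllib.request.Request:
    """Build a transcription request with code-switching and diarization
    enabled in the same inference pass. Parameter names are assumptions."""
    payload = {
        "audio_url": audio_url,
        "detect_language": True,        # automatic language detection
        "enable_code_switching": True,  # transcribe mid-utterance switches
        "diarization": True,            # speaker labels alongside language tags
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-gladia-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("https://example.com/support-call.mp3", "YOUR_API_KEY")
# urllib.request.urlopen(req) would submit the job; the response's utterance
# segments would then carry per-segment language tags and word timestamps.
print(req.get_method(), json.loads(req.data)["enable_code_switching"])
```

Because every option rides on the same request, there is no routing layer to configure or separately meter, which is the architectural contrast with cascaded LID systems drawn above.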

Claap, a meeting intelligence platform, achieved 1-3% WER in production across multiple languages and transcribes one hour of video in under 60 seconds using Gladia's async transcription workflow. Xavier G. from Scoreplay describes the integration experience on G2:

"In less than a day of dev work we were able to release a state-of-the-art speech-to-text engine! We have tested it across many many languages (we work with commentators in pro sports around the world) and have found great accuracy even with custom fields such as team names, player names, etc. We have never come across any sort of hallucination." - Verified user review of Gladia

When you're modeling your ASR vendor decision at 10,000 hours per month, the difference between language identification and native code-switching can affect whether your cost model remains predictable when users switch languages mid-sentence. The free tier gives you 10 hours to run Gladia against your own code-switched audio, and multiple customers report sub-24-hour integration. Start with 10 free hours and test Solaria-1 on the audio your models actually fail on today.

FAQs

What's the exact technical difference between LID and code-switching detection?

LID identifies the dominant language of an audio segment and routes it to a monolingual ASR model. Code-switching detection transcribes multiple languages within a single utterance without a routing step, so it handles intrasentential code-mixing that LID routing structurally cannot.

What WER should I expect on code-switched audio vs. monolingual audio?

Monolingual models can experience significant WER degradation on code-switched audio compared to monolingual audio. Microsoft Research documented up to 6.3% relative WER improvement from adding code-switched training data to a previously monolingual model.

Does enabling code-switching in Gladia affect latency?

No. Solaria-1 handles code-switching natively within the same inference pass, so no additional routing or classification step adds to your latency budget. Partial transcript latency stays under 103ms and final transcript latency stays at 270ms regardless of whether code-switching is enabled.

Which languages does Gladia's code-switching support cover?

100+ languages supported by Solaria-1, including 42 not available on competing APIs, among them Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi.

Key terms

Language identification (LID): A pre-processing step that classifies the dominant language in an audio segment to route it to a monolingual ASR model. LID works reliably for single-language audio but cannot handle intrasentential language mixing.

Code-switching: Alternating between two or more languages within a conversation, either between sentences (intersentential) or within a single sentence (intrasentential). It's a naturally occurring feature of bilingual and multilingual speech communities.

Code-mixing: A subset of code-switching where lexical items from one language are inserted into the grammatical structure of another within the same sentence. Code-mixing is harder for cascaded LID systems because no sentence boundary triggers model re-routing.

Word Error Rate (WER): The standard ASR accuracy metric, calculated as (Substitutions + Deletions + Insertions) / Total Reference Words. WER degrades significantly on code-switched audio with monolingual models compared to matched-language audio.

Diarization Error Rate (DER): The percentage of total recording time incorrectly attributed to a speaker or to non-speech. DER increases on code-switched audio when speaker turn detection trains exclusively on monolingual data and interprets language switches as speaker changes.
