
Code-switching vs. language identification: what's the difference?

Published on April 1, 2026
Ani Ghazaryan

Code-switching detection transcribes multilingual speech accurately; language identification routes audio to a single model and fails on mid-sentence switches.

TL;DR: Language identification (LID) detects the primary language in an audio file and routes it to a monolingual model. Code-switching detection transcribes accurately when a speaker changes languages mid-sentence, which LID alone can't handle. Standard ASR systems fail on code-switched audio, producing WER regressions that degrade sentiment analysis, NER, and diarization. Gladia's Solaria-1 handles native code-switching across 100+ languages, with strong performance on multilingual and accented speech, without requiring routing pipelines or additional feature configuration.

Your QA pipeline passed on English audio. The support tickets from bilingual users switching between English and Tagalog mid-conversation tell a different story, and the WER regressions don't just affect the transcript. They cascade into every downstream feature your product depends on: sentiment scores go wrong, named entities get dropped, and diarization breaks on the speaker turn that triggered the language switch.

The root cause is an architectural mismatch. Most vendors sell "multilingual support" that is actually language identification, routing audio to separate monolingual models. That architecture works until a speaker changes languages mid-utterance, at which point the pipeline receives input it was never trained to handle. Understanding this distinction before you commit to a vendor is the difference between a production system that holds up across language pairs and one that generates support tickets from everyone outside your primary market.

Defining the core concepts

What is language identification (LID)?

Language identification is a pre-processing step that identifies the spoken language in audio by comparing acoustic features against a set of supported language profiles. Once LID identifies the dominant language, a monolingual ASR back-end takes over.

The cascade multilingual ASR architecture runs LID as a front-end alongside multiple monolingual back-ends, passing the transcript from whichever model the LID output selects. This works reliably when a speaker uses one language throughout a session, but it has a structural limitation: it classifies the dominant language of an utterance, not every language token within it. When the language changes mid-sentence, the LID module either misses the re-route timing or fails to re-route at all.
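The structural limitation above can be sketched as a toy routing pipeline. Everything here is illustrative: the word-level language tags, the `detect_dominant_language` heuristic, and the per-language model stubs stand in for real acoustic LID and monolingual ASR components.

```python
from collections import Counter

def make_model(vocab):
    """A toy monolingual 'model': words outside its vocabulary come back <unk>."""
    return lambda words: [w if w in vocab else "<unk>" for w in words]

# Stand-ins for monolingual ASR back-ends, each limited to one language.
MONOLINGUAL_MODELS = {
    "en": make_model({"good", "morning", "how", "are", "can", "i", "help", "you"}),
    "fil": make_model({"okay", "pero", "hindi", "ko", "maintindihan",
                       "ang", "inyong", "sinabi"}),
}

# Toy word-level language tags, standing in for acoustic LID features.
WORD_LANG = {"good": "en", "morning": "en", "how": "en", "are": "en", "you": "en",
             "okay": "fil", "pero": "fil", "hindi": "fil", "ko": "fil"}

def detect_dominant_language(words):
    """LID front-end: classify the utterance by its majority language."""
    counts = Counter(WORD_LANG.get(w, "en") for w in words)
    return counts.most_common(1)[0][0]

def cascaded_transcribe(words):
    """Cascade architecture: route the whole utterance to one back-end."""
    lang = detect_dominant_language(words)
    return MONOLINGUAL_MODELS[lang](words)

# A code-mixed utterance routes to the Filipino model (the majority language),
# which cannot cover the English tail -- those words come back as <unk>.
print(cascaded_transcribe(["okay", "pero", "hindi", "ko", "how", "are", "you"]))
# -> ['okay', 'pero', 'hindi', 'ko', '<unk>', '<unk>', '<unk>']
```

The failure mode is structural, not a tuning problem: a single routing decision per utterance cannot serve an utterance containing two languages.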

A concrete example: a call center agent in Manila opens with "Good morning, how can I help you?" and the customer responds with "Okay, pero hindi ko maintindihan ang inyong sinabi." LID identifies the dominant language as Filipino, but the model now needs to handle phonological transitions that a pure Filipino monolingual model wasn't trained on, and accuracy drops immediately.

What is code-switching?

Code-switching is the linguistic phenomenon where a multilingual speaker alternates between two or more languages during the same conversation. It's intentional and rule-governed, not random.

Sociolinguistics research identifies two structural types:

  • Intersentential switching: The switch happens at a sentence boundary. Example in Assyrian-English: "Ani wideili. What happened?"
  • Intrasentential switching: The switch happens within a single sentence. Example in Spanish-English: "La onda is to fight y jambar."

Both types occur naturally in customer support calls, international meetings, and any product serving users across language boundaries. A model that only identifies the primary language of an utterance cannot handle either type correctly, because multilingual ASR models must anticipate that each sample may contain more than one language and train accordingly.

Code-switching vs. code-mixing

Teams often use these terms interchangeably, but the distinction matters for ASR evaluation. Code-switching alternates languages across sentence boundaries, while code-mixing embeds material from one language inside a sentence of another, inserting a lexical item from one language into the grammatical structure of the other.

| Concept | Definition | Linguistic example | ASR challenge |
| --- | --- | --- | --- |
| Code-switching | Alternating languages across sentence boundaries | "I was going to the store. Pero no tenía dinero." | LID may re-route between sentences but adds latency |
| Code-mixing | Blending languages within the same sentence | "I was going to the tienda but I forgot my cartera." | Cascaded LID routing faces challenges without sentence boundaries |

Code-mixing is harder for any cascaded LID system because no sentence boundary triggers a model switch. The system receives a single utterance containing phonemes from two languages and must transcribe both correctly without a routing signal.

Why code-switching breaks standard ASR systems

Linguistic and acoustic challenges

Monolingual ASR models train on phoneme inventories specific to one language. When a speaker switches languages mid-utterance, the model encounters phonemes that fall outside its trained inventory, producing hallucinated words that approximately match the acoustic signal or dropped segments. Amazon Science research on ASR and LID documents why cascaded pipelines fail here: running multiple monolingual ASR systems in parallel with a standalone LID module is "neither cost-effective for more than two languages, nor suitable for on-device scenarios where compute resources and memory are limited."

For contact center audio, 8kHz telephony compression removes the frequency information that helps distinguish similar phonemes across language pairs, causing accuracy to degrade significantly on compressed code-switched call recordings compared to studio-quality monolingual audio.

Data scarcity in low-resource languages

Training a code-switching model requires annotated audio with actual language alternations, and that data is scarce for most language pairs, making development costly and hard to scale. The MDPI research on CS ASR identifies severe data imbalances for non-English-dominant languages as a persistent challenge. Language pairs like English-Tagalog or French-Arabic have a fraction of the annotated data available for English-Spanish, which is itself sparse compared to monolingual English datasets, so vendors without access to diverse multilingual training data often struggle on production audio from BPO and contact center environments.

Evaluating ASR performance: WER and DER

Two metrics matter most when evaluating code-switching capability in production.

Word Error Rate (WER) measures ASR accuracy using the formula: (Substitutions + Deletions + Insertions) / Total Reference Words. A 5% WER on clean English audio can inflate to 20-30% or higher on code-switched audio with the same model. Microsoft Research found a 6.3% relative WER reduction from training specifically on code-switched data, confirming that monolingual training data is structurally insufficient for bilingual audio.
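The WER formula above can be computed with a standard edit-distance alignment. This is a minimal sketch over whitespace-tokenized words; production evaluation would also normalize casing and punctuation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference
    words, computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution / match
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words -> 20% WER.
print(wer("la onda is to fight", "la onda is to find"))  # 0.2
```

A single hallucinated word per five-word utterance already produces the 20% figure cited above, which is why per-language WER on your own code-switched audio is worth measuring before committing to a vendor.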

Diarization Error Rate (DER) measures the percentage of total recording time incorrectly attributed to a speaker or to non-speech. In code-switched audio, DER can rise when speaker turn detection systems trained on monolingual audio register a language switch as a speaker change. The pyannote metrics reference documents DER as the de facto standard for evaluating diarization, and you should request it specifically for multilingual audio when evaluating vendors.
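DER reduces to simple arithmetic over its three error components. In practice you would use a tool like pyannote.metrics against reference annotations rather than this toy calculation, but the sketch shows what the metric actually aggregates.

```python
def der(false_alarm: float, missed: float, confusion: float, total: float) -> float:
    """Diarization Error Rate: fraction of total recording time that is
    false-alarm speech, missed speech, or speaker confusion."""
    return (false_alarm + missed + confusion) / total

# 60-minute call: 1 min of false-alarm speech, 2 min missed, and 3 min where
# a language switch was misread as a speaker change (confusion) -> 10% DER.
print(der(false_alarm=1.0, missed=2.0, confusion=3.0, total=60.0))  # 0.1
```

Note that the language-switch failure mode shows up in the confusion term: a few minutes of misattributed turns per hour is enough to breach the 10% production target discussed below.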

Typical production targets include WER below 5% for your primary language on realistic audio and DER below 10% for multi-speaker calls. Gladia’s latest benchmarks evaluate 8 STT providers across 7 datasets and 74 hours of audio, including Mozilla Common Voice and Google FLEURS, providing a clearer reference point for validating vendor claims.

What this costs at scale

The pricing architecture matters as much as the technical architecture when you're modeling unit economics at 10,000 hours per month. Stacked add-on pricing makes costs harder to forecast at volume because each separately metered feature adds a fixed per-minute or per-token charge on top of the others, and those incremental line items sum across your total usage.

The table below models total cost at three volumes with a standard feature set (diarization, NER, sentiment analysis, summarization) enabled, using the Gladia pricing page, the AssemblyAI pricing breakdown, and the Deepgram pricing breakdown:

| Monthly volume | Gladia Growth async (all features included) | AssemblyAI (base + 4 add-ons) | Deepgram (base rate only, add-ons extra) |
| --- | --- | --- | --- |
| 100 hours | $20.00 | ~$43.00 | ~$25.80 base + variable add-ons |
| 1,000 hours | $200.00 | ~$430.00 | ~$258.00 base + variable add-ons |
| 10,000 hours | $2,000.00 | ~$4,300.00 | ~$2,580.00 base + variable add-ons |

AssemblyAI notes: Base rate is $0.15/hr. Adding sentiment analysis ($0.02/hr), summarization ($0.03/hr), entity detection ($0.08/hr), and topic detection ($0.15/hr) brings the effective all-feature rate to approximately $0.43/hr, which is what the figures above reflect. Pricing verified as of Q1 2024.

Deepgram notes: Base rate is approximately $0.258/hr ($4.30/1,000 min). Audio intelligence features (sentiment analysis, topic detection, summarization) bill per token rather than per minute, so the true all-feature cost is both higher than the figures above and harder to forecast before you process the audio.

At 10,000 hours with Deepgram, your actual bill depends on the token count of your transcripts for intelligence features, which you can't reliably project in advance.

Gladia notes: Gladia's Growth plan at $0.20/hr async is the lowest all-inclusive rate in the comparison. At 10,000 hours, the total is $2,000 with diarization, sentiment analysis, NER, summarization, code-switching, and translation included. No feature adds a separate line item to the bill. The Starter plan at $0.61/hr offers the same feature set on a pay-as-you-go basis with no volume commitment, making it the entry point for teams validating accuracy before committing to volume.

At every volume tier, Gladia's all-inclusive model produces a lower total cost than competitors once diarization and audio intelligence features are counted in.
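The comparison above can be reproduced with a simple per-hour cost model using the rates cited in the notes (Gladia $0.20/hr all-inclusive, AssemblyAI $0.15/hr base plus four add-ons, Deepgram ~$0.258/hr base). Deepgram's per-token intelligence features are deliberately left out, since they cannot be priced per hour in advance.

```python
# AssemblyAI per-hour add-on rates from the pricing notes above.
ASSEMBLYAI_ADDONS = {"sentiment": 0.02, "summarization": 0.03,
                     "entity_detection": 0.08, "topic_detection": 0.15}

def monthly_cost(hours: float, base_rate: float, addon_rates=()) -> float:
    """Total monthly bill under per-hour metered pricing: base rate plus
    any separately metered add-ons, each multiplied by usage hours."""
    return round(hours * (base_rate + sum(addon_rates)), 2)

for hours in (100, 1_000, 10_000):
    gladia = monthly_cost(hours, base_rate=0.20)                      # all-inclusive
    assembly = monthly_cost(hours, 0.15, ASSEMBLYAI_ADDONS.values())  # base + 4 add-ons
    deepgram = monthly_cost(hours, 0.258)                             # base only
    print(f"{hours:>6} hrs  Gladia ${gladia}  AssemblyAI ${assembly}  Deepgram ${deepgram}+")
```

Running the loop reproduces the table: each separately metered add-on is a fixed per-hour increment, so stacked pricing scales linearly with volume and the gap widens in absolute dollars as usage grows.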

Approaches to handling multilingual audio

The unified framework approach

The alternative to cascaded LID-plus-monolingual-ASR is a unified end-to-end multilingual model that processes multiple languages simultaneously without a routing step. The tokenizer vocabulary covers all supported languages, so the model handles language transitions within a single inference pass rather than handing off between components.

The practical outcome: no latency gap between LID classification and ASR inference, and no routing error when the dominant language classification is wrong. When evaluating vendors, ask directly whether their multilingual model uses a unified end-to-end architecture or a cascaded LID-plus-routing system, because the latency, accuracy, and cost implications differ significantly.

How Gladia handles code-switching in production

Native code-switching detection

Solaria-1, launched in January 2026, handles code-switching natively across 100+ supported languages, including 42 not available on competing APIs, among them Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi. These languages are spoken at high volume in BPO contact centers across Southeast Asia and South Asia, where English-to-local-language code-switching is a daily operational reality. Gladia's BPO use case page documents how teams in those environments deploy the API.

You enable code-switching through the code-switching API parameter, and automatic language detection handles the rest within the same inference pass. Each utterance segment in the API response carries a language tag and word-level timestamps, which downstream features consume directly without additional processing. Speaker diarization (powered by pyannoteAI's Precision-2 model via the Gladia diarization pipeline) runs alongside transcription, and sentiment analysis, NER, and summarization from the audio intelligence suite apply to the full multilingual transcript with no additional configuration. Partial transcripts arrive in under 103ms and final transcripts at 270ms latency for real-time pipelines, where latency budget is a constraint.
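A request enabling code-switching and diarization in one pass might look like the sketch below. Treat the endpoint URL, header name, and parameter names (`detect_language`, `enable_code_switching`, `diarization`) as assumptions to verify against Gladia's API reference; the point is that code-switching is a single flag on the request, not a separate routing pipeline.

```python
import json
import urllib.request

# Assumption: endpoint and header names should be checked against Gladia's docs.
API_URL = "https://api.gladia.io/v2/pre-recorded"

def build_request(audio_url: str, api_key: str) -> urllib.request.Request:
    """Build a transcription request with code-switching and diarization
    enabled in the same inference pass. Parameter names are assumptions."""
    payload = {
        "audio_url": audio_url,
        "detect_language": True,        # automatic language detection
        "enable_code_switching": True,  # transcribe mid-utterance switches
        "diarization": True,            # speaker labels alongside language tags
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-gladia-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("https://example.com/support-call.mp3", "YOUR_API_KEY")
# urllib.request.urlopen(req) would submit the job; the response's utterance
# segments would then carry per-segment language tags and word timestamps.
print(req.get_method(), json.loads(req.data)["enable_code_switching"])
```

Because every option rides on the same request, there is no routing layer to configure or separately meter, which is the architectural contrast with cascaded LID systems drawn above.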

Claap, a meeting intelligence platform, achieved 1-3% WER in production across multiple languages and transcribes one hour of video in under 60 seconds using Gladia's async transcription workflow. Xavier G. from Scoreplay describes the integration experience on G2:

"In less than a day of dev work we were able to release a state-of-the-art speech-to-text engine! We have tested it across many many languages (we work with commentators in pro sports around the world) and have found great accuracy even with custom fields such as team names, player names, etc. We have never come across any sort of hallucination." - Verified user review of Gladia

When you're modeling your ASR vendor decision at 10,000 hours per month, the difference between language identification and native code-switching can affect whether your cost model remains predictable when users switch languages mid-sentence. The free tier gives you 10 hours to run Gladia against your own code-switched audio, and multiple customers report sub-24-hour integration. Start with 10 free hours and test Solaria-1 on the audio your models actually fail on today.

FAQs

What's the exact technical difference between LID and code-switching detection?

LID identifies the dominant language of an audio segment and routes it to a monolingual ASR model. Code-switching detection transcribes multiple languages within a single utterance without a routing step, so it handles intrasentential code-mixing that LID routing structurally cannot.

What WER should I expect on code-switched audio vs. monolingual audio?

Monolingual models can experience significant WER degradation on code-switched audio compared to monolingual audio. Microsoft Research documented up to 6.3% relative WER improvement from adding code-switched training data to a previously monolingual model.

Does enabling code-switching in Gladia affect latency?

No. Solaria-1 handles code-switching natively within the same inference pass, so no additional routing or classification step adds to your latency budget. Partial transcript latency stays under 103ms and final transcript latency stays at 270ms regardless of whether code-switching is enabled.

Which languages does Gladia's code-switching support cover?

100+ languages supported by Solaria-1, including 42 not available on competing APIs, among them Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi.

Key terms

Language identification (LID): A pre-processing step that classifies the dominant language in an audio segment to route it to a monolingual ASR model. LID works reliably for single-language audio but cannot handle intrasentential language mixing.

Code-switching: Alternating between two or more languages within a conversation, either between sentences (intersentential) or within a single sentence (intrasentential). It's a naturally occurring feature of bilingual and multilingual speech communities.

Code-mixing: A subset of code-switching where lexical items from one language are inserted into the grammatical structure of another within the same sentence. Code-mixing is harder for cascaded LID systems because no sentence boundary triggers model re-routing.

Word Error Rate (WER): The standard ASR accuracy metric, calculated as (Substitutions + Deletions + Insertions) / Total Reference Words. WER degrades significantly on code-switched audio with monolingual models compared to matched-language audio.

Diarization Error Rate (DER): The percentage of total recording time incorrectly attributed to a speaker or to non-speech. DER increases on code-switched audio when speaker turn detection trains exclusively on monolingual data and interprets language switches as speaker changes.
