
Code-switching best practices: testing and quality assurance

Published on Apr 23, 2026 by Ani Ghazaryan

Monolingual WER testing misses code-switching failures. This guide shows how to catch multilingual ASR regressions with PIER, representative datasets, per-language thresholds, canary testing, and CI/CD gates before they reach production.

TL;DR:

  • Monolingual WER conceals 30–50% degradation on code-switched audio, shipping silent regressions to CRM entries, coaching scores, and meeting summaries.
  • Catching these regressions requires per-pair PIER measurement, not aggregate WER.
  • Build a CI pipeline that blocks on PIER threshold breaches for your highest-volume language pairs, using datasets like CS-FLEURS, LinCE, and SLR104.
  • Canary deployments and documented rollback policies backstop the CI gate.

If your CI pipeline only measures monolingual WER, code-switching regressions reach production silently. This is the code-switching blind spot. Code-switching, the practice of alternating between two or more languages within a single conversation, is a common communication mode for bilingual users worldwide. When your QA pipeline doesn't account for it, regressions corrupt CRM entries, coaching scores, and meeting summaries with no internal alert to flag them. This article breaks down how to build a QA pipeline that catches code-switching regressions before they reach production, from sourcing the right datasets to setting CI/CD failure policies that block broken builds automatically.

Code-switching's impact on legacy QA

Monolingual QA: a code-switching blind spot

Standard QA pipelines measure word error rate against a monolingual test set, and that number tells you almost nothing about how your system performs when a speaker switches languages mid-sentence. Monolingual ASR systems can see WER degradation of 30–50% on code-switched audio compared to clean monolingual input. The tokenizer produces [UNK] tokens at language boundaries, the transcript drops words, and real-time streams can stall at switch points. None of this shows up in your aggregate WER dashboard because the monolingual majority of your test set absorbs the errors: your pipeline reports a 5% WER across all audio and ships the build. Your Finnish-Swedish users, your Tagalog-English contact center agents, and your Hindi-English meeting participants all experience something far worse.

Detecting code-switching regressions

The core problem is metric mismatch. PIER (Point-of-Interest Error Rate) was developed specifically to address this gap. Unlike standard WER, PIER tags a set of "points of interest" (the embedded-language tokens at each language switch) and computes edit operations only against that subset, producing an error rate focused purely on code-switched segments. This targeted measurement reveals code-switching degradation that aggregate WER conceals.
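
As a simplified sketch of the idea, assuming a reference transcript whose embedded-language tokens are already tagged (the published PIER implementation handles alignment and insertions more carefully):

from difflib import SequenceMatcher

def pier(reference, hypothesis, poi_flags):
    """Error rate computed only over points of interest (PoIs).

    reference:  reference tokens
    hypothesis: hypothesis tokens
    poi_flags:  True where the reference token is an embedded-language
                token at a code-switch boundary
    """
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    errors = 0
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        # Count substitutions and deletions that touch a PoI token
        # (insertions have no reference index and are ignored in this sketch)
        errors += sum(1 for i in range(i1, i2) if poi_flags[i])
    total_poi = sum(poi_flags)
    return errors / total_poi if total_poi else 0.0

# Spanish matrix with English insertions; the PoIs are the English tokens
ref = ["necesito", "el", "report", "para", "el", "meeting", "de", "manana"]
hyp = ["necesito", "el", "reporte", "para", "el", "meeting", "de", "manana"]
poi = [False, False, True, False, False, True, False, False]
print(f"PIER: {pier(ref, hyp, poi):.2%}")  # 50.00%: one of two PoIs is wrong

Note that the same hypothesis scores a 12.5% WER over all eight words, which is how an aggregate metric hides a boundary failure.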

Production divergence from lab QA

Contact center audio includes hold music bleed, overlapping speakers, background noise, and callers using regional slang. Multilingual meeting transcription involves accented speech across multiple time zones on compressed video conferencing codecs. A test set that doesn't reflect those conditions will pass builds that fail in production, which is exactly how accuracy regressions surface through support tickets rather than internal alerts.

Building diverse test sets with natural code-switching patterns

Sourcing production-grade code-switching

Two open-source datasets provide a viable baseline for production-grade code-switching QA:

Dataset   | Language pairs                        | Audio type                                         | Access
CS-FLEURS | 52 languages, 113 unique pairs        | Human-read (synthetically generated text)          | ISCA Archive
LinCE     | 4 pairs (ES-EN, NE-EN, HI-EN, MSA-EA) | Text corpora (NLP tasks, useful for NER/entity QA) | Hugging Face
SLR104    | HI-EN and BN-EN (spoken tutorials)    | Conversational                                     | OpenSLR

CS-FLEURS covers 113 unique language pairs with real human voices reading synthetically generated code-switched sentences, providing human-validated accuracy and fluency. LinCE provides code-switched corpora across Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic pairs, making it useful for both ASR evaluation and downstream entity extraction testing.

Beyond these baselines, sample real user audio from production with explicit consent and annotate a representative subset. Custom multilingual annotation is expensive and slow, often months of specialist annotation time per 100 hours of audio, but the cost becomes easier to justify when you weigh it against the sprint cycles your engineering team spends debugging code-switching bugs after they reach production.

Synthetic vs. organic code-switching data

Synthetic data is faster and cheaper, but it can fail to replicate the acoustic and linguistic complexity of real code-switching. Concatenating audio segments from separate monolingual recordings may produce unnatural prosody and artifacts at splice points. Regional accents, localized slang, and conversational fillers tend to be underrepresented in synthetic datasets, so a model that passes synthetic tests can still fail on production audio from accented bilingual speakers. Use synthetic data to bootstrap coverage for low-resource pairs, but weight your regression suite toward organic samples for your highest-traffic language pairs.

Ground truth annotation for code-switched audio requires:

  • Bilingual annotators fluent in both languages (not just translators)
  • Word-level timestamps at each switch point
  • Language tags for every segment

Without language-tagged ground truth, you can't compute PIER and you're back to aggregate WER that hides the regressions you care about.
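
One way to structure such a record, sketched as an illustrative Python literal (the field names are assumptions, not a fixed standard):

# Illustrative ground-truth record for one code-switched utterance.
# The essentials are word-level timestamps, a language tag per segment,
# and explicit PoI markers for PIER computation.
utterance_ground_truth = {
    "audio_file": "cs_test_es_en_0042.wav",
    "matrix_language": "es",
    "segments": [
        {"text": "necesito el", "language": "es", "start": 0.00, "end": 0.74},
        {"text": "report",      "language": "en", "start": 0.74, "end": 1.10,
         "point_of_interest": True},  # embedded-language token at the switch
        {"text": "para manana", "language": "es", "start": 1.10, "end": 1.92},
    ],
    "switch_count": 2,  # es->en and en->es
    "annotator_id": "bilingual_annotator_07",
}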

Segmenting QA by language pair

Rank language pairs by usage volume

Pull your production audio metadata and rank language pairs by call volume or meeting frequency. Most products follow a power-law distribution where two or three pairs account for the majority of multilingual traffic and a long tail accounts for the rest. Segment your test suite to mirror that distribution, allocating the most test coverage and strictest thresholds to your highest-volume pairs.

Rank your language pairs by historical regression frequency and assign higher monitoring priority to pairs that have regressed before. Track error rates as a function of switching density (switches per minute) rather than just per language pair, and plot PIER against switches-per-minute to identify where production failures will concentrate.
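
A sketch of that density analysis, assuming you log per-file PIER and switch counts to an evaluation table (column names and values here are illustrative):

import pandas as pd

# Illustrative per-file evaluation log
runs = pd.DataFrame({
    "language_pair": ["es-en", "es-en", "hi-en", "hi-en", "tl-en"],
    "duration_min":  [3.2, 4.1, 2.8, 5.0, 3.5],
    "switch_count":  [4, 22, 6, 31, 12],
    "pier":          [0.05, 0.14, 0.07, 0.19, 0.09],
})

runs["switches_per_min"] = runs["switch_count"] / runs["duration_min"]

# Bucket by switching density to see where failures concentrate
runs["density_bucket"] = pd.cut(
    runs["switches_per_min"], bins=[0, 2, 5, float("inf")],
    labels=["light", "medium", "heavy"],
)
print(runs.groupby(["language_pair", "density_bucket"], observed=True)["pier"].mean())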

Define per-pair thresholds based on the baseline you measure from your production traffic sample. For example:

Language pair               | Share of multilingual traffic | PIER threshold (hard fail) | Overall WER threshold (hard fail)
ES-EN (top-volume pair)     | 45%                           | 8%                         | 6%
HI-EN                       | 20%                           | 10%                        | 7%
Long-tail pairs (<5% each)  | remainder                     | warn only                  | warn only

Values are illustrative. Calibrate every threshold against your own production baseline before enforcing them as CI gates.

Language switching often isn't symmetric. A Spanish-dominant speaker inserting English technical terms may behave differently from an English-dominant speaker inserting Spanish greetings. Design test cases for both directions within each pair: matrix-language-dominant with embedded-language insertions, and the reverse. A test suite covering only one direction will miss half the failure modes. Include samples at multiple density levels per pair to cover realistic usage patterns, from light switching (occasional insertions) to heavy switching (frequent alternations within sentences).
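
A small sketch of how to enumerate that direction-by-density test matrix, with illustrative pairs and density levels:

from itertools import product

# Illustrative pairs and density levels; swap in your own traffic data
pairs = [("es", "en"), ("hi", "en")]
densities = ["light", "medium", "heavy"]

test_cases = [
    {"matrix": matrix, "embedded": embedded, "density": density}
    for (a, b), density in product(pairs, densities)
    for matrix, embedded in [(a, b), (b, a)]  # cover both switching directions
]
print(len(test_cases))  # 2 pairs x 3 densities x 2 directions = 12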

Automated checks for code-switching errors

Manual review doesn't scale for regression detection. Automate your PIER calculation in every model evaluation run and alert on threshold breaches. Log the following for each run:

  • Overall WER
  • Per-language WER for each language in the pair
  • PIER at each code-switch point
  • Diarization error rate (DER) where speaker attribution is involved

Comparing these numbers across model versions gives you a regression signal before any build reaches staging.
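
A minimal sketch of that comparison, assuming each run logs its metrics to a flat dict (the values and regression limits are illustrative):

# Metrics logged per evaluation run; values are illustrative
baseline  = {"wer": 0.052, "wer_es": 0.048, "wer_en": 0.055, "pier": 0.071, "der": 0.062}
candidate = {"wer": 0.051, "wer_es": 0.047, "wer_en": 0.056, "pier": 0.103, "der": 0.063}

# Maximum allowed absolute regression per metric; calibrate to your baseline
MAX_REGRESSION = {"wer": 0.005, "wer_es": 0.005, "wer_en": 0.005,
                  "pier": 0.010, "der": 0.010}

regressions = {
    metric: candidate[metric] - baseline[metric]
    for metric in baseline
    if candidate[metric] - baseline[metric] > MAX_REGRESSION[metric]
}

for metric, delta in regressions.items():
    print(f"REGRESSION: {metric} worsened by {delta:+.3f}")
# Here PIER regresses by +0.032 while aggregate WER improves slightly:
# exactly the pattern an aggregate-only dashboard would miss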

Validating multilingual code-switching with canaries

When deploying a model update, route a small canary percentage of real multilingual production traffic to the new version before full rollout. Track per-language WER and PIER on that canary cohort in real time. A PIER regression on your top-volume language pair is a stop signal.

For the async response schema, including the language detection fields returned by our endpoint, refer to the code-switching documentation and the automatic language detection guide.

Define your rollback policy before you need it. A practical approach: establish clear thresholds for PIER regressions on high-volume language pairs that trigger automatic rollback, with human review required before the build can be re-promoted. Less severe regressions on low-volume pairs should generate warnings and documented exceptions. Having this policy written down before an incident means the rollback decision doesn't become a negotiation during an outage.
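
A sketch of that policy as code, with hypothetical thresholds and rollback/warn hooks to adapt to your deployment tooling:

# Hypothetical canary gate; the pair list, threshold, and hooks are
# placeholders, not a fixed interface
HIGH_VOLUME_PAIRS = {"es-en", "hi-en"}  # pairs above your volume cutoff
HARD_FAIL_PIER_DELTA = 0.02             # absolute regression, illustrative

def evaluate_canary(canary_pier, baseline_pier, rollback, warn):
    """canary_pier / baseline_pier: {language_pair: PIER} for the canary cohort."""
    for pair, pier in canary_pier.items():
        delta = pier - baseline_pier[pair]
        if delta <= HARD_FAIL_PIER_DELTA:
            continue
        if pair in HIGH_VOLUME_PAIRS:
            # Automatic rollback; human review required before re-promotion
            rollback(reason=f"{pair} PIER regressed by {delta:+.3f}")
            return
        warn(f"{pair} (low volume) PIER regressed by {delta:+.3f}; document an exception")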

Integrating code-switching tests into CI/CD pipelines

Shift-left testing, moving evaluation earlier in the development cycle, reduces the cost of catching regressions from a production rollback to a CI failure. The earlier the check runs, the cheaper the fix.

MLOps continuous delivery frameworks treat model validation as a first-class CI stage alongside code validation. For code-switching specifically, model validation covers per-language WER and PIER thresholds in addition to standard schema and pipeline consistency checks. Define minimum sample sizes per language pair based on your traffic distribution and statistical confidence requirements. Reserve the full evaluation suite for pre-release gates rather than every pull request to keep core regression checks fast.

Accelerating multilingual code-switching QA

import gladia
import sys

# Initialize Gladia client
client = gladia.Client(api_key="YOUR_API_KEY")

# Submit async transcription with code-switching enabled
response = client.audio.transcribe(
    audio_url="https://your-test-bucket/cs_test_es_en_high_density.wav",
    language_config={
        "languages": ["es", "en"],
        "code_switching": True
    },
    diarization=True
)

# Evaluate PIER against threshold
cs_segments = [s for s in response.utterances if s.language_switch]

# evaluate_pier is a placeholder for your own PIER implementation; see
# https://github.com/enesyugan/PIER-CodeSwitching-Evaluation for reference
# code, or integrate with your evaluation framework
pier_score = evaluate_pier(cs_segments, ground_truth="cs_test_es_en_gt.json")

PIER_THRESHOLD = 0.08  # Example: adjust based on your baseline

if pier_score > PIER_THRESHOLD:
    print(f"FAIL: PIER {pier_score:.2%} exceeds threshold {PIER_THRESHOLD:.2%}")
    sys.exit(1)

print(f"PASS: PIER {pier_score:.2%} within threshold")

Not all regressions warrant a build block. Structure your policy in two tiers, as sketched in the code after the lists:

Block (hard fail):

  • PIER regression exceeding your defined threshold on any language pair accounting for more than 5% of monthly multilingual audio volume
  • Overall WER regression exceeding your baseline threshold on top language pairs
  • DER regression exceeding your established baseline on async test sets

Warn (soft flag):

  • Moderate PIER regressions on low-volume language pairs
  • Any new language pair showing elevated WER with limited test samples
  • Regression in high-density switching tests where baseline accuracy was already lower
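
One way to encode those two tiers as a CI gate, with an illustrative volume cutoff and regression limit:

import sys

# Two-tier CI gate sketch; calibrate both constants against your baseline
VOLUME_CUTOFF = 0.05     # pairs above 5% of monthly multilingual audio
PIER_DELTA_LIMIT = 0.02  # absolute PIER regression limit

def gate_build(results, volume_share):
    """results: {pair: PIER delta vs. baseline}; volume_share: {pair: fraction}."""
    blocks, warnings = [], []
    for pair, pier_delta in results.items():
        if pier_delta <= PIER_DELTA_LIMIT:
            continue
        if volume_share.get(pair, 0.0) > VOLUME_CUTOFF:
            blocks.append(f"{pair}: PIER regressed {pier_delta:+.3f}")
        else:
            warnings.append(f"{pair}: low-volume PIER regression {pier_delta:+.3f}")
    for w in warnings:
        print(f"WARN: {w}")
    if blocks:
        sys.exit("BLOCK: " + "; ".join(blocks))  # non-zero exit fails the CI stage

gate_build({"es-en": 0.031, "tl-en": 0.025}, {"es-en": 0.45, "tl-en": 0.02})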

Three failure modes to watch for at language boundaries

In practice, code-switching failures tend to cluster into three patterns observed in customer escalations. Each is worth testing for explicitly, even with per-pair thresholds in place.

  1. Wrong entity at a language boundary: CRM entries or NER outputs capture garbled text where the switch point fell.
  2. Dropped clause after a switch: ASR returns a partial transcript that cuts off the embedded-language segment.
  3. Correct words but wrong language tag: The transcript is accurate, but language metadata causes downstream routing or translation logic to fail.
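
These patterns lend themselves to targeted assertions. A sketch, assuming your transcription result exposes per-utterance text and language fields (the field names and expected-value structure are illustrative):

# Targeted checks for the three boundary failure modes; the utterance and
# expected-value shapes are illustrative, not a fixed schema
def check_boundary_failures(utterances, expected):
    failures = []
    transcript = " ".join(u["text"] for u in utterances)

    # 1. Wrong entity at a language boundary
    for entity in expected["entities"]:
        if entity not in transcript:
            failures.append(f"missing or garbled entity: {entity!r}")

    # 2. Dropped clause after a switch (tolerance here is illustrative)
    if len(transcript.split()) < 0.9 * expected["word_count"]:
        failures.append("transcript truncated after a switch point")

    # 3. Correct words but wrong language tag
    for u, lang in zip(utterances, expected["languages"]):
        if u["language"] != lang:
            failures.append(f"wrong language tag: got {u['language']}, want {lang}")

    return failures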

Get started with Solaria-1

Start with 10 free hours and have your integration in production in less than a day. Solaria-1 natively handles mid-conversation code-switching across 100+ languages, so you can reduce your manual test burden for multilingual language-pair coverage from day one.

FAQs

When should code-switching tests run in CI/CD?

Gate the full multi-pair test suite on merges to main and on release candidate builds. For feature branch PRs, run a fast smoke test with a smaller sample from your top language pairs and a looser threshold to keep cycle times manageable while still catching major breakages.

Can you test code-switching accuracy without pre-annotated ground truth?

Yes, by using an oracle model comparison approach: submit audio to a high-accuracy reference model, treat its output as a temporary baseline for a sample covering at least 200 code-switched utterances per language pair, and combine that with human review of 10–15% of sampled code-switched utterances until annotated ground truth is available. Google Cloud's MLOps guidance recommends canary deployments as the primary validation mechanism when labeled data is incomplete.

What is PIER and why does it matter more than WER for code-switching?

PIER (Point-of-Interest Error Rate) is a variant of WER that measures edit operations only at code-switch points, specifically the embedded-language tokens where language boundaries occur. Standard WER averages errors across all words, which lets a model score 5% overall while producing disproportionately high error concentrations at language switch points, the exact pattern PIER was designed to expose.

What is a defensible WER threshold for blocking a code-switching regression in CI/CD?

Define hard-block thresholds based on your established baseline and downstream impact tolerance. Set absolute or relative regression limits for PIER on high-volume language pairs and overall WER on monolingual segments, with both values calibrated against your own production baseline rather than applied as fixed industry standards.

How does Solaria-1 handle mid-conversation code-switching?

We built Solaria-1 to handle true mid-conversation code-switching across 100+ languages without requiring a language parameter reset between segments or a separate language identification step. On conversational speech, we show up to 29% lower WER than competing APIs, evaluated across 8 providers and 74+ hours of audio with an open methodology.

Key terms glossary

Word error rate (WER): The standard metric for ASR accuracy. Computed as the number of substitutions, deletions, and insertions required to convert the hypothesis transcript into the reference transcript, divided by the total number of words in the reference. WER averages errors across all words in the test set, which causes it to mask localized failures such as those at language switch points in code-switched audio.

DER (Diarization error rate): The standard metric for speaker attribution accuracy. Measures the proportion of audio incorrectly assigned to a speaker, combining missed speech, false alarm speech, and speaker confusion errors. Relevant in multi-speaker code-switching scenarios where language switches and speaker turns can coincide.

Code-switching: The practice of alternating between two or more languages within a single conversation or utterance. Common among bilingual speakers and contact center agents serving multilingual customer bases. Distinct from language switching between separate conversations.

Point of interest (PoI): In PIER evaluation, a PoI is a token from the embedded language at a code-switch boundary, the specific word or sub-word unit where the speaker transitions from the matrix language. PoIs are the evaluation targets PIER uses instead of the full word sequence.

CI/CD (Continuous integration / continuous delivery): A software delivery practice in which code changes are automatically built, tested, and prepared for release on every commit. In the context of this article, CI/CD gates include ASR model evaluation stages that enforce per-language-pair WER and PIER thresholds before a build can progress to staging or production.

Canary deployment: A release strategy in which a new model version is exposed to a small percentage of live production traffic before full rollout. Canary metrics (per-language WER, PIER on high-volume pairs) are monitored in real time, and a regression beyond a defined threshold triggers rollback before the broader user base is affected.

Shift-left testing: The practice of moving quality checks earlier in the development lifecycle, from post-deployment monitoring into CI/CD gates or even pre-commit hooks. Applied to ASR, shift-left means running per-pair PIER checks on every merge to main rather than waiting for support tickets from bilingual users to surface regressions.
