Most product leaders evaluating STT vendors make the same mistake: they run a quick test on a handful of clean English recordings, pick the provider with the lowest stated WER, and call it done. That approach works fine until the first customer from the Philippines calls in and the transcript returns garbled output, or a bilingual support agent switches from Spanish to English mid-sentence and the model silently drops half the utterance.
To build a defensible CCaaS product, you need a standardized way to measure transcription accuracy on your actual data. This guide breaks down the exact benchmark suite to run, from WER on accented speech to DER and code-switching fidelity, so you can predict true production performance and unit economics before committing to a vendor.
Why contact centers need standardized transcription benchmarks
The same model can show significantly different WER in a benchmark versus production, depending on audio quality, speakers, and evaluation conditions. Most publicly available datasets consist of clean, read audio with limited acoustic variability. Models that perform well on those datasets struggle with conversational speech carrying background noise, codec compression, and diverse accents. Vendor-reported numbers make this worse: "99% accuracy" tells you nothing about the language, noise floor, or whether it was measured on the vendor's own curated test set.
Flawed analytics from bad transcripts
Transcription errors aren't just transcript problems. They're downstream data corruption events. When an STT model makes a substitution error, every system reading that transcript may get the wrong input: the CRM can log the wrong intent, the coaching scorecard can mark the agent incorrectly, and the AI summary can send the wrong action item. As Cresta's contact center STT analysis documents, small speech-to-text errors compound into measurable degradation of insight quality across every downstream model built on that transcript.
The specific failure mode to watch is hallucination: fluent-sounding text that doesn't match what was said. A transcript showing "$12,000" when the caller said "$1,200" passes a human skim but breaks every downstream numeric operation.
Benchmarking production voice accuracy
Contact center calls arrive over 8 kHz narrowband codecs with open-plan office noise, overlapping speech, and accents spanning every region your customer base covers. Global teams also introduce code-switching: agents in Southeast Asia mix Tagalog with English, and teams in Latin America alternate Spanish and English within a single call.
Public datasets like Mozilla Common Voice and Google FLEURS provide a useful baseline for understanding model behavior across languages, but neither replicates telephony codec artifacts or spontaneous bilingual speech.
Validating transcription quality metrics
Before running a single test, align your team on which metrics matter and why. The table below maps each metric to its business impact.
| Metric | Definition | Business impact |
| --- | --- | --- |
| WER (overall) | (Substitutions + Deletions + Insertions) / Total reference words | Can corrupt summaries, CRM entries, and coaching scores |
| WER (per language) | WER calculated separately per language in your test set | Exposes accuracy gaps that aggregate WER may hide |
| DER | Total time with diarization errors / Total audio time | Wrong speaker attribution can affect agent performance scoring |
| Latency p99 | 99th percentile transcript delivery time | Affects SLA ceiling for post-call processing jobs |
| Code-switching accuracy | WER on utterances where language changes mid-sentence | Affects reliability in bilingual contact centers |
Core transcription accuracy (WER)
Word Error Rate measures transcription accuracy by counting substitutions, deletions, and insertions relative to a human-verified reference transcript, then dividing by the total word count of the reference.
WER = (S + D + I) / N x 100%
A concrete example: if the reference is "the balance is twelve hundred dollars" (6 words) and the hypothesis returns "the balance is twelve dollars" (5 words, with one deletion), WER is 1 edit divided by 6 reference words, or roughly 16.7%. That single deletion can corrupt a financial figure and affect any downstream automation that acts on it.
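The calculation can be sketched in a few lines of Python. This is a minimal illustration that counts edits with a word-level Levenshtein distance and assumes both transcripts are already normalized and whitespace-tokenized; the function names `word_errors` and `wer` are chosen for this example and are not taken from any particular evaluation library.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (edit_count, reference_word_count) via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edits needed to turn the reference words seen so far into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word present in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, returned as a fraction rather than a percentage."""
    errors, n = word_errors(reference, hypothesis)
    if n == 0:
        raise ValueError("reference transcript is empty")
    return errors / n


score = wer("the balance is twelve hundred dollars", "the balance is twelve dollars")
print(f"WER = {score:.1%}")  # one deletion over six reference words
```

Because insertions count as errors but do not add to the denominator, WER can exceed 100% on badly garbled output.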
Utterance Error Rate (UER) is a related metric measuring the percentage of complete utterances containing at least one error, regardless of how many errors appear within it. UER can be relevant when utterances are treated as atomic units for intent classification or entity extraction, where a single word error may affect the classification outcome.
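As a rough sketch, UER can be computed by checking each utterance for an exact word-level match. The helper name `utterance_error_rate` and the sample utterances are hypothetical, and the text is assumed to be normalized already.

```python
def utterance_error_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of (reference, hypothesis) pairs that differ in at least one word."""
    if not pairs:
        raise ValueError("no utterances to score")
    wrong = sum(1 for ref, hyp in pairs if ref.split() != hyp.split())
    return wrong / len(pairs)


# Illustrative utterances only, not real call data.
pairs = [
    ("cancel my subscription", "cancel my subscription"),
    ("the balance is twelve hundred dollars", "the balance is twelve dollars"),
    ("thank you for calling", "thank you for calling"),
    ("i want to upgrade my plan", "i want to upgrade my plan"),
]
print(f"UER = {utterance_error_rate(pairs):.0%}")  # 1 of 4 utterances contains an error
```

The same four utterances score a much lower WER (1 error across 19 reference words), which is why the two metrics answer different questions.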
Optimizing WER for accented voices
Accent-related WER degradation is the most common cause of the gap between lab benchmarks and production performance. Many STT models show lower accuracy on accented speech compared to native-speaker audio. When a caller with a Philippine, Indian, or Nigerian accent reaches your contact center, the model may encounter speech patterns it has seen less frequently at training time and accuracy can drop. Build accent-specific subsets from your production audio and calculate WER separately for each group to expose this gap before vendor selection.
WER for individual languages
"Supports 100 languages" is a marketing claim, not a performance guarantee. The WER you should care about is the WER for each language in your specific customer distribution, measured on your audio, not the vendor's test set. Some vendors may optimize their headline model for English while multilingual support quality varies across languages, with performance potentially degrading outside the most common European languages. If your contact center serves customers in Tamil, Urdu, Punjabi, or Marathi, test those languages explicitly. Code-switching accuracy and per-language WER diverge significantly between providers on non-Latin script languages.
Speaker diarization accuracy
Speaker diarization segments audio by speaker identity. For contact centers, it's the difference between a coaching scorecard that knows what the agent said versus what the customer said, and one that mixes them together.
DER = (False Alarm + Missed Speech + Speaker Confusion) / Total Audio Duration
DER measures total time with diarization errors as a percentage of total audio duration. A higher DER indicates more of the audio time is incorrectly labeled, which can make agent performance scores unreliable enough that QA teams stop trusting them.
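A minimal sketch of that formula, assuming the three error durations have already been measured against a reference annotation. The class and field names are hypothetical and the durations are made up for illustration. The denominator here follows the formula above (total audio duration); some scoring tools divide by total reference speech time instead, so check which convention your library uses before comparing numbers.

```python
from dataclasses import dataclass


@dataclass
class DiarizationErrors:
    false_alarm_s: float        # non-speech labeled as speech
    missed_speech_s: float      # speech the system failed to label
    speaker_confusion_s: float  # speech attributed to the wrong speaker


def der(errors: DiarizationErrors, total_audio_s: float) -> float:
    """DER as a fraction: summed error time over total audio duration."""
    if total_audio_s <= 0:
        raise ValueError("total audio duration must be positive")
    error_time = errors.false_alarm_s + errors.missed_speech_s + errors.speaker_confusion_s
    return error_time / total_audio_s


# Illustrative durations only.
call = DiarizationErrors(false_alarm_s=12.0, missed_speech_s=18.0, speaker_confusion_s=30.0)
print(f"DER = {der(call, total_audio_s=600.0):.1%}")  # 60 s of errors in a 600 s call
```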
Async workflows can produce better diarization than real-time streams because the model can access the full recording before making speaker assignments. Gladia's async diarization is powered by pyannoteAI's Precision-2 model, a specialized neural architecture trained specifically for speaker segmentation and clustering tasks. The full-context processing enables the model to analyze vocal characteristics across the entire recording, producing more accurate speaker boundaries and fewer attribution errors compared to streaming approaches that must make decisions incrementally.
Latency metrics: p50, p95, p99
P50 tells you what a typical request looks like. P95 tells you where the tail starts. P99 tells you the 99th percentile of response times, representing what the slowest 1% of requests experience. As OneUptime's latency percentile analysis documents, a large spread between median and p99 can indicate specific problems. For batch contact center analytics, understanding your p99 latency across peak call volume helps determine whether analytics jobs complete within desired processing windows. For real-time live-assist workflows, our Solaria-1 delivers final transcripts in approximately 300ms, but async remains the right architecture for deep analytics where full-context processing yields better accuracy and diarization.
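To make the percentiles concrete, here is a small sketch using the nearest-rank method on a list of per-request latencies. The `percentile` helper and the sample values are illustrative assumptions, not measurements from any provider.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up transcript delivery times in seconds for 20 requests.
latencies = [4.1, 4.3, 4.0, 4.6, 4.2, 4.4, 4.8, 4.5, 4.7, 4.9,
             5.0, 5.2, 5.1, 5.4, 5.3, 5.6, 5.5, 6.0, 9.5, 31.0]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies, p)} s")
```

In this sample the median sits under 5 seconds while p99 is 31 seconds, the kind of spread that a mean would blur. With only 20 requests p99 is simply the slowest request, so collect enough requests for the tail percentiles to be meaningful.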
Evaluating code-switching transcriptions
Code-switching occurs when a speaker alternates between two or more languages within a single conversation. It matters for contact center ASR because it maps directly to how bilingual agents and customers actually speak. Common failure modes include output that degrades once the speaker moves into the second language, and language detection that does not handle the secondary language effectively. These failures propagate silently through CRM and analytics pipelines before anyone traces wrong coaching data back to a language routing error.
How to test transcription accuracy on your own audio
A vendor evaluation that doesn't use your production audio isn't an evaluation. It's a demo. Here is how to build a test methodology that produces defensible results.
Defining your audio test set
Your test set should cover four dimensions:
- Acoustic conditions: Mix mobile, desktop, and telephony channels with various codec types and background noise.
- Language distribution: Build subsets for each language representing a meaningful share of your call volume.
- Speaker demographics: Include diverse agent and customer accents, including regional variants.
- Difficulty tiers: Categorize calls as clean (low noise, single speaker, single language), medium (moderate noise, two speakers, possible accents), and hard (high noise, overlapping speech, code-switching).
Mozilla Common Voice and Google FLEURS supplement language coverage gaps but don't replace your actual call audio.
Preparing reference data for WER
Ground truth transcripts should be human-verified by transcribers fluent in each test language, producing verbatim references including filler words and overlapping speech. Avoid using another STT vendor's output as ground truth. Inconsistencies in reference data quality can affect WER calculations and may lead to underestimating error rates.
Run call accuracy benchmarks
Send each audio file to every provider via their production API using default settings, with no custom tuning or prompt engineering. Apply text normalization consistently to both hypothesis and reference transcripts before WER computation, as providers may differ in how they handle capitalization, punctuation, and number formatting. Our open benchmark methodology provides a reproducible framework. Record p50, p95, and p99 latency per provider, calculate WER and DER using an open-source evaluation library, and segment all results by acoustic tier, language, and accent group.
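The normalization step might look like the sketch below. It is a deliberately simple assumption of what "consistent normalization" covers (case, punctuation, thousands separators, whitespace), and the `normalize` function is hypothetical. It does not reconcile spelled-out and numeric forms such as "twelve hundred" versus "1200"; that needs a dedicated number normalizer applied identically to both sides.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply the same light normalization to reference and hypothesis text."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Join digit groups split by thousands separators: "1,200" -> "1200"
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    # Replace punctuation and symbols with spaces, keeping word characters and apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace
    return " ".join(text.split())


print(normalize("Hello, Mr. Smith! Your balance is $1,200."))
```

Whatever rules you choose, run every provider's output and every reference through the same function so that formatting differences do not show up as word errors.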
Validating call accuracy benchmarks
A flat WER across your entire test set can hide important performance variations. Break results down along three axes:
- WER by difficulty tier: An aggregate WER that combines clean and hard calls may mask significant differences in performance across acoustic conditions.
- WER by language: Per-language breakdowns expose vendors who have optimized for English and added multilingual support as a thin wrapper.
- DER by speaker count: Diarization error rates may vary with the number of speakers. Test DER at realistic speaker counts if your contact center handles conference escalations.
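The breakdown above can be sketched as a small grouping function. It assumes each scored call is a record holding its edit count, its reference word count, and its segment labels; the field names and the numbers are hypothetical. Errors and words are pooled per segment before dividing, so long calls are not underweighted the way they would be if per-call WERs were averaged.

```python
from collections import defaultdict


def wer_by_segment(records: list[dict], key: str) -> dict[str, float]:
    """Pool errors and reference words per segment value, then divide."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for record in records:
        errors[record[key]] += record["errors"]
        words[record[key]] += record["ref_words"]
    return {segment: errors[segment] / words[segment] for segment in words if words[segment] > 0}


# Illustrative per-call scores, not real benchmark results.
records = [
    {"tier": "clean", "language": "en", "errors": 6, "ref_words": 400},
    {"tier": "clean", "language": "es", "errors": 10, "ref_words": 400},
    {"tier": "hard", "language": "en", "errors": 30, "ref_words": 200},
    {"tier": "hard", "language": "es", "errors": 50, "ref_words": 200},
]

overall = sum(r["errors"] for r in records) / sum(r["ref_words"] for r in records)
print(f"overall: {overall:.1%}")
print("by tier:", {k: f"{v:.1%}" for k, v in wer_by_segment(records, "tier").items()})
print("by language:", {k: f"{v:.1%}" for k, v in wer_by_segment(records, "language").items()})
```

In this made-up sample the overall WER of 8% hides a 2% clean tier and a 20% hard tier. The same function works for accent groups or speaker counts by adding the corresponding label to each record.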
What good benchmark scores look like in production
Establishing your WER target scores
Based on production deployments and published research:
- WER below 3% is excellent for conversational speech and represents what top-tier production systems achieve on well-supported languages under moderate acoustic conditions.
- WER between roughly 3% and 5% is generally workable for most contact center analytics where downstream NLP systems can handle occasional errors.
- WER above 10% typically degrades automation value and may require manual review before AI output is trusted.
For CCaaS products where transcripts power CRM population, coaching scores, and compliance audits, production teams typically aim for WER below 5% on moderate acoustic conditions.
Optimal latency for live transcription
For post-call analytics, latency is about processing windows, not milliseconds. A batch workflow that processes a one-hour call in under 60 seconds can meet most reporting requirements. What matters is p99 latency across peak call volume, because that determines whether overnight analytics jobs complete before the morning QA review. For real-time live-assist workflows, the budget tightens significantly. For deep analytics, async remains the better choice because full-context processing yields substantially better accuracy and diarization results.
Setting diarization accuracy benchmarks
A DER below 15% is the threshold for reliable speaker-labeled analytics. Above 15%, a meaningful fraction of coaching scores will be based on turns attributed to the wrong speaker. For two-speaker calls in production deployments, well-configured async diarization pipelines have achieved DER below 10% on clean audio and below 15% on audio with overlapping speech.
Common testing mistakes that invalidate benchmarks
Testing on clean studio audio
Evaluating primarily on clean, low-noise audio may not predict contact center performance. Models can show significantly higher WER in production than in benchmarks. The delta is explained largely by audio conditions. Include your most challenging audio in evaluations, as vendor performance differences often become most apparent under difficult acoustic conditions.
Testing with unrealistic language data
Testing only US English may not predict performance in a global rollout. Accuracy patterns across language families vary, but the core point is simple: language coverage counts are a feature list, not necessarily a performance guarantee. Ask every vendor to provide WER data on your specific target languages, and disqualify any vendor that can't produce it.
Why average latency is insufficient
Average latency can hide tail behavior that impacts production architectures. As OneUptime's latency percentile analysis explains, tail latency can reflect specific problem categories such as complex processing scenarios or load conditions. Measure and compare p95 and p99 explicitly, not just the mean.
Skewing accuracy on code-switching
Testing single-language files against a multilingual vendor may not reveal what happens when speakers switch. Build a dedicated code-switching subset, calculate WER separately on pre-switch and post-switch utterances, and compare the delta across vendors.
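One way to sketch the pre-switch versus post-switch comparison, assuming each utterance in the code-switching subset has been labeled by hand as "pre" or "post" and already scored for edits. The field names, vendor labels, and counts are hypothetical.

```python
def pooled_wer(utterances: list[dict]) -> float:
    """Total edits over total reference words for a group of utterances."""
    words = sum(u["ref_words"] for u in utterances)
    if words == 0:
        raise ValueError("no reference words in this group")
    return sum(u["errors"] for u in utterances) / words


def switch_delta(utterances: list[dict]) -> dict[str, float]:
    """WER before and after the language switch, plus the difference."""
    pre = pooled_wer([u for u in utterances if u["position"] == "pre"])
    post = pooled_wer([u for u in utterances if u["position"] == "post"])
    return {"pre": pre, "post": post, "delta": post - pre}


# Illustrative scores for two hypothetical vendors on the same subset.
results = {
    "vendor_a": [
        {"position": "pre", "errors": 10, "ref_words": 500},
        {"position": "post", "errors": 40, "ref_words": 500},
    ],
    "vendor_b": [
        {"position": "pre", "errors": 15, "ref_words": 500},
        {"position": "post", "errors": 20, "ref_words": 500},
    ],
}

for vendor, utterances in results.items():
    d = switch_delta(utterances)
    print(f"{vendor}: pre {d['pre']:.1%}, post {d['post']:.1%}, delta {d['delta']:+.1%}")
```

In this made-up data vendor_a has the better pre-switch WER but degrades far more after the switch, which an aggregate number over the whole subset would partly conceal.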
Gladia's results on the same benchmark suite
Realistic audio benchmark setup
We evaluated Solaria-1 against 8 STT providers across 7 datasets and 74 hours of audio using an open, reproducible methodology designed to avoid the selection bias that makes most vendor benchmarks untrustworthy. Full results are published in the open async STT benchmark and can be independently reproduced. Solaria-1 achieves on average 29% lower WER than alternatives on conversational speech and on average 3x lower DER in async diarization workflows.
In production, Aircall cut transcription time by 95% and now processes over 1 million calls per week through Gladia. The transcript serves as the foundational layer for search, AI summaries, sentiment analysis, agent coaching, and CRM webhooks from a single integration. When modeling cost, confirm whether diarization and NER are in the base rate or billed as add-ons, since per-feature fees compound at scale.
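The effect of add-on fees is easy to see with a small cost model. Every rate below is a placeholder chosen for illustration and is not any vendor's actual pricing; substitute the per-hour figures from the quotes you receive.

```python
def monthly_cost(hours: float, base_rate: float, addon_rates: dict[str, float]) -> float:
    """Total monthly cost when every add-on is billed per audio hour on top of the base rate."""
    return hours * (base_rate + sum(addon_rates.values()))


# Placeholder rates in dollars per audio hour (assumptions, not real prices).
base_rate = 0.50
addons = {"diarization": 0.10, "ner": 0.05}

for hours in (1_000, 5_000, 10_000):
    bundled = monthly_cost(hours, base_rate, {})
    with_addons = monthly_cost(hours, base_rate, addons)
    print(f"{hours:>6,} h/month: base only ${bundled:,.2f}, all features ${with_addons:,.2f}")
```

Running both columns for each vendor shows whether a lower headline rate survives once the features you actually need are switched on.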
Prove transcription accuracy on your data
The benchmark that matters most is the one you run on your own audio. Gladia's Starter plan includes 10 free hours per month, which is enough to run a complete evaluation across your test set before committing to any plan.
Call transcription benchmark checklist:
- Build test set across clean, medium, and hard acoustic tiers from actual production calls
- Include dedicated subsets for every language representing a meaningful share of call volume
- Build a code-switching subset if your contact center serves bilingual markets
- Commission 100% human-verified ground truth transcripts
- Normalize hypothesis and reference text before WER computation
- Calculate WER overall, WER by language, WER by difficulty tier
- Calculate DER on multi-speaker subsets
- Record p50, p95, and p99 latency across the full test set
- Evaluate code-switching WER on pre-switch and post-switch utterances separately
- Model total cost at 1,000, 5,000, and 10,000 hours per month with all features enabled
FAQs
What WER threshold is acceptable for contact center production use?
WER below 5% is acceptable for most contact center analytics, with below 3% being the target for high-accuracy requirements like compliance and CRM population. WER above 10% on medium-difficulty audio is a disqualifier for downstream AI automation because errors compound through every system reading the transcript.
How do I calculate diarization error rate?
DER combines three components (false alarm speech, missed speech, and speaker confusion) and divides total error time by total audio duration. A DER of 15% indicates that 15% of audio time experiences diarization errors across these error types.
How should I structure my audio test set for real contact center calls?
Draw calls from actual production audio across clean, medium, and hard acoustic conditions, and build dedicated subsets for each language representing a meaningful share of your call volume. Include both clean and code-switched examples to stress-test language detection routing as well as noise handling.
When should I re-test accuracy benchmarks?
Re-test when you expand into a new geographic market, when a vendor announces a model update, or when you change your telephony or codec infrastructure. A shift in support ticket volume related to transcription quality is also a signal worth responding to, particularly after a new language rollout.
Does testing on a small sample give reliable WER results?
A test set with limited per-language coverage will produce statistically unstable WER numbers, especially for languages underrepresented in your call volume. As a general guideline, aim for at least 1-2 hours of audio per language subset to reduce sampling variance, with more hours needed for languages with high internal variability (dialects, code-switching). Aggregate WER on a small clean sample will almost always understate production error rates because clean calls are overrepresented relative to the hard tier that drives real failures.
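One common way to see this instability, offered here as an illustration rather than as part of the methodology above, is a bootstrap confidence interval: resample utterances with replacement and watch how much the pooled WER moves. The function name and the sample data are hypothetical; each utterance is a pair of (edit count, reference word count).

```python
import random


def bootstrap_wer_ci(utterances: list[tuple[int, int]], n_resamples: int = 2000,
                     alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for pooled WER over (errors, ref_words) pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(utterances, k=len(utterances))
        words = sum(n for _, n in sample)
        if words > 0:
            scores.append(sum(e for e, _ in sample) / words)
    if not scores:
        raise ValueError("no reference words to score")
    scores.sort()
    lower = scores[int(alpha / 2 * len(scores))]
    upper = scores[int((1 - alpha / 2) * len(scores)) - 1]
    return lower, upper


# Illustrative data: the same error pattern at two sample sizes.
small_set = [(0, 10), (1, 12), (0, 8), (3, 9), (0, 11)]
large_set = small_set * 40

for name, data in (("small", small_set), ("large", large_set)):
    lower, upper = bootstrap_wer_ci(data)
    print(f"{name}: {len(data)} utterances, 95% interval {lower:.1%} to {upper:.1%}")
```

If two vendors' intervals overlap heavily on a language subset, the subset is too small to rank them, and adding audio is more useful than debating the point estimates.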
Key terms glossary
Word Error Rate (WER): The ratio of transcription errors (substitutions, deletions, and insertions) to the total number of words in the reference transcript, expressed as a percentage. A WER of 5% indicates roughly 5 errors for every 100 words in the reference transcript.
Diarization Error Rate (DER): A metric combining missed speech, false alarm speech, and speaker confusion, divided by total audio duration. It measures how accurately a system segments and attributes speech to the correct speaker.
Code-switching: The practice of alternating between two or more languages within a single conversation or utterance. For contact center ASR, it most commonly appears in bilingual markets such as Tagalog-English, Hinglish, or Spanish-English caller interactions.
Utterance Error Rate (UER): The percentage of complete utterances containing at least one transcription error, regardless of how many errors appear within that utterance. UER is most relevant when utterances are treated as atomic units for intent classification or entity extraction.
P99 latency: The 99th percentile of end-to-end response time across a set of API requests, meaning 99% of requests are faster than this value. P99 is the architectural ceiling to design against for SLA-bound processing pipelines.