

Call transcription accuracy benchmarks: What contact centers should measure

Published on June 5, 2026
by Ani Ghazaryan

TL;DR: Public STT benchmarks on clean English audio rarely predict how models perform on noisy, accented, multilingual contact center calls. To evaluate vendors properly, measure WER overall, WER per language and accent, DER, latency p50/p95/p99, and code-switching accuracy on your own production audio, not vendor test sets. Self-reported accuracy claims are meaningless without published methodology. Hidden per-feature fees for diarization and NER can compound significantly at scale compared to all-inclusive pricing models.

Most product leaders evaluating STT vendors make the same mistake: they run a quick test on a handful of clean English recordings, pick the provider with the lowest stated WER, and call it done. That approach works fine until the first customer from the Philippines calls in and the transcript returns garbled output, or a bilingual support agent switches from Spanish to English mid-sentence and the model silently drops half the utterance.

To build a defensible CCaaS product, you need a standardized way to measure transcription accuracy on your actual data. This guide breaks down the exact benchmark suite to run, from WER on accented speech to DER and code-switching fidelity, so you can predict true production performance and unit economics before committing to a vendor.

Why contact centers need standardized transcription benchmarks

The same model can show significantly different WER in a benchmark versus production, depending on audio quality, speakers, and evaluation conditions. Most publicly available datasets consist of clean, read audio with limited acoustic variability. Models that perform well on those datasets struggle with conversational speech carrying background noise, codec compression, and diverse accents. Vendor-reported numbers make this worse: "99% accuracy" tells you nothing about the language, noise floor, or whether it was measured on the vendor's own curated test set.

Flawed analytics from bad transcripts

Transcription errors aren't just transcript problems. They're downstream data corruption events. When an STT model makes a substitution error, every system reading that transcript may get the wrong input: the CRM can log the wrong intent, the coaching scorecard can mark the agent incorrectly, and the AI summary can send the wrong action item. As Cresta's contact center STT analysis documents, small speech-to-text errors compound into measurable degradation of insight quality across every downstream model built on that transcript.

The specific failure mode to watch is hallucination: fluent-sounding text that doesn't match what was said. A transcript showing "$12,000" when the caller said "$1,200" passes a human skim but breaks every downstream numeric operation.

Benchmarking production voice accuracy

Contact center calls arrive over 8 kHz narrowband codecs with open-plan office noise, overlapping speech, and accents spanning every region your customer base covers. Global teams also introduce code-switching: agents in Southeast Asia mix Tagalog with English, teams in Latin America alternate Spanish and English within a single call.

Public datasets like Mozilla Common Voice and Google FLEURS provide a useful baseline for understanding model behavior across languages, but neither replicates telephony codec artifacts or spontaneous bilingual speech.

Validating transcription quality metrics

Before running a single test, align your team on which metrics matter and why. The table below maps each metric to its business impact.

| Metric | Definition | Business impact |
|---|---|---|
| WER (overall) | (Substitutions + Deletions + Insertions) / Total reference words | Can corrupt summaries, CRM entries, and coaching scores |
| WER (per language) | WER calculated separately per language in your test set | Exposes accuracy gaps that aggregate WER may hide |
| DER | Total time with diarization errors / Total audio time | Wrong speaker attribution can affect agent performance scoring |
| Latency p99 | 99th percentile transcript delivery time | Affects SLA ceiling for post-call processing jobs |
| Code-switching accuracy | WER on utterances where language changes mid-sentence | Affects reliability in bilingual contact centers |

Core transcription accuracy (WER)

Word Error Rate measures transcription accuracy by counting substitutions, deletions, and insertions relative to a human-verified reference transcript, then dividing by the total word count of the reference.

WER = (S + D + I) / N × 100%

A concrete example: if the reference is "the balance is twelve hundred dollars" (6 words) and the hypothesis returns "the balance is twelve dollars" (5 words, with one deletion), then S = 0, D = 1, I = 0, and WER = 1/6, or about 16.7%. That single deletion corrupts a financial figure and affects any downstream automation that acts on it.
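The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and text that has already been normalized; the function names `word_edit_counts` and `wer` are made up for this example and do not come from any particular evaluation library.

```python
def word_edit_counts(reference: str, hypothesis: str) -> tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for a minimal word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total_edits, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no new edit
                continue
            t, s, d, n = dp[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion)
    _, s, d, n = dp[-1][-1]
    return s, d, n


def wer(reference: str, hypothesis: str) -> float:
    """WER as a fraction: (S + D + I) / N, with N = reference word count."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcript is empty")
    s, d, i = word_edit_counts(reference, hypothesis)
    return (s + d + i) / n_ref


reference = "the balance is twelve hundred dollars"
hypothesis = "the balance is twelve dollars"
print(word_edit_counts(reference, hypothesis))  # (0, 1, 0)
print(f"{wer(reference, hypothesis):.1%}")       # 16.7%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.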

Utterance Error Rate (UER) is a related metric measuring the percentage of complete utterances containing at least one error, regardless of how many errors appear within it. UER can be relevant when utterances are treated as atomic units for intent classification or entity extraction, where a single word error may affect the classification outcome.
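As a rough sketch of how UER differs from WER, the snippet below counts an utterance as wrong if its word sequence differs from the reference at all. The function name and the sample utterances are invented for illustration, and the text is assumed to be normalized already.

```python
def utterance_error_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (reference, hypothesis) utterances with at least one word error."""
    if not pairs:
        raise ValueError("need at least one utterance")
    wrong = sum(1 for ref, hyp in pairs if ref.split() != hyp.split())
    return wrong / len(pairs)


# Hypothetical utterances: only the second one contains an error.
pairs = [
    ("i want to cancel my plan", "i want to cancel my plan"),
    ("my balance is twelve hundred dollars", "my balance is twelve dollars"),
    ("can you repeat that", "can you repeat that"),
    ("thank you", "thank you"),
]
print(utterance_error_rate(pairs))  # 0.25
```

Here one dropped word out of 17 is a WER of about 6%, yet a quarter of the utterances would reach an intent classifier with corrupted input.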

Optimizing WER for accented voices

Accent-related WER degradation is the most common cause of the gap between lab benchmarks and production performance. Many STT models show lower accuracy on accented speech compared to native-speaker audio. When a caller with a Philippine, Indian, or Nigerian accent reaches your contact center, the model may encounter speech patterns it saw less often during training, and accuracy can drop. Build accent-specific subsets from your production audio and calculate WER separately for each group to expose this gap before vendor selection.

WER for individual languages

"Supports 100 languages" is a marketing claim, not a performance guarantee. The WER you should care about is the WER for each language in your specific customer distribution, measured on your audio, not the vendor's test set. Some vendors may optimize their headline model for English while multilingual support quality varies across languages, with performance potentially degrading outside the most common European languages. If your contact center serves customers in Tamil, Urdu, Punjabi, or Marathi, test those languages explicitly. Code-switching accuracy and per-language WER diverge significantly between providers on non-Latin script languages.

Speaker diarization accuracy

Speaker diarization segments audio by speaker identity. For contact centers, it's the difference between a coaching scorecard that knows what the agent said versus what the customer said, and one that mixes them together.

DER = (False Alarm + Missed Speech + Speaker Confusion) / Total Audio Duration

DER measures total time with diarization errors as a percentage of total audio duration. A higher DER indicates more of the audio time is incorrectly labeled, which can make agent performance scores unreliable enough that QA teams stop trusting them.
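A minimal sketch of the DER arithmetic, following the formula as written here, with total audio duration as the denominator. Some scoring tools divide by reference speech time instead, so check which convention your evaluation library uses. The class name and the durations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiarizationErrors:
    """Error durations in seconds for one recording (hypothetical container)."""
    false_alarm_s: float        # non-speech labeled as speech
    missed_speech_s: float      # speech the system did not detect
    speaker_confusion_s: float  # speech assigned to the wrong speaker
    total_audio_s: float

    def der(self) -> float:
        if self.total_audio_s <= 0:
            raise ValueError("total audio duration must be positive")
        error_time = self.false_alarm_s + self.missed_speech_s + self.speaker_confusion_s
        return error_time / self.total_audio_s


# Hypothetical 10-minute call with 60 seconds of combined diarization errors.
call = DiarizationErrors(false_alarm_s=12, missed_speech_s=18,
                         speaker_confusion_s=30, total_audio_s=600)
print(f"{call.der():.0%}")  # 10%
```

Keeping the three components separate is useful in practice: high speaker confusion points at clustering, while high missed speech points at voice activity detection.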

Async workflows can produce better diarization than real-time streams because the model can access the full recording before making speaker assignments. Gladia's async diarization is powered by pyannoteAI's Precision-2 model, a specialized neural architecture trained specifically for speaker segmentation and clustering tasks. The full-context processing enables the model to analyze vocal characteristics across the entire recording, producing more accurate speaker boundaries and fewer attribution errors compared to streaming approaches that must make decisions incrementally.

Latency metrics: p50, p95, p99

P50 tells you what a typical request looks like. P95 tells you where the tail starts. P99 tells you the 99th percentile of response times, representing what the slowest 1% of requests experience. As OneUptime's latency percentile analysis documents, a large spread between median and p99 can indicate specific problems. For batch contact center analytics, understanding your p99 latency across peak call volume helps determine whether analytics jobs complete within desired processing windows. For real-time live-assist workflows, our Solaria-1 delivers final transcripts in approximately 300ms, but async remains the right architecture for deep analytics where full-context processing yields better accuracy and diarization.
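To make the percentile definitions concrete, here is a small sketch using the nearest-rank method, which is one of several common percentile definitions. The latency samples are invented to show how a mean can look healthy while p99 does not.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-request latencies in seconds: mostly fast, with a slow tail.
latencies = [1.0] * 90 + [2.0] * 8 + [9.0] * 2

mean = sum(latencies) / len(latencies)
print(f"mean={mean:.2f}s")
for pct in (50, 95, 99):
    print(f"p{pct}={percentile(latencies, pct)}s")
```

With these numbers the mean stays close to the median, while p99 lands on the 9-second outliers that would determine whether a batch window is met.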

Evaluating code-switching transcriptions

Code-switching occurs when a speaker alternates between two or more languages within a single conversation. It's foundational to contact center ASR because it maps directly to how bilingual agents and customers actually speak. Common failure modes include degraded output where the model struggles with the second language, and cases where language detection may not handle the secondary language effectively. These failures propagate silently through CRM and analytics pipelines before anyone traces wrong coaching data back to a language routing error.

How to test transcription accuracy on your own audio

A vendor evaluation that doesn't use your production audio isn't an evaluation. It's a demo. Here is how to build a test methodology that produces defensible results.

Defining your audio test set

Your test set should cover four dimensions:

  1. Acoustic conditions: Mix mobile, desktop, and telephony channels with various codec types and background noise.
  2. Language distribution: Build subsets for each language representing a meaningful share of your call volume.
  3. Speaker demographics: Include diverse agent and customer accents, including regional variants.
  4. Difficulty tiers: Categorize calls as clean (low noise, single speaker, single language), medium (moderate noise, two speakers, possible accents), and hard (high noise, overlapping speech, code-switching).
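One way to keep these four dimensions honest is to record them in a manifest and count coverage before you send anything to a vendor. The sketch below is an assumption about how such a manifest might look: the `Clip` fields and the tier rules are one possible reading of the clean, medium, and hard definitions above, not a standard.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """One test recording (hypothetical manifest schema)."""
    clip_id: str
    channel: str   # e.g. "telephony", "mobile", "desktop"
    language: str
    accent: str
    noise: str     # "low", "moderate", or "high"
    speakers: int
    overlapping_speech: bool = False
    code_switching: bool = False


def difficulty_tier(clip: Clip) -> str:
    """Assign a tier; any single hard condition makes the whole clip hard."""
    if clip.noise == "high" or clip.overlapping_speech or clip.code_switching:
        return "hard"
    if clip.noise == "low" and clip.speakers == 1:
        return "clean"
    return "medium"


def coverage(clips: list[Clip]) -> Counter:
    """Count clips per (language, tier) cell to spot empty or thin cells."""
    return Counter((c.language, difficulty_tier(c)) for c in clips)


clips = [
    Clip("c1", "telephony", "en", "US", "low", 1),
    Clip("c2", "mobile", "es", "Mexican", "moderate", 2),
    Clip("c3", "telephony", "tl", "Philippine", "moderate", 2, code_switching=True),
]
for cell, count in sorted(coverage(clips).items()):
    print(cell, count)
```

An empty cell in the coverage count, such as a language with no hard-tier clips, marks a part of production traffic the benchmark cannot speak to.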

Mozilla Common Voice and Google FLEURS supplement language coverage gaps but don't replace your actual call audio.

Preparing reference data for WER

Ground truth transcripts should be human-verified by transcribers fluent in each test language, producing verbatim references including filler words and overlapping speech. Avoid using another STT vendor's output as ground truth. Inconsistencies in reference data quality can affect WER calculations and may lead to underestimating error rates.

Run call accuracy benchmarks

Send each audio file to every provider via their production API using default settings, with no custom tuning or prompt engineering. Apply text normalization consistently to both hypothesis and reference transcripts before WER computation, as providers may differ in how they handle capitalization, punctuation, and number formatting. Our open benchmark methodology provides a reproducible framework. Record p50, p95, and p99 latency per provider, calculate WER and DER using an open-source evaluation library, and segment all results by acoustic tier, language, and accent group.
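As an illustration of the normalization step, the sketch below lowercases text, removes punctuation and symbols, and collapses whitespace, applying the same rule to reference and hypothesis. It is a simplified assumption of what a normalizer does: it does not convert spelled-out numbers ("twelve hundred") to digits or expand abbreviations, both of which a production normalizer usually needs.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and symbols (keeping apostrophes), collapse spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    kept = [
        ch if ch == "'" or unicodedata.category(ch)[0] not in "PS" else " "
        for ch in text
    ]
    return " ".join("".join(kept).split())


reference = "Hello, World!  It\u2019s 5 p.m."
hypothesis = "hello world it's 5 PM"
print(normalize(reference))   # hello world it's 5 p m
print(normalize(hypothesis))  # hello world it's 5 pm
```

The two outputs still differ on "p m" versus "pm", which shows why abbreviation and number handling should be decided once and applied to every provider's output before scoring.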

Validating call accuracy benchmarks

A flat WER across your entire test set can hide important performance variations. Break results down along three axes:

  • WER by difficulty tier: An aggregate WER that combines clean and hard calls may mask significant differences in performance across acoustic conditions.
  • WER by language: Per-language breakdowns expose vendors who have optimized for English and added multilingual support as a thin wrapper.
  • DER by speaker count: Diarization error rates may vary with the number of speakers. Test DER at realistic speaker counts if your contact center handles conference escalations.
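The breakdown above can be sketched as a grouped WER calculation. This is a minimal example under stated assumptions: rows are plain dictionaries with made-up field names, text is already normalized, and errors are pooled per group (total edits divided by total reference words) rather than averaged per call.

```python
from collections import defaultdict
from typing import Callable


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer_by_group(rows: list[dict], key: Callable[[dict], str]) -> dict[str, float]:
    """Pooled WER per group: total edits / total reference words."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for row in rows:
        ref, hyp = row["reference"].split(), row["hypothesis"].split()
        errors[key(row)] += edit_distance(ref, hyp)
        words[key(row)] += len(ref)
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Hypothetical scored calls.
rows = [
    {"tier": "clean", "reference": "please confirm your account number",
     "hypothesis": "please confirm your account number"},
    {"tier": "clean", "reference": "thank you for calling",
     "hypothesis": "thank you for calling"},
    {"tier": "hard", "reference": "i want to cancel my plan",
     "hypothesis": "i want to cancel plan"},
    {"tier": "hard", "reference": "the balance is twelve hundred dollars",
     "hypothesis": "the balance is twelve thousand dollars"},
]
overall = wer_by_group(rows, key=lambda r: "all")
by_tier = wer_by_group(rows, key=lambda r: r["tier"])
print({g: round(v, 3) for g, v in overall.items()})
print({g: round(v, 3) for g, v in by_tier.items()})
```

In this toy set the overall figure sits under 10% while the hard tier is close to 17%, which is the masking effect the first bullet describes. Passing a different `key` gives the per-language or per-accent view.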

What good benchmark scores look like in production

Establishing your WER target scores

Based on production deployments and published research:

  • WER below 3% is excellent for conversational speech and represents what top-tier production systems achieve on well-supported languages under moderate acoustic conditions.
  • WER between 3% and 5% is generally workable for most contact center analytics where downstream NLP systems can handle occasional errors.
  • WER above 10% typically degrades automation value and may require manual review before AI output is trusted.

For CCaaS products where transcripts power CRM population, coaching scores, and compliance audits, production teams typically aim for WER below 5% on moderate acoustic conditions.

Optimal latency for live transcription

For post-call analytics, latency is about processing windows, not milliseconds. A batch workflow that processes a one-hour call in under 60 seconds can meet most reporting requirements. What matters is p99 latency across peak call volume, because that determines whether overnight analytics jobs complete before the morning QA review. For real-time live-assist workflows, the budget tightens significantly, and async remains the better choice for deep analytics where full-context processing yields substantially better accuracy and diarization results.

Setting diarization accuracy benchmarks

A DER below 15% is the threshold for reliable speaker-labeled analytics. Above 15%, a meaningful fraction of coaching scores will rest on turns attributed to the wrong speaker. For two-speaker calls in production deployments, well-configured async diarization pipelines have achieved DER below 10% on clean audio and below 15% on audio with overlapping speech.

Common testing mistakes that invalidate benchmarks

Testing on clean studio audio

Evaluating primarily on clean, low-noise audio may not predict contact center performance. Models can show significantly higher WER in production than in benchmarks. The delta is explained largely by audio conditions. Include your most challenging audio in evaluations, as vendor performance differences often become most apparent under difficult acoustic conditions.

Testing with unrealistic language data

Testing only US English may not predict performance in a global rollout. Accuracy patterns across language families vary, but the core point is simple: language coverage counts are a feature list, not necessarily a performance guarantee. Ask every vendor to provide WER data on your specific target languages, and disqualify any vendor that can't produce it.

Why average latency is insufficient

Average latency can hide tail behavior that impacts production architectures. As OneUptime's latency percentile analysis explains, tail latency can reflect specific problem categories such as complex processing scenarios or load conditions. Measure and compare p95 and p99 explicitly, not just the mean.

Skewing accuracy on code-switching

Testing single-language files against a multilingual vendor may not reveal what happens when speakers switch. Build a dedicated code-switching subset, calculate WER separately on pre-switch and post-switch utterances, and compare the delta across vendors.
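The pre-switch versus post-switch comparison can be sketched as below. The example assumes you have already scored each utterance (edit count and reference word count) and tagged it as falling before or after the language switch; the vendor names and numbers are invented.

```python
from collections import defaultdict


def switch_delta(records: list[tuple[str, str, int, int]]) -> dict[str, dict[str, float]]:
    """records: (vendor, phase, edit_errors, reference_words), phase is "pre" or "post"."""
    totals = defaultdict(lambda: {"pre": [0, 0], "post": [0, 0]})
    for vendor, phase, errors, words in records:
        totals[vendor][phase][0] += errors
        totals[vendor][phase][1] += words
    results = {}
    for vendor, phases in totals.items():
        if phases["pre"][1] == 0 or phases["post"][1] == 0:
            raise ValueError(f"{vendor} is missing pre-switch or post-switch words")
        pre = phases["pre"][0] / phases["pre"][1]
        post = phases["post"][0] / phases["post"][1]
        results[vendor] = {"pre": pre, "post": post, "delta": post - pre}
    return results


# Hypothetical pooled counts per vendor and phase.
records = [
    ("vendor_a", "pre", 4, 100), ("vendor_a", "post", 6, 100),
    ("vendor_b", "pre", 3, 100), ("vendor_b", "post", 15, 100),
]
for vendor, r in switch_delta(records).items():
    print(vendor, f"pre={r['pre']:.0%} post={r['post']:.0%} delta={r['delta']:+.0%}")
```

In this made-up case vendor_b wins on pre-switch WER but loses badly once the language changes, a ranking flip that a single-language test would never surface.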

Gladia's results on the same benchmark suite

Realistic audio benchmark setup

We evaluated Solaria-1 against 8 STT providers across 7 datasets and 74 hours of audio using an open, reproducible methodology designed to avoid the selection bias that makes most vendor benchmarks untrustworthy. Full results are published in the open async STT benchmark and can be independently reproduced. Solaria-1 achieves on average 29% lower WER than alternatives on conversational speech and on average 3x lower DER in async diarization workflows.

In production, Aircall cut transcription time by 95%, and now processes over 1 million calls per week through Gladia. The transcript serves as the foundational layer for search, AI summaries, sentiment analysis, agent coaching, and CRM webhooks from a single integration. When modeling cost, confirm whether diarization and NER are in the base rate or billed as add-ons, since per-feature fees compound at scale.
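A quick way to see how add-on fees compound is to model monthly cost at a few volumes. Every rate below is a placeholder chosen for the arithmetic and is not any vendor's actual pricing, Gladia's included. Amounts are kept in integer cents to avoid rounding noise.

```python
from typing import Optional

VOLUMES_HOURS = (1_000, 5_000, 10_000)


def monthly_cost_cents(hours: int, base_cents_per_hour: int,
                       addon_cents_per_hour: Optional[dict[str, int]] = None) -> int:
    """Monthly cost when every listed add-on is enabled on every hour of audio."""
    addons = sum((addon_cents_per_hour or {}).values())
    return hours * (base_cents_per_hour + addons)


# Hypothetical plans: one all-inclusive rate vs. a lower base rate plus add-ons.
ALL_INCLUSIVE_CENTS = 60
PER_FEATURE_BASE_CENTS = 45
PER_FEATURE_ADDONS = {"diarization": 15, "ner": 10}

for hours in VOLUMES_HOURS:
    inclusive = monthly_cost_cents(hours, ALL_INCLUSIVE_CENTS)
    per_feature = monthly_cost_cents(hours, PER_FEATURE_BASE_CENTS, PER_FEATURE_ADDONS)
    print(f"{hours:>6} h  all-inclusive ${inclusive / 100:,.0f}  "
          f"per-feature ${per_feature / 100:,.0f}  gap ${(per_feature - inclusive) / 100:,.0f}")
```

The plan with the lower headline rate ends up more expensive once both features are on, and the gap grows linearly with volume.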

Prove transcription accuracy on your data

The benchmark that matters most is the one you run on your own audio. Gladia's Starter plan includes 10 free hours per month, which is enough to run a complete evaluation across your test set before committing to any plan.

Call transcription benchmark checklist:

  • Build test set across clean, medium, and hard acoustic tiers from actual production calls
  • Include dedicated subsets for every language representing a meaningful share of call volume
  • Build a code-switching subset if your contact center serves bilingual markets
  • Commission 100% human-verified ground truth transcripts
  • Normalize hypothesis and reference text before WER computation
  • Calculate WER overall, WER by language, WER by difficulty tier
  • Calculate DER on multi-speaker subsets
  • Record p50, p95, and p99 latency across the full test set
  • Evaluate code-switching WER on pre-switch and post-switch utterances separately
  • Model total cost at 1,000, 5,000, and 10,000 hours per month with all features enabled

Start with 10 free hours and run your own benchmark on production audio.

FAQs

What WER threshold is acceptable for contact center production use?

WER below 5% is acceptable for most contact center analytics, with below 3% being the target for high-accuracy requirements like compliance and CRM population. WER above 10% on medium-difficulty audio is a disqualifier for downstream AI automation because errors compound through every system reading the transcript.

How do I calculate diarization error rate?

DER sums three components (false alarm speech, missed speech, and speaker confusion) and divides that total error time by total audio duration. A DER of 15% means 15% of the audio time is mislabeled in one of these three ways.

How should I structure my audio test set for real contact center calls?

Draw calls from actual production audio across clean, medium, and hard acoustic conditions, and build dedicated subsets for each language representing a meaningful share of your call volume. Include both clean and code-switched examples to stress-test language detection routing as well as noise handling.

When should I re-test accuracy benchmarks?

Re-test when you expand into a new geographic market, when a vendor announces a model update, or when you change your telephony or codec infrastructure. A shift in support ticket volume related to transcription quality is also a signal worth responding to, particularly after a new language rollout.

Does testing on a small sample give reliable WER results?

A test set with limited per-language coverage will produce statistically unstable WER numbers, especially for languages underrepresented in your call volume. As a general guideline, aim for at least 1-2 hours of audio per language subset to reduce sampling variance, with more hours needed for languages with high internal variability (dialects, code-switching). Aggregate WER on a small clean sample will almost always understate production error rates because clean calls are overrepresented relative to the hard tier that drives real failures.
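To see how unstable a small sample is, you can bootstrap a confidence interval around the pooled WER by resampling whole calls. The sketch below uses a percentile bootstrap with a fixed seed; the per-call counts are invented, and the function name is illustrative only.

```python
import random


def bootstrap_wer_interval(per_call: list[tuple[int, int]], n_resamples: int = 2000,
                           alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for pooled WER.

    per_call holds (edit_errors, reference_words) per call; each call needs words > 0.
    """
    if not per_call or any(words <= 0 for _, words in per_call):
        raise ValueError("need at least one call, each with a positive word count")
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(per_call, k=len(per_call))  # resample calls with replacement
        estimates.append(sum(e for e, _ in sample) / sum(w for _, w in sample))
    estimates.sort()
    lo = estimates[int(alpha / 2 * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-call results: 40 errors over 500 reference words, pooled WER 8%.
small = [(2, 50), (0, 40), (9, 60), (1, 45), (12, 55),
         (0, 30), (3, 50), (7, 65), (1, 35), (5, 70)]
large = small * 20  # same error profile, 20 times as many calls

for name, data in (("10 calls", small), ("200 calls", large)):
    lo, hi = bootstrap_wer_interval(data)
    print(f"{name}: 95% interval {lo:.1%} to {hi:.1%}")
```

If two vendors' intervals overlap heavily on a language subset, the test set is too small to rank them on that language.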

Key terms glossary

Word Error Rate (WER): The ratio of transcription errors (substitutions, deletions, and insertions) to the total number of words in the reference transcript, expressed as a percentage. A WER of 5% indicates approximately 5 errors for every 100 reference words.

Diarization Error Rate (DER): A metric combining missed speech, false alarm speech, and speaker confusion, divided by total audio duration. It measures how accurately a system segments and attributes speech to the correct speaker.

Code-switching: The practice of alternating between two or more languages within a single conversation or utterance. For contact center ASR, it most commonly appears in bilingual markets such as Tagalog-English, Hinglish, or Spanish-English caller interactions.

Utterance Error Rate (UER): The percentage of complete utterances containing at least one transcription error, regardless of how many errors appear within that utterance. UER is most relevant when utterances are treated as atomic units for intent classification or entity extraction.

P99 latency: The 99th percentile of end-to-end response time across a set of API requests, meaning 99% of requests are faster than this value. P99 is the architectural ceiling to design against for SLA-bound processing pipelines.
