Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Block quote

Ordered list

Item 1
Item 2
Item 3

Unordered list

Item A
Item B
Item C

Text link

Bold text

Emphasis

^Superscript

_Subscript

Pricing

Request a demo

Sign up

Get started

How decision intelligence improves customer service consistency in contact centers

TL;DR: Contact centers fail to deliver consistent service when routing infrastructure runs on static rules engines that cannot handle the complexity of real human conversation. Modern speech-to-text infrastructure addresses this by processing raw audio and feeding structured outputs to your CRM, using machine learning to analyze intent, sentiment, and speaker characteristics. Transcription accuracy sets the ceiling for every downstream action: a wrong word silently corrupts a CRM entry, a missed intent misfires a routing decision, and a misread sentiment score delays escalation. This playbook covers how to build and deploy that architecture without blowing your latency budget or your unit economics.

Speech-To-Text

Real-time speech analytics for live agent assist

TL;DR: Live agent assist only works when the transcription layer delivers partial results fast enough for downstream NLP to process within a sub-second window. If the pipeline exceeds 1,000ms total, prompts arrive after agents have already spoken, which inflates Average Handle Time and erodes agent trust. This playbook covers the full real-time pipeline architecture, from streaming transcription through intent analysis to agent desktop rendering, and shows how contact centers can expand QA coverage from a 1-3% manual sample to 100% of interactions without adding headcount.

Speech-To-Text

How to identify prospect companies from sales call transcripts

TL;DR: Most product teams try to run LLM extraction on raw, undiarized transcripts and end up with CRM records polluted by the sales rep's own company names, tools, and competitor mentions. The fix is an async-first pipeline that separates speaker dialogue before any entity extraction happens. This guide walks through a working Python and Claude API pipeline using our async transcription, pyannoteAI Precision-2 diarization, and Solaria-3 or Solaria-1 depending on your language mix, so you extract clean prospect-side signals and sync accurate data to your CRM.

What is Word Error Rate (WER): How it’s calculated, and why it can mislead

Published on March 23, 2026

By Ani Ghazaryan

What is Word Error Rate (WER): How it’s calculated, and why it can mislead

Word Error Rate (WER) is a metric that evaluates the performance of ASR systems by analyzing the accuracy of speech-to-text results. WER metric allows developers, scientists, and researchers to assess ASR performance. A lower WER indicates better ASR performance, and vice versa. The assessment allows for optimizing the ASR technologies over time and helps to compare speech-to-text models and providers for commercial use.

If you’re building anything with speech—meeting bots, voice assistants, call analytics—WER is the number you’re told to trust. Lower WER equals a better model. At least, that’s the assumption. In practice, it’s not that simple. The same model can report 5% WER in a benchmark and 25%+ in production, depending on audio quality, speakers, and evaluation setup. Worse, two providers can report similar WER while behaving very differently on your actual use case. WER isn’t wrong, but it’s incomplete without context, and often misinterpreted.

In this guide, we’ll break down what WER actually measures and how it’s calculated, but more importantly, how it behaves in practice. We’ll look at the evaluation pipeline behind it, the factors that cause scores to shift, and why results that look strong in benchmarks often don’t translate to production.

TL;DR:

Word Error Rate (WER) is the standard way to measure speech-to-text accuracy, but it only tells part of the story.
In practice, WER is highly sensitive to how evaluation is set up: the dataset, normalization rules, and scoring pipeline can all shift the result significantly.
To interpret WER correctly, it has to be tied to a reproducible evaluation setup and tested on data that reflects your actual use case.
Without that, it’s just a number useful for comparison, but not enough to judge real-world performance.

What is word error rate (WER)?

Automatic speech recognition systems convert spoken language into written text. They are used across a wide range of applications, including transcription, voice assistants, meeting summarization, and contact center analytics.

Historically, ASR systems were built as multi-stage pipelines, combining feature extraction, acoustic modeling, and language modeling as separate components. That architecture has largely been replaced by end-to-end neural systems. Modern models rely on large-scale pre-training and learn to map audio directly to text, often across multiple languages and domains. These systems are more flexible and more accurate than earlier approaches, but they are also more sensitive to evaluation conditions. Because they are trained on large and diverse datasets, their behavior varies significantly depending on input characteristics. This variability makes evaluation both more important and more complex.

WER attempts to provide a standardized way to measure transcription accuracy by comparing a model’s output to a reference transcript. It counts how many edits are needed to transform the predicted text into the reference text. These edits fall into three categories: substitutions, deletions, and insertions.

Mathematical formula

Where S represents substitutions, D deletions, I insertions, and N the total number of words in the reference. This formulation is straightforward and has remained unchanged for decades. The complexity arises not from the formula itself, but from how the inputs to that formula are prepared.

When WER works and when it doesn’t

In controlled settings, WER is a useful metric. It provides a clear, quantitative way to compare systems and track improvements over time. If two models are evaluated on the same dataset, with identical preprocessing and scoring, WER can reliably indicate which one produces fewer transcription errors. However, this reliability depends on a critical assumption: that the evaluation setup is consistent and representative. In practice, this assumption rarely holds.

Consider a simple example. A user says, “Call me at five five five zero one nine nine.” One system outputs “Call me at 5550199.” Another outputs the words as spoken. Depending on how numbers are normalized, one system may appear significantly more accurate than the other, even though both outputs are equally usable. This highlights a fundamental limitation: WER measures string differences, not meaning or usability. It treats all errors equally, regardless of their impact. A minor formatting difference is penalized in the same way as a critical semantic error.

In production systems, this mismatch becomes more pronounced. A transcript with a slightly higher WER may still be more useful if it preserves key information correctly. Conversely, a transcript with low WER may contain errors that break downstream applications.

WER as an output of a pipeline

WER is often presented as a single, objective score. In reality, it’s the result of a multi-step evaluation pipeline, and each step introduces assumptions that shape the final number. Before scoring even begins, transcripts are typically normalized. This preprocessing step can include:

Lowercasing text
Removing punctuation
Expanding abbreviations
Standardizing numbers and dates

These transformations may seem minor, but they can significantly change the computed WER. A model can appear more or less accurate depending on how aggressively normalization is applied.

Next comes alignment: the process of matching predicted words to reference words. This is not a trivial step. Insertions, deletions, and substitutions create ambiguity, and different alignment algorithms can produce different error counts for the same pair of transcripts.

Tokenization adds another layer of variability. Languages differ in how words are segmented, and decisions about token boundaries directly affect how errors are counted. This becomes especially important in multilingual settings or in languages with more complex morphology.

Because of these dependencies, WER is not inherently reproducible. Two teams evaluating the same model can report different WER values if their pipelines differ. Without full transparency into preprocessing, alignment, and scoring, comparisons become unreliable. In practice, a WER score is the output of a sequence of transformations rather than a direct property of the model itself. A typical evaluation pipeline looks like this:

Audio → ASR model → raw transcript→ text normalization → tokenization → alignment (Levenshtein) → error counting → WER

Because of this, WER is only meaningful when the entire pipeline is fixed and reproducible. Changing any component can change the score without changing the underlying model.

Factors that influence WER in practice

WER is often interpreted as a measure of model quality, but in practice, observed scores reflect both the model and the conditions under which it’s evaluated.

Training data plays a central role: Public datasets such as LibriSpeech or Common Voice are commonly used for benchmarking, but they represent only a narrow slice of real-world speech. They tend to consist of clean, read audio with limited variability. Models that perform well on these datasets may struggle when exposed to conversational or domain-specific speech.

Acoustic conditions introduce another layer of variability: Background noise, microphone quality, and compression artifacts all degrade signal quality and increase error rates. A model evaluated on studio-quality recordings may appear highly accurate, but its performance can drop significantly in environments such as call centers or mobile recordings.

Speaker variability further complicates evaluation: Differences in accent, speaking rate, and pronunciation affect recognition accuracy. In conversational settings, additional challenges arise from interruptions, overlapping speech, and incomplete sentences. These characteristics are difficult to capture in standard benchmarks.

Language itself adds complexity: Multilingual speech, code-switching, and domain-specific vocabulary all increase the difficulty of transcription. For example, a sentence that mixes languages within the same utterance may not align well with the assumptions of models trained on monolingual data.

These factors interact in non-trivial ways. A model that performs well in one domain may perform poorly in another, even if the overall WER appears competitive. This variability is one of the main reasons why benchmark results often fail to generalize.

Why benchmarking ASR systems is inherently difficult

Benchmarking attempts to provide a controlled environment for comparison, but it introduces its own limitations.

One challenge lies in the definition of ground truth: Reference transcripts are not always consistent. Differences in formatting, punctuation, or normalization can lead to different WER values. In some cases, ground truth data may contain errors, particularly in large or weakly supervised datasets.

Another issue is dataset representativeness: Benchmark datasets are often curated for research purposes and do not reflect production environments. They may exclude noisy audio, multi-speaker interactions, or domain-specific language. As a result, they provide an incomplete picture of system performance.

Benchmark design also introduces bias: Choices about which data to include, how to preprocess it, and how to score results all affect the outcome. Even small differences in normalization or filtering can produce noticeable changes in WER.

A more subtle issue is the lack of standardization: There is no universally accepted evaluation pipeline for ASR. Different organizations use different normalization rules, alignment strategies, and scoring implementations. This lack of consistency makes it difficult to compare results across systems.

Finally, benchmarking is resource-intensive. Building high-quality evaluation datasets and pipelines requires time, expertise, and infrastructure. Many teams rely on vendor-reported metrics as a result, even when those metrics are not fully transparent.

What WER looks like in a reproducible benchmark

To understand how WER behaves in practice, it helps to look at results from a controlled, reproducible evaluation setup.

In Gladia’s recent speech-to-text benchmarks, multiple leading STT APIs were evaluated under identical conditions across diverse datasets, including clean speech, multilingual audio, and long-form conversational recordings. The results highlight a key reality: WER is not a single number—it shifts significantly depending on context.

On clean datasets like Common Voice, top systems achieve WER in the ~3–7% range
On structured but real-world speech (e.g. VoxPopuli), WER drops further (~1.7–3.2%) due to cleaner references
On long-form, domain-specific audio (Earnings calls), WER increases to ~9–14%
On conversational speech (e.g. Switchboard), error rates can exceed 35–60%, driven by interruptions, disfluencies, and speaker variability

This variation is not an anomaly. It is the expected behavior of modern ASR systems. Even when models are evaluated with the same pipeline, no single provider consistently outperforms others across all datasets. Performance depends on how well a system handles specific conditions: noise, speaker overlap, language switching, and domain vocabulary.

This reinforces a practical reality: a single WER number, in isolation, says very little about real-world performance.

How WER behaves across real-world scenarios

ASR performance holds up well on clean, read speech, but starts to break down as conditions become more realistic. In conversational audio, people interrupt each other, speak in fragments, and vary their phrasing; add background noise, and the signal itself becomes harder to interpret. Multilingual and code-switching scenarios push models further, requiring them to switch between languages mid-sentence—something rarely reflected in standard benchmarks. Lower-quality audio introduces even more issues, from hallucinated words to repeated or missing segments, many of which aren’t fully captured by WER. On top of that, domain-specific vocabulary often leads to consistent errors, with a disproportionate impact on downstream tasks.

Beyond WER: evaluating what actually matters

WER is a useful baseline. But in production, it’s rarely what breaks your system. What matters isn’t whether a transcript is globally accurate. It’s whether it’s functionally correct for your use case. A single error in a phone number, a date, or a proper name can break an entire workflow, even if the rest of the transcript is flawless. WER treats that the same as a missing “the” or a punctuation difference. From a system perspective, those errors are not equal.

The same applies to meaning. A transcript that swaps “Tuesday” for “Thursday” may still score well on WER, but it’s objectively wrong. And in many applications, that’s a critical failure. This is why evaluation is shifting. Teams are moving beyond generic accuracy metrics toward measurements that reflect real usage: entity accuracy, character-level precision, and, increasingly, task-level outcomes like search relevance or data extraction quality. The question is no longer “How close is this to the reference?” It’s “Does this transcript actually work?”

The only way to answer that is to test models on real audio, not just benchmarks. That’s exactly what Gladia is built for: try it on your data and evaluate what actually matters.

FAQ

What is a good WER?
There is no universal threshold because WER depends heavily on audio conditions and use case. For clean, read speech, a WER below 10% is typically considered strong, while in noisy, conversational, or multi-speaker settings, 15–30% may still be acceptable. The key is to evaluate WER relative to your domain, not in isolation.

Is lower WER always better?
Not necessarily. WER measures surface-level differences, not meaning, so a lower score does not guarantee that the transcript is more useful. In many applications, preserving critical information such as names, numbers, or intent matters more than minimizing total word errors.

Why does WER vary between providers?
WER varies because providers often use different datasets, normalization rules, and evaluation pipelines. Even small differences in preprocessing or scoring can lead to significantly different results. Without a shared evaluation setup, WER comparisons are not directly comparable.

Can two systems have the same WER but different quality?
Yes. Two transcripts can have identical WER while differing in semantic accuracy or usefulness. For example, one may preserve key entities correctly while the other introduces critical errors that affect downstream tasks.

Should vendor benchmarks be trusted?
Vendor benchmarks can provide a directional signal, but they should be interpreted cautiously. They are only meaningful if the methodology, datasets, and evaluation pipeline are fully transparent and reproducible. The most reliable approach is to test systems on your own data under controlled conditions.

Glossary

ASR (Automatic Speech Recognition): Technology that converts spoken language into text, typically using end-to-end neural models trained on large-scale audio data.

WER (Word Error Rate): A metric that measures transcription accuracy by counting substitutions, deletions, and insertions relative to a reference transcript.

Normalization: The process of standardizing text before evaluation, including handling numbers, punctuation, casing, and formatting to ensure consistent comparison.

Ground truth: The reference transcript used for evaluation, ideally accurate and consistently formatted, though in practice it may contain variability or errors.

Contact us

Your request has been registered

A problem occurred while submitting the form.

From audio to knowledge

Subscribe to receive latest news, product updates and curated AI content.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

GDPR Compliant

HIPAA Compliant

AICPA SOC Type 2

ISO 27001 Compliant

Gladia

Newsletter

Become the Speech AI expert in your organization with content from Gladia right in your inbox, no more than twice a month.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

By continuing your navigation, you apply the use of cookies intended to improve the performance and the functionalities of this site.

No, thanks

Accept

From audio to knowledge

Subscribe to receive latest news, product updates and curated AI content.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

Heading 1

Heading 2

Heading 3

Heading 4

Heading 5

Heading 6

New model: Solaria-3

Test our real-time and async transcription

2026 Meeting Assistant Report

Read more

How decision intelligence improves customer service consistency in contact centers

Real-time speech analytics for live agent assist

How to identify prospect companies from sales call transcripts

What is Word Error Rate (WER): How it’s calculated, and why it can mislead

What is word error rate (WER)?

Mathematical formula

When WER works and when it doesn’t

WER as an output of a pipeline

Factors that influence WER in practice

Why benchmarking ASR systems is inherently difficult

What WER looks like in a reproducible benchmark

How WER behaves across real-world scenarios

Beyond WER: evaluating what actually matters

FAQ

Glossary

Contact us

Read more

From audio to knowledge

Subscribe to receive latest news, product updates and curated AI content.

Gladia

Newsletter

From audio to knowledge

Subscribe to receive latest news, product updates and curated AI content.