

How WER affects conversation intelligence and agent coaching

Published on June 19, 2026
by Ani Ghazaryan

TL;DR: Word Error Rate (WER) is the accuracy ceiling for every downstream feature including sentiment scoring, CRM enrichment, and compliance triggers. A 5% WER on a 5-minute call produces roughly 38 incorrect words, concentrated on product names, customer names, and compliance phrases your conversation intelligence stack depends on.

Your conversation intelligence tool does not have a hallucination problem. It has a transcription problem. Most product teams spend months tuning LLM prompts, refining summarization templates, and debugging action item extraction logic, while the transcription layer silently feeds those systems corrupted input. A single substitution error turns "can't" into "can" and inverts a sentiment score. A deleted phrase strips a compliance disclosure. A misrecognized name breaks a CRM lookup and writes a ghost record into your pipeline.

Word Error Rate (WER) sets the ceiling for everything your product does with audio. When the transcription layer misses a keyword, misattributes a speaker, or collapses under an accent, every subsequent CRM sync, sentiment score, and compliance flag fails with it. This article breaks down the exact mechanics of how WER propagates through conversation intelligence (CI) workflows, what thresholds matter for which use cases, and how to evaluate whether your current transcription layer is the root cause of the product failures you have been blaming on your LLM.

Decoding word error rate for conversation AI

Measuring word error rate (WER)

WER measures the minimum number of word-level edits required to convert an ASR hypothesis into the correct reference transcript, normalized by the total number of reference words. The formula is:

WER = (Substitutions + Deletions + Insertions) / Total Reference Words × 100%

A substitution occurs when the model outputs a wrong word. A deletion occurs when a word is missing from the output. An insertion occurs when the model adds a word that was never spoken. To illustrate how WER is calculated in practice, consider a simple example:

  • Reference: "the cat sat on the mat"
  • Hypothesis: "the hat sat mat"
  • Errors: 1 substitution ("hat" for "cat"), 2 deletions ("on the")
  • WER: 3 / 6 = 50%

This simplified example shows 50% WER for clarity, but production WER varies widely based on audio quality and conditions.
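The calculation above can be reproduced with a word-level edit distance. The following is a minimal Python sketch rather than a production scorer: the function name `wer_counts` is illustrative, and it assumes simple lowercasing and whitespace tokenization, whereas real evaluations usually apply more careful text normalization before scoring.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Word-level Levenshtein alignment with a per-error-type breakdown."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Each cell holds (total_edits, substitutions, deletions, insertions) for
    # turning the first i reference words into the first j hypothesis words.
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # words match, no edit
                continue
            t, s, d, n = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total_edits first, so a minimal path is kept.
            table[i][j] = min(substitution, deletion, insertion)

    total, subs, dels, ins = table[len(ref)][len(hyp)]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": total / len(ref),
    }


result = wer_counts("the cat sat on the mat", "the hat sat mat")
print(result)
```

For the reference and hypothesis used above, this reports 1 substitution, 2 deletions, 0 insertions, and a WER of 0.5.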

The anatomy of a WER error

Each error type hits conversation intelligence differently:

  • Substitution: Replaces a real word with a wrong word. This is the most dangerous error type for CI because it silently changes meaning. "Can't" becomes "can." Downstream sentiment, entity extraction, and CRM matching receive plausible but incorrect data with no signal that anything went wrong.
  • Deletion: A word disappears. For compliance workflows, a deleted word in a required legal disclaimer creates a documentation gap the system has no way to flag.
  • Insertion: The model adds a word that was never spoken, creating false action items, phantom entities, and inflated word counts in analytics dashboards.

Sentiment analysis breaks on substitutions. Compliance monitoring breaks on deletions. Named entity recognition is degraded by all three.

Real-time WER: the latency-accuracy tradeoff

Real-time transcription introduces a structural accuracy constraint. The model emits output in fragments before the full sentence is available, so it cannot use later context to resolve ambiguous phonemes or speaker transitions. As chunk latency increases, the model receives more surrounding phonetic context before committing to an output token, which reduces the number of ambiguous resolutions and lowers WER. That tradeoff is fundamental to how streaming inference works.

For most CI workflows, including meeting note-taking, post-call analytics, compliance review, and CRM enrichment, the full audio is available before processing starts. Async transcription processes complete recordings with full context, which improves WER, enables accurate speaker diarization, and produces more consistent output for multilingual audio. Waiting a few seconds or minutes for a post-meeting summary is acceptable when the output needs to be reliable.

The real cost of WER: 5% on a 5-minute call

38 errors: corrupting CI data

Conversational speech averages around 150 words per minute. Using this baseline, a standard 5-minute support call would contain approximately 750 words. At 5% WER, that call would generate roughly 38 incorrect words.

ASR models do not distribute errors evenly across filler words and articles. They concentrate failures on the words that carry the most meaning: proper nouns, product names, numbers, and compliance-specific terms. In a contact center call, those 38 errors fall on exactly the entities your CI stack is trying to extract.

A high-WER transcript fed to an LLM produces a summary that says "Terminate John Smith's account on the prize plan tier" when the agent said "We need to add John Smith to the enterprise plan." The LLM is not hallucinating in the technical sense but constructing a coherent interpretation of corrupted input.

Quantifying WER across daily calls

Scale the same math across monthly call volumes:

| Monthly audio volume | Total words processed | Incorrect words at 5% WER |
| --- | --- | --- |
| 1,000 hours | 9,000,000 | 450,000 |
| 10,000 hours | 90,000,000 | 4,500,000 |
| 50,000 hours | 450,000,000 | 22,500,000 |

At 10,000 hours per month, 5% WER produces roughly 4.5 million incorrect words that feed CRM records, compliance logs, coaching dashboards, and analytics pipelines. Those errors do not announce themselves. A wrong name writes a clean-looking CRM entry. A deleted phrase passes compliance review. A sentiment inversion scores an agent as positive on a call where they failed to resolve the issue.
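The arithmetic behind these figures is simple enough to script against your own volumes. The sketch below assumes the 150 words-per-minute baseline used in this article; the function name `incorrect_words` is illustrative, and actual speaking rates vary by call type.

```python
WORDS_PER_MINUTE = 150  # conversational baseline quoted in this article


def incorrect_words(audio_minutes: int, wer_percent: int) -> tuple[int, float]:
    """Return (total words, expected incorrect words) for a volume of audio."""
    total_words = audio_minutes * WORDS_PER_MINUTE
    return total_words, total_words * wer_percent / 100


# A single 5-minute call at 5% WER: 750 words, 37.5 expected errors (about 38).
print(incorrect_words(5, 5))

# The monthly volumes discussed above, also at 5% WER.
for hours in (1_000, 10_000, 50_000):
    total, wrong = incorrect_words(hours * 60, 5)
    print(f"{hours:>6,} hours: {total:>11,} words, {wrong:>12,.0f} incorrect")
```

Swapping in your own monthly hours and measured WER gives a quick estimate of how many incorrect words reach downstream systems.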

WER's impact on compliance monitoring

How WER skips key compliance triggers

Compliance monitoring in CCaaS platforms works by checking transcripts for required language: disclosure phrases, consent statements, regulatory notices. A deletion error on a required phrase produces a clean transcript with a missing sentence, and the monitoring system has no baseline to compare against. The disclosure was given but not recorded, so from the platform's perspective it never happened.

In financial services, a deleted phrase in a required risk disclosure creates a documentation gap the compliance system cannot flag because it has no reference point. The regulatory gap accumulates across thousands of calls before anyone notices.

Why WER substitutions cause false positives

Substitution errors create the opposite problem. A transcribed word that phonetically resembles a flagged compliance term triggers a QA review queue on a call that needed no escalation. At volume, this inflates manual review workload and erodes analyst trust in the automated flagging system.

Both failure modes, missed triggers and false positives, share the same root cause: a transcription layer that cannot reliably reproduce the words that were spoken.
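To make both failure modes concrete, here is a minimal sketch of a phrase-based compliance check. It is a toy, not how any particular CCaaS platform implements monitoring: the phrase lists, the example sentences, and the function names are all hypothetical.

```python
import re

# Hypothetical rules, for illustration only.
REQUIRED_PHRASES = ["this call may be recorded"]
FLAGGED_TERMS = ["refund"]


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation (keeping apostrophes), split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def contains_phrase(words: list[str], phrase_words: list[str]) -> bool:
    """True if phrase_words appears as a contiguous run inside words."""
    n = len(phrase_words)
    return any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1))


def review(transcript: str) -> dict:
    words = normalize(transcript)
    return {
        "missing_disclosures": [
            p for p in REQUIRED_PHRASES if not contains_phrase(words, normalize(p))
        ],
        "flagged_terms": [
            t for t in FLAGGED_TERMS if contains_phrase(words, normalize(t))
        ],
    }


# Deletion: the agent gave the disclosure, but ASR dropped the word "may".
print(review("This call may be recorded for quality purposes."))
print(review("This call be recorded for quality purposes."))

# Substitution: "fund" was transcribed as the flagged term "refund".
print(review("We fund the account tomorrow."))
print(review("We refund the account tomorrow."))
```

In both error cases the check behaves exactly as designed. It simply has no way to know that the transcript differs from what was said.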

WER errors drive compliance risk

The business risk compounds in regulated industries. When your transcription layer produces 5% WER on critical call types, your compliance monitoring is operating with corrupted evidence. Our compliance hub covers SOC 2 Type II, ISO 27001, HIPAA, and GDPR. For teams handling sensitive audio, PII redaction is available and must be explicitly configured in the API call. On Growth and Enterprise plans, customer audio is never used to retrain models, with no opt-out required.

Our infrastructure runs on clusters in both the EU and the US, so teams can keep audio data in the region required by their compliance framework.

"It's based in EU so it fits our GDPR compliance requirements... The product works great." - Robin L. on G2

Is WER skewing your customer sentiment?

Mismatched words skew sentiment data

Our sentiment analysis operates on the transcript output, which means sentiment scores have a direct dependency on transcription accuracy. A substitution error on a single negation word is enough to invert the signal entirely.

When "We can't fix this for you today" becomes "We can fix this for you today," sentiment analysis returns POSITIVE with high confidence on a call where the customer was told no. Your sentiment dashboard shows no error and logs the interaction as successful. The agent who failed to resolve the issue receives a positive coaching score. The pattern persists across similar calls until customer churn reveals the disconnect.
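The inversion is easy to reproduce even with a deliberately simple scorer. The sketch below is a toy lexicon-based classifier written only for illustration. It is not the sentiment model described in this article, and the word lists and the name `toy_sentiment` are assumptions.

```python
# Toy lexicon, for illustration only.
POSITIVE_WORDS = {"fix", "resolve", "help"}
NEGATORS = {"can't", "cannot", "won't", "not"}


def toy_sentiment(text: str) -> str:
    """Score positive words, flipping the sign when a negator precedes them."""
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATORS:
            negate = True
        elif word in POSITIVE_WORDS:
            score += -1 if negate else 1
            negate = False
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"


spoken = "We can't fix this for you today"
transcribed = "We can fix this for you today"  # one substitution: can't -> can

print(toy_sentiment(spoken))       # NEGATIVE
print(toy_sentiment(transcribed))  # POSITIVE
```

A single substituted token changes the label, and nothing in the output indicates that the input was wrong.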

Poor diarization skews sentiment

Even with 0% WER, misattributed speakers break sentiment analysis entirely. Our async diarization achieves 3x lower DER compared to alternatives.

DER measures how often the system misses speech, hallucinates speech, or assigns speech to the wrong speaker. When a high-DER system attributes the customer's "I am extremely frustrated with your service" to the agent, the coaching scorecard flags a disciplinary action on a call where the agent performed well. Speaker diarization is available in async workflows, where the system references the full conversation to resolve speaker boundaries accurately.
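The three components in that definition combine into a single ratio under the commonly used formulation of DER: missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. The sketch below assumes those durations have already been measured in seconds by an evaluation tool; the function name and the example numbers are illustrative, not benchmark results.

```python
def diarization_error_rate(
    missed: float, false_alarm: float, confusion: float, total_speech: float
) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Illustrative numbers for a call with 300 seconds of reference speech:
# 6 s of speech missed, 3 s of non-speech labeled as speech,
# 9 s attributed to the wrong speaker.
der = diarization_error_rate(missed=6, false_alarm=3, confusion=9, total_speech=300)
print(f"DER: {der:.1%}")
```

The confusion term is the one that moves a customer's words onto the agent's scorecard, so it is worth tracking separately from the aggregate figure.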

WER's effect on agent coaching metrics

When sentiment data is corrupted by transcription errors and diarization failures, agent coaching operates on a false baseline. Over time, coaching built on corrupted data degrades agent performance rather than improving it, because the feedback loop is broken at the source.

"Gladia delivers precise speech-to-text transcriptions with reliable timestamps, making it perfect for downstream tasks." - Verified user on G2

WER inaccuracies undermine CRM enrichment

CRM matching fails on transcribed names

Named Entity Recognition (NER) extracts people, organizations, dates, and contact information from transcripts and routes that data to CRM systems. NER accuracy has a hard dependency on WER: the entity extractor can only identify what the transcription layer produces.

When a name is transcribed incorrectly, the entity extractor either fails to match it against an existing CRM record, matches it to the wrong record, or creates a duplicate. "Contact Siobhan at her Dublin office" becomes "Contact Chevron at her Dublin office," and no amount of CRM deduplication logic corrects a phonetically plausible substitution.
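A short sketch shows why fuzzy matching does not rescue this case. It uses Python's standard `difflib.get_close_matches`; the contact records, the IDs, and the 0.8 cutoff are hypothetical choices for illustration, not a recommended CRM matching strategy.

```python
from difflib import get_close_matches

# Hypothetical CRM records keyed by lowercase full name.
CRM_CONTACTS = {"siobhan murphy": "C-1001", "sean walsh": "C-1002"}


def match_contact(first_name: str, cutoff: float = 0.8):
    """Return the contact ID whose first name is closest to the input, or None."""
    by_first_name = {full.split()[0]: cid for full, cid in CRM_CONTACTS.items()}
    hits = get_close_matches(
        first_name.lower(), list(by_first_name), n=1, cutoff=cutoff
    )
    return by_first_name[hits[0]] if hits else None


print(match_contact("Siobhan"))  # exact match
print(match_contact("Siobhen"))  # small spelling slip, still close enough
print(match_contact("Chevron"))  # phonetic substitution, no usable match
```

String similarity tolerates a one-letter slip, but a phonetically plausible substitution shares almost no characters with the intended name, so the lookup returns nothing or, with a looser cutoff, risks matching the wrong record.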

Lost product feedback and forecasting accuracy

Automated product feedback loops depend on keyword detection. A bug name, a feature request phrase, or a competitor mention that gets transcribed incorrectly never surfaces in the analytics dashboard. The product team's signal on what customers are saying in calls is distorted by the same 5% WER that corrupted the sentiment data.

At scale, corrupted CRM data degrades pipeline forecasting. Sales stages, deal values, and competitor mentions recorded from calls feed revenue models. When errors concentrate on exactly the entities that carry commercial meaning, the forecast goes wrong because its input data is corrupted, not because the model is flawed. The Audio-to-LLM pipeline that generates summaries and action items downstream receives the same corrupted input.

Inaccurate transcripts corrupt agent coaching

Call search and retrieval in CI platforms rely on keyword indexing from transcripts. If a compliance phrase, product name, or objection type was transcribed incorrectly, the call does not appear in search results for that term. A manager building a coaching library around how top agents handle a specific objection cannot find the relevant calls if the objection phrase was misrecognized.
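The retrieval failure follows directly from how keyword indexes are built. Below is a minimal inverted-index sketch, assuming plain whitespace tokenization; the call IDs and transcripts are made up for illustration, and real CI platforms use far richer search stacks.

```python
from collections import defaultdict


def build_index(transcripts: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of call IDs whose transcript has it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for raw in text.lower().split():
            index[raw.strip(".,!?")].add(call_id)
    return dict(index)


# Both customers said "enterprise plan"; call-2 was transcribed incorrectly.
TRANSCRIPTS = {
    "call-1": "The customer asked about the enterprise plan.",
    "call-2": "The customer asked about the prize plan.",
}

index = build_index(TRANSCRIPTS)
print(index.get("enterprise", set()))  # call-2 is missing from the results
```

The index is correct with respect to the transcripts it was given, which is why the missing call never shows up as an error.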

LLM-generated summaries are bounded by the quality of their input. A summary generated from a high-WER transcript produces a coherent paragraph that accurately reflects the corrupted transcript, not the actual conversation. This is the direct causal link between WER and what teams attribute to LLM hallucinations. One of the common implementation mistakes meeting assistant builders make is spending months iterating on prompts and LLM selection when the actual problem is that the transcript fed to the LLM is unreliable.

When summaries, sentiment scores, entity data, and call search are all degraded by the same upstream error rate, agent coaching operates on a fundamentally flawed picture of what happened on any given call. The product's value proposition for quality assurance collapses at the point of use.

WER accuracy: proven performance

Transcribing noisy customer audio

Real production audio is not clean. It contains background noise, overlapping speech, accented speakers, inconsistent microphone quality, and mid-sentence language changes. Our async benchmark shows that WER on clean audio differs significantly from WER on realistic call center conditions with noise, overlap, and accents. The gap between benchmark WER and production WER is where most CI products discover their real accuracy problem.

Multilingual accuracy benchmarks

Most STT APIs were built for American English and tested on clean audio. When speakers have regional accents, switch languages mid-conversation, or speak one of the 100+ languages covered by Solaria-1, accuracy gaps become product-level problems. A customer support platform serving Southeast Asian BPO operations needs reliable WER on Tagalog, Bengali, Tamil, and Punjabi, not just English. For multilingual meeting transcription use cases, language coverage depth is often the deciding factor between a product that works globally and one that degrades silently for non-English users.

Balancing WER performance and product cost

What WER threshold is acceptable for conversation intelligence?

Different CI applications tolerate different WER levels. The table below sets out production thresholds based on the accuracy requirements of each workflow:

| CI application | Target WER (lower is better) | Rationale |
| --- | --- | --- |
| Basic topic detection | 10-15% | Semantic intent survives despite individual word errors |
| Agent coaching, sentiment analysis | Lower thresholds recommended | Single-word substitutions flip sentiment scores |
| Compliance monitoring | Under 5%, ideally 1-3% | Required disclosure phrases must be captured exactly |
| CRM enrichment, NER | Lower thresholds recommended | Single character errors break entity matching |
| Legal, medical documentation | Under 5% | Errors carry direct legal or clinical risk |

The 5% threshold that looks acceptable in aggregate is already too high for compliance and CRM workflows, which are the exact use cases where conversation intelligence delivers its highest commercial value.

Does real-time transcription increase WER?

In general, streaming transcription produces output in fragments before the full sentence is available, which can limit the model's ability to use downstream context for disambiguation. The tradeoff is fundamental: lower latency typically means less context, which can mean higher WER. For voice agent use cases that require low final transcript latency, this tradeoff is unavoidable.

For post-call analytics, meeting note-taking, compliance review, and CRM enrichment, there is no latency requirement at transcript generation time. Async processing uses the full audio context, which produces lower WER, better diarization accuracy, and more consistent entity extraction.

What drives WER differences by language?

The primary architectural problem for multilingual ASR is the tokenizer. Most end-to-end models use vocabularies built for one language. When phonemes from a second language enter the acoustic stream, the tokenizer produces out-of-vocabulary tokens or character-level fallbacks, which translates directly to elevated WER on that language's content.

In ASR systems built around monolingual language identification pipelines, a mid-sentence language switch can cause the model to stall or produce garbled output. WER rises at the boundary because the language identification module has not yet updated its prior, so the acoustic model is still decoding against the wrong language's probability distribution.

Post-transcription WER improvement

Three mechanisms reduce effective WER after the base model runs:

  • Custom vocabulary: Adds domain-specific terms, product names, and proper nouns that the base model would likely mis-transcribe. Available on Starter and Growth plans, and directly addresses the NER and CRM enrichment failure modes described above.
  • Custom spelling: Enforces consistent spelling for entities in the transcript output.
  • Model fine-tuning: Domain-specific retraining on audio from your specific vertical, available on Enterprise plans.
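As a rough illustration of the custom spelling idea, the sketch below applies a spelling map to transcript text after transcription. It is a generic post-processing example and not the product's API: the rule table, the variant spellings, and the function name `apply_custom_spelling` are all hypothetical.

```python
import re

# Hypothetical rules: canonical spelling -> variants seen in transcripts.
SPELLING_RULES = {
    "Solaria-1": ["solaria one", "solaria 1"],
    "Gladia": ["gladya", "gladiah"],
}


def apply_custom_spelling(text: str, rules: dict = SPELLING_RULES) -> str:
    """Replace known variants with their canonical spelling, whole words only."""
    for canonical, variants in rules.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant) + r"\b"
            text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text


print(apply_custom_spelling("we tested solaria one with gladya last week"))
```

A map like this only fixes variants you have already observed. It cannot recover a word the model replaced with an unrelated term, which is why the base WER still matters.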

Our pricing includes async transcription with diarization, NER, sentiment analysis, translation, and code-switching in the base rate. For a team building an AI note-taker or a CCaaS analytics layer, this all-inclusive structure removes the most common source of cost model surprise at scale.

If your current transcription layer is producing 5%+ WER on the audio types your product actually processes, the downstream failures you are debugging in your LLM layer or CRM sync are likely not fixable at that layer. The ceiling is set by the words that reach those systems. Test Solaria-1 on your own noisy, multilingual audio with 10 free hours. Measure the WER difference against your production baseline before committing to the next layer of the stack.

FAQs

What is the WER formula used in conversation intelligence?

WER is calculated as (Substitutions + Deletions + Insertions) divided by the total number of reference words, expressed as a percentage. For a 750-word call with 38 errors, WER is approximately 5%.

What WER is acceptable for compliance monitoring?

Compliance monitoring requires WER below 5%, ideally 1-3%, because keyword detection fails on deletion errors that silently remove required disclosure phrases. At higher WER, the system cannot distinguish between a call where the disclosure was given and one where it was not.

Does diarization error rate (DER) affect agent coaching scores?

Yes, directly. When DER is high, the agent's words are misattributed to the customer and vice versa, which inverts sentiment scores and corrupts coaching evaluations even when WER itself is low.

Is Gladia's diarization available in real-time mode?

No. Speaker diarization powered by pyannoteAI's Precision-2 model is available in async (batch) workflows only. Real-time speaker attribution requires post-processing for higher accuracy.

Why does code-switching increase WER?

Most ASR tokenizers are built for a single language, so when a speaker switches languages mid-sentence, the model cannot map incoming phonemes to its vocabulary and produces out-of-vocabulary tokens at the switch point. WER typically degrades significantly at language boundaries in monolingual cascade pipelines.

How does high WER cause LLM hallucinations in CI summaries?

LLMs treat the transcript as ground truth, so when the transcript contains substitution errors, the LLM constructs a coherent summary of what the transcript says, not what was spoken. Reducing WER at the transcription layer is the primary mechanism for reducing hallucination rates in downstream LLM outputs.

Key terms glossary

Word Error Rate (WER): The percentage of words in a transcript that were substituted, deleted, or inserted relative to the correct reference transcript. WER sets the accuracy ceiling for all downstream conversation intelligence features.

Diarization Error Rate (DER): A measure of how often a transcription system misassigns speech to the wrong speaker, misses speech entirely, or hallucinates speaker segments. High DER corrupts sentiment attribution and agent coaching scores independent of WER.

Code-switching: The practice of alternating between two or more languages within a single conversation or utterance. Most monolingual ASR models fail at switch points, producing elevated WER on the words where the language changes.

Hallucination reduction: The improvement in downstream LLM output accuracy that results from providing the model with a lower-WER transcript. Many LLM hallucinations in CI summaries stem from corrupted transcription input, not the LLM itself.

Named Entity Recognition (NER): Automated extraction of structured entities (people, organizations, dates, phone numbers) from transcripts for CRM enrichment and analytics. NER accuracy has a direct dependency on WER because entity extraction can only identify what the transcription layer produces correctly.
