Factors affecting the accuracy of speech-to-text transcripts

Published on May 29, 2026

by Ani Ghazaryan

TL;DR: Production STT accuracy fails not because of model benchmarks, but because of the gap between studio evaluation audio and the messy, multilingual, overlapping speech real users produce. Four root causes drive that gap: input audio quality, speaker traits (accents, code-switching, and overlap), domain vocabulary deficits, and model training data diversity. WER alone doesn't capture production risk. Semantic accuracy and Diarization Error Rate matter just as much when CRM syncs, coaching scores, and AI summaries all depend on what the transcript gets right. Solaria-1 delivers on average 29% lower WER on conversational speech and 3x lower DER compared to alternatives, benchmarked across 7 datasets and 74+ hours of audio with open, reproducible methodology.

A model's baseline benchmark score rarely predicts its production Word Error Rate, and the problem rarely originates from the model itself. It originates from the gap between studio-quality evaluation audio and the messy, multilingual, overlapping speech your actual users produce.

Transcription accuracy is foundational for downstream AI features in your product. A wrong name can corrupt a CRM entry. A missed entity can produce a misleading coaching score. A mishandled language switch can degrade output quality. This article breaks down the four pillars that dictate STT accuracy, explains how to measure each, and gives you a framework for benchmarking models against the real-world conditions your users create.

Update: new model and updated pricing

Since this article was written, Gladia has released Solaria-3 and updated its pricing. Solaria-3 is our newest speech model, built for real-world business audio that’s noisy, fast-paced, and conversational. It ranks #1 on real English customer calls and across core European languages (EN, FR, DE, ES, IT), beating AssemblyAI, ElevenLabs, Deepgram, Mistral, and Speechmatics. Solaria-1 remains fully supported and is still the better pick for broad language coverage (100+ languages), code-switching, and real-time streaming. The two models are built to complement each other.

On pricing, Gladia now offers three plans: Starter (pay-as-you-go at $0.61/hr async / $0.75/hr real-time, with 10 free hours/month), Growth (as low as $0.20/hr async with volume discounts), and Enterprise (custom pricing, zero data retention, SLAs). Every paid plan includes all features, all languages, and full compliance (GDPR, HIPAA, SOC 2 Type 2).

Optimizing speech-to-text word error rate

Before diving into the failure modes, you need a consistent measurement framework. Accuracy claims mean nothing without specifying the metric, the audio condition, and the dataset.

How WER measures transcription quality

WER is the foundational metric for comparing STT systems. The formula counts word-level edits needed to turn a hypothesis into a reference transcript, divided by total reference words:

WER = (Substitutions + Deletions + Insertions) / Reference Words

Substitutions happen when the model replaces one word with another. Deletions happen when it drops a word entirely. Insertions add a word that was never spoken. A standard WER calculation across a 20-word sentence with two errors yields a 10% WER.
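
To make the formula concrete, here is a minimal sketch of a WER calculation using a word-level edit distance. The function name `word_error_rate` and the whitespace tokenization with lowercasing are assumptions made for illustration; production tools such as JiWER apply more careful normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Dropping one word out of five gives a WER of 1/5.
print(word_error_rate("do not approve the refund", "do approve the refund"))
```

Note that the single deleted word in this usage example is "not", so a 20% WER here corresponds to a transcript whose meaning is inverted.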

WER is necessary but not sufficient on its own. Three other metrics matter for production evaluation:

Metric	What it measures	Key limitation
WER	Word-level edit distance between hypothesis and reference	Doesn't distinguish high-impact errors from minor ones
Normalized Error Rate (NER)	WER after text normalization (lowercasing, punctuation removal, number expansion)	May obscure meaningful differences between systems
Semantic Accuracy	Whether the meaning of the utterance is preserved	A 5% WER transcript that drops "not" from "do not approve" has 0% semantic accuracy for that clause
RTF	Processing time divided by audio duration	RTF below 1.0 indicates faster-than-real-time processing. Gladia's processing time is approximately 60 seconds per 3,600 seconds (1 hour) of audio content

Optimizing purely for WER doesn't always maximize meaning preservation. A transcript can look accurate by edit-distance while still failing the downstream system that depends on it.

Four pillars of STT accuracy

Every transcription failure in production traces back to one of four root causes:

Input audio: Sample rate, codec, and signal-to-noise ratio determine the acoustic information available to the model.
Speaker traits: Accents, code-switching, and concurrent voices create patterns the model may not have learned during training.
Domain vocabulary: Out-of-vocabulary (OOV) words and named entities cause substitution errors that break downstream NLP pipelines.
Model architecture and training data: The breadth and diversity of training data sets the ceiling for how well the model handles the first three pillars.

Input audio: minimizing WER and latency

The acoustic signal reaching the model limits what's possible regardless of model quality. Internet connectivity drops and packet loss in VoIP calls further degrade the signal by forcing the model to reconstruct audio from incomplete data.

Fine-tuning sample rate for WER

Sample rate determines the maximum frequency the model receives: audio sampled at a given rate can only represent frequencies up to half that rate (the Nyquist limit). At 8kHz (standard narrowband telephony), the model receives content up to 4kHz. At 16kHz (wideband), the model gains access to the 4kHz to 8kHz band, which carries much of the energy that distinguishes consonants such as /s/ and /f/ and can improve consonant recognition for modern ASR systems.

For any audio you control at the capture layer, 16kHz or higher is the correct starting point. Gladia accepts WAV, M4A, FLAC, AAC, and URL inputs, with recommended parameters documented per use case.
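
As a quick pre-flight check, the sketch below reads a WAV header with Python's standard `wave` module and reports whether the file meets the 16kHz starting point. The helper names `describe_wav` and `make_silent_wav` are hypothetical, and the silent in-memory file exists only so the example runs without real audio.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Read basic capture parameters from a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "nyquist_hz": rate / 2,  # highest frequency the file can represent
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
            "meets_16khz": rate >= 16000,
        }


def make_silent_wav(sample_rate: int, seconds: float = 1.0) -> io.BytesIO:
    """Build a mono 16-bit silent WAV in memory (illustration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    buf.seek(0)
    return buf


print(describe_wav(make_silent_wav(8000)))
print(describe_wav(make_silent_wav(16000)))
```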

Choosing codecs for STT performance

Lossless codecs (FLAC, WAV) are generally preferred for STT as they preserve more of the original waveform. Lossy codecs (MP3, AAC, compressed VoIP) may introduce artifacts through compression. For any audio you control at the source, favor WAV or FLAC.

Noise: what's your accuracy cost?

Background noise degrades accuracy by masking speech when it occupies similar frequency ranges as the vocal signal. HVAC hum adds constant low-frequency energy. Street traffic and call center floor noise introduce overlapping conversations in the ambient mix.

When SNR drops far enough that background noise approaches or exceeds speech signal levels, deletion errors increase as words disappear below the noise floor and insertion errors increase as noise gets misclassified as speech.
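
SNR is usually expressed in decibels as ten times the base-10 logarithm of the ratio of speech power to noise power. A minimal sketch, assuming you have separate speech and noise sample arrays (for example, a speech segment and a silence segment from the same recording); the function names are illustrative:

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


speech = [1.0, -1.0] * 100  # toy signal with power 1.0
noise = [0.1, -0.1] * 100   # toy noise with power 0.01
print(round(snr_db(speech, noise), 2))  # 20.0
```

An SNR of 0 dB means noise and speech carry equal power, which is the regime where the deletion and insertion errors described above become common.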

For Contact Center as a Service (CCaaS) environments where audio quality is outside your control, model robustness on noisy audio becomes the primary evaluation criterion. Test vendors explicitly on telephony-grade, noisy recordings before committing.

How speaker traits influence transcription

Models learn acoustic patterns from training data. When a speaker's phoneme patterns don't match the demographics of the training corpus, accuracy drops. Many legacy STT systems were built on datasets with limited demographic diversity, so speakers from underrepresented groups often experience higher error rates.

Regional dialects and STT performance

L1 (native language) phoneme patterns bleed into L2 (second language) speech in predictable ways. An L1-Hindi speaker producing the English phoneme /ð/ (the first sound in "the") typically substitutes a /d/-like stop, producing "de" in place of "the." A model trained on American English encounters "de" and searches its vocabulary for phonetically similar common words, often returning "they" or "their," not the intended word.

The result is systematic substitution errors tied to the speaker's L1, not random noise. Products serving multilingual user bases accumulate these errors across every call, and they compound in downstream systems where a plausible-sounding but wrong word corrupts CRM entries or coaching scores.

How code-switching affects STT

Code-switching describes mid-conversation language changes, from switching sentences to switching within a single utterance ("I need this done, c'est très urgent"). Single-language ASR models struggle at these boundaries: when the model identifies the first audio segment as English, it may apply English phoneme and language models to the French segment, producing errors that look like confident but wrong transcripts.

The contact center impact is direct: agents in Southeast Asia, South Asia, and Latin America routinely switch between English and local languages within a single call. Without native code-switching support, the model may produce errors in the switching segments that break downstream sentiment analysis and entity extraction.

Solaria-1 handles code-switching by assigning a language code per word in the transcript output. A typical response for a mid-sentence English-to-French switch looks like this:

{
  "transcription": [
    {
      "type": "word",
      "transcription": "I",
      "language": "en",
      "start_time": 0.0,
      "end_time": 0.2
    },
    {
      "type": "word",
      "transcription": "need",
      "language": "en",
      "start_time": 0.2,
      "end_time": 0.5
    },
    {
      "type": "word",
      "transcription": "c'est",
      "language": "fr",
      "start_time": 1.2,
      "end_time": 1.5
    }
  ]
}

The code-switching documentation covers configuration for both async and real-time modes. For a deeper technical comparison of code-switching behavior across providers, the technical comparison for production speech-to-text examines specific implementation differences.
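
Per-word language codes make it straightforward to locate switch points downstream. The sketch below groups consecutive words that share a language into segments; it assumes only the field names shown in the sample response above, and `language_segments` is a hypothetical helper, not part of any SDK.

```python
from itertools import groupby

# Word entries shaped like the sample response above.
words = [
    {"transcription": "I", "language": "en", "start_time": 0.0, "end_time": 0.2},
    {"transcription": "need", "language": "en", "start_time": 0.2, "end_time": 0.5},
    {"transcription": "c'est", "language": "fr", "start_time": 1.2, "end_time": 1.5},
]


def language_segments(words):
    """Merge consecutive same-language words into one segment each."""
    segments = []
    for language, group in groupby(words, key=lambda w: w["language"]):
        group = list(group)
        segments.append({
            "language": language,
            "text": " ".join(w["transcription"] for w in group),
            "start_time": group[0]["start_time"],
            "end_time": group[-1]["end_time"],
        })
    return segments


for segment in language_segments(words):
    print(segment)
```

Each boundary between two segments is a code-switch, which is where a golden dataset should concentrate its manually verified reference text.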

How overlapping speech inflates DER

Diarization Error Rate (DER) measures speaker attribution accuracy. When two speakers talk simultaneously, the model receives a mixed signal: a single audio frame containing two voices. This breaks the assumption that each frame belongs to one speaker, forcing the diarization system to make an attribution guess under ambiguous conditions.

DER is calculated from three components: false alarm (attributing speech to a non-existent speaker), missed detection (failing to detect speech), and speaker confusion (assigning speech to the wrong speaker). Overlapping speech inflates all three. Missed detection increases when the overlapping frame is suppressed entirely. Speaker confusion increases when the model assigns the mixed frame to the wrong speaker. False alarms increase when ambient bleed from the overlapping voice is mistaken for a third speaker. In a two-speaker contact center call, a speaker confusion error can reassign agent utterances to the customer, corrupting sentiment scores and CRM ownership attribution in downstream workflows.
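
The three components combine into a single ratio. In the standard formulation, the sum of the error durations is divided by the total duration of reference speech. A minimal sketch, with illustrative numbers that are not measurements from any benchmark:

```python
def diarization_error_rate(false_alarm_s, missed_s, confusion_s, reference_speech_s):
    """DER = (false alarm + missed detection + speaker confusion) / reference speech.

    All arguments are durations in seconds.
    """
    if reference_speech_s <= 0:
        raise ValueError("reference speech duration must be positive")
    return (false_alarm_s + missed_s + confusion_s) / reference_speech_s


# Hypothetical 5-minute call: 6 s of false alarm, 12 s missed, 18 s confused.
der = diarization_error_rate(6.0, 12.0, 18.0, 300.0)
print(f"DER = {der:.1%}")  # DER = 12.0%
```

Because overlapping speech can be counted once per active speaker in the reference, DER can exceed 100% on heavily overlapped audio.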

Gladia's async diarization is powered by pyannoteAI's Precision-2 model, which is designed to handle concurrent speech. Diarization is available in async workflows only. The speaker diarization documentation covers configuration options, including the maximum expected number of speakers per recording.

Speech speed and clarity's WER impact

Fast speech can blur word boundaries, and deletion errors may increase when the model's training data doesn't include enough examples of rapid, colloquial speech patterns from the speaker's demographic. Rate of speech varies across languages and regional dialects.

Domain vocabulary and transcription WER

Out-of-vocabulary words are a significant source of semantic errors. A generic model encountering an unfamiliar term may produce a phonetically similar substitution that is semantically wrong for your domain. Industry-specific audio is especially prone to this problem.

Jargon's impact on STT accuracy

Industry-specific terms fail because generic models have never encountered them. "Kubernetes" gets broken into familiar phoneme sequences and returns a plausible-sounding but wrong word. "Metformin" returns a common-word approximation. The model isn't hallucinating randomly; it's making phonetically similar guesses that corrupt the downstream system that relies on the transcript.

Achieving accurate entity transcription

Named entities and acronyms fail for the same reason: proper names sit outside the frequency distribution of general training data, and standard company names, product names, or software tools aren't in the model's vocabulary. When those entity errors reach your CRM sync, they create orphaned records or misattributed contacts.

Acronym transcription fails in two directions: the model expands "AWS" into individual spoken letters or collapses spoken letters into a word that sounds similar. The solution is explicit vocabulary injection rather than post-processing correction.

STT accuracy with novel vocabulary

Custom vocabulary works by providing the model with phonetic hints for terms it would otherwise substitute. Gladia's custom vocabulary parameter accepts both simple strings and structured objects with pronunciation variants and intensity weighting. The intensity field controls how aggressively the model favors the provided term over its default hypothesis.

{
  "url": "YOUR_AUDIO_URL",
  "custom_vocabulary": [
    "Kubernetes",
    {
      "value": "Solaria-1",
      "intensity": 0.8,
      "pronunciations": ["Solar-ia", "Suh-lar-ee-uh"],
      "language": "en"
    }
  ]
}
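
If you maintain the term list in code, a small builder keeps the payload consistent across requests. This is a sketch that only reproduces the JSON shape shown above; `vocabulary_entry` and `build_request` are hypothetical helpers, and no endpoint or client library is assumed.

```python
import json


def vocabulary_entry(value, intensity=None, pronunciations=None, language=None):
    """Return a plain string, or a structured object when options are given."""
    options = {
        "intensity": intensity,
        "pronunciations": pronunciations,
        "language": language,
    }
    options = {key: val for key, val in options.items() if val is not None}
    if not options:
        return value
    return {"value": value, **options}


def build_request(audio_url, entries):
    """Assemble a request body mirroring the JSON example above."""
    return {"url": audio_url, "custom_vocabulary": list(entries)}


payload = build_request(
    "YOUR_AUDIO_URL",
    [
        vocabulary_entry("Kubernetes"),
        vocabulary_entry(
            "Solaria-1",
            intensity=0.8,
            pronunciations=["Solar-ia", "Suh-lar-ee-uh"],
            language="en",
        ),
    ],
)
print(json.dumps(payload, indent=2))
```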

Optimizing model training for better WER

Modern ASR systems use advanced architectures that can process broader utterance context than earlier systems. Both architecture and training data diversity are critical factors in determining production accuracy.

Diverse training data determines production accuracy: A model trained primarily on clean studio recordings may struggle when encountering real-world noise, compression artifacts, and non-native phoneme patterns. Building diverse training data is a significant undertaking, as documented in STT API benchmark analysis.

Accent recognition performance: Solaria-1 supports 100+ languages, including 42 that no other API-level STT competitor covers. Automatic language detection is available. Language identification accuracy can degrade on heavily accented non-native speech, which is why evaluating on audio representative of your actual users matters more than vendor-reported averages.

Boost STT accuracy in target domains: Custom Language Models (CLMs) use domain-specific text corpora to improve contextual disambiguation in specialized domains. For teams processing high-frequency domain audio (earnings calls, medical dictation, legal proceedings), domain-adapted approaches may help reduce OOV substitution rates.

Model size: accuracy and cost trade-offs: Published WER benchmarks on open-source models reflect LibriSpeech test-clean, a dataset of read speech from audiobooks, not production conversational audio, where the same models consistently show higher WER. Self-hosting requires GPU infrastructure and MLOps engineering overhead, which is why many teams move to managed APIs.

Dissecting common transcription errors

Once you understand the four pillars, you can categorize production errors systematically rather than treating every transcript failure as a model deficiency.

Speech-to-text in noisy environments

Test explicitly on audio with varied Signal-to-Noise Ratios. Models that maintain consistent performance across different SNR conditions are more robust than those that excel only in clean audio.
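
One way to build such a test set is to mix a noise recording into clean speech at controlled SNR levels. The sketch below scales the noise so the mix hits a target SNR; it operates on plain lists of float samples, and `mix_at_snr` is an illustrative name. Reading and writing real audio files is left out.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, target_snr_db):
    """Add noise to speech, scaled so that speech/noise power matches the target SNR."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise is silent; cannot scale to a target SNR")
    target_ratio = 10 ** (target_snr_db / 10)
    # Solve speech_power / (gain^2 * noise_power) = target_ratio for gain.
    gain = math.sqrt(mean_power(speech) / (noise_power * target_ratio))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0] * 50
noise = [0.5, -0.5] * 50
for snr in (20, 10, 0):
    mixed = mix_at_snr(speech, noise, snr)
    print(snr, round(mixed[0], 3))
```

Running each vendor on the same clean file mixed at several SNR levels shows how quickly WER degrades as noise increases, rather than a single score at one condition.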

Code-switching transcription quality

Code-switching failures often occur on the segments where language changes happen, because the model may apply incorrect language models to the audio. Measuring this requires a golden dataset that includes language-switching segments with manually verified reference transcripts in both languages. Standard WER tools like JiWER let you compute WER per file and compare results across the dataset.

Reducing WER for specialized terms

Several interventions can help reduce domain-specific WER:

Custom vocabulary injection: Pass domain-specific terms via the API parameter on every request. This targets named entities and jargon directly.
Post-processing correction: Route the transcript through a domain-aware correction pipeline to catch substitution patterns.
Fine-tuned language model: For very high-volume specialized domains (medical transcription, legal proceedings), domain-adapted models may reduce OOV substitution at inference time.

Solaria-1: head-to-head WER data

Gladia's async benchmark evaluates Solaria-1 against 8 providers across 7 datasets and 74+ hours of audio with reproducible methodology.

Solaria-1 delivers on average 29% lower WER on conversational speech and 3x lower DER compared to alternatives across the evaluated conditions.

How to measure production STT accuracy

Vendor-reported benchmarks give you a starting point. Your own production audio gives you the answer.

Actual WER vs. reported benchmarks

Clean-audio benchmarks show modern systems achieving low WER under ideal conditions. Production conversational audio in real contact centers or meeting recordings often shows higher WER from the same models when tested on uncurated conditions. That gap is the production risk you must measure with your own audio.

Assessing STT for your specific use cases

Build a golden dataset from your actual production audio, not synthetic recordings. Pull recordings that represent your actual use case: noise levels, accents, domain jargon, and conversation types. Produce reference transcripts for evaluation.

Building reliable STT benchmarks

Step 1: Gather a representative golden dataset. Collect audio files that match your production conditions: noise profile, speaker accents, domain jargon, and audio quality. Include files that reflect your actual traffic patterns. Produce reference transcripts for each file.

Step 2: Transcribe across candidate vendors. Submit identical audio files to each candidate API using consistent settings. Record processing time, latency, and output format for each. Use consistent feature configurations for fair comparison.

Step 3: Calculate WER and DER using open-source tools. Tools like the JiWER Python library can compute WER per file and average across the dataset. Review outlier files with significantly higher WER than the mean to understand where each model struggles. Consider cost-per-hour alongside accuracy.
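
The outlier review in Step 3 can be sketched as follows. The per-file WER values are made-up placeholders, and the threshold of 1.5 standard deviations above the mean is an assumption you should tune to your dataset size.

```python
from statistics import mean, pstdev


def flag_outliers(wer_by_file, z_threshold=1.5):
    """Return the mean WER and the files whose WER sits well above it."""
    values = list(wer_by_file.values())
    average = mean(values)
    cutoff = average + z_threshold * pstdev(values)
    outliers = sorted(name for name, wer in wer_by_file.items() if wer > cutoff)
    return average, outliers


# Placeholder results for one vendor on a five-file golden dataset.
wer_by_file = {
    "call_001.wav": 0.08,
    "call_002.wav": 0.10,
    "call_003.wav": 0.09,
    "call_004.wav": 0.31,
    "call_005.wav": 0.07,
}

average, outliers = flag_outliers(wer_by_file)
print(f"mean WER: {average:.2%}")
print("review first:", outliers)
```

Listening to the flagged files usually reveals which of the four pillars is responsible, for example a noisy line, a code-switching speaker, or dense jargon.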

STT evaluation checklist:

Golden dataset covers your actual noise levels, not studio audio
Test files include accented speech representative of your users
Reference transcripts produced for evaluation
Domain-specific terms appear in test files
Code-switching segments included if your product serves multilingual users
DER measured for multi-speaker audio
Cost modeled with all features enabled

Start with Gladia's free tier and run Solaria-1 against your own golden dataset. Compare the results against our published benchmark methodology to see how your audio conditions stack up.

FAQs

What is an acceptable WER in production?

For clean, single-speaker audio like podcasts, low single-digit WER is achievable. For noisy multi-speaker call center audio, higher WER is often acceptable depending on downstream LLM robustness.

Which factor impacts STT accuracy the most?

Speaker overlap, non-native accents, and code-switching are significant drivers of WER degradation in production because they introduce acoustic patterns underrepresented in most training datasets. Background noise compounds these effects.

How much does custom vocabulary improve accuracy for domain-specific terms?

Implementing a custom vocabulary parameter with targeted domain-specific terms can reduce domain-specific entity error rates in production deployments. Gladia's intensity and pronunciation fields let you tune aggressiveness per term to avoid overcorrection on ambiguous phoneme sequences.

How long does it take to validate an STT integration?

Building and testing a representative golden dataset requires careful planning and verification of reference transcripts. Integrating a production-ready STT API like Gladia typically requires minimal developer time from first API call to a working pipeline, based on reported integration timelines from customers across the meeting assistant and CCaaS segments.

Key terms glossary

Word Error Rate (WER): The percentage of word-level edits (substitutions, deletions, insertions) needed to correct a transcript hypothesis to match the reference. The standard ASR benchmark metric.

Diarization Error Rate (DER): Measures speaker attribution accuracy by summing the durations of false alarms, missed detections, and speaker confusions and dividing by the total duration of reference speech. Critical for multi-speaker transcription quality.

Code-switching: Mid-conversation language changes within a single utterance or across sentences. Common in multilingual contact centers and requires native model support to avoid WER degradation on the switching segments.

Real-Time Factor (RTF): Processing time divided by audio duration. RTF below 1.0 indicates faster-than-real-time processing. Gladia's processing time is approximately 60 seconds per 3,600 seconds (1 hour) of audio content, giving an RTF of approximately 0.0167.

Out-of-vocabulary (OOV): Words absent from the model's training data that cause phonetically similar substitution errors. Domain jargon and proper names are common OOV sources.

Signal-to-Noise Ratio (SNR): Ratio of speech signal strength to background noise measured in decibels. Lower SNR degrades WER by masking phonemes below the noise floor.
