A model's baseline benchmark score rarely predicts its production Word Error Rate, and the problem rarely originates from the model itself. It originates from the gap between studio-quality evaluation audio and the messy, multilingual, overlapping speech your actual users produce.
Transcription accuracy is foundational for downstream AI features in your product. A wrong name can corrupt a CRM entry. A missed entity can produce a misleading coaching score. A mishandled language switch can degrade output quality. This article breaks down the four pillars that dictate STT accuracy, explains how to measure each, and gives you a framework for benchmarking models against the real-world conditions your users create.
Optimizing speech-to-text word error rate
Before diving into the failure modes, you need a consistent measurement framework. Accuracy claims mean nothing without specifying the metric, the audio condition, and the dataset.
How WER measures transcription quality
WER is the foundational metric for comparing STT systems. The formula counts word-level edits needed to turn a hypothesis into a reference transcript, divided by total reference words:
WER = (Substitutions + Deletions + Insertions) / Reference Words
Substitutions happen when the model replaces one word with another. Deletions happen when it drops a word entirely. Insertions add a word that was never spoken. A standard WER calculation across a 20-word sentence with two errors yields a 10% WER.
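To make the formula concrete, here is a minimal sketch of a word-level edit-distance calculation in plain Python. The function name `wer_counts` and the sample sentences are illustrative assumptions, and the sketch compares whitespace-split tokens exactly, with no text normalization.

```python
def wer_counts(reference: str, hypothesis: str) -> dict:
    """Return substitution, deletion, and insertion counts plus WER."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)      # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)      # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match, no edit
                continue
            t, s, d, n = dp[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total edits first, so this keeps the cheapest path.
            dp[i][j] = min(substitution, deletion, insertion)
    total, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": total / len(ref) if ref else 0.0,
    }


# Illustrative 20-word reference; the hypothesis has one substitution
# ("acme" -> "acne") and one deletion (the dropped "a").
reference = ("please update the customer record for acme corporation and "
             "schedule a follow up call with the account manager next tuesday")
hypothesis = ("please update the customer record for acne corporation and "
              "schedule follow up call with the account manager next tuesday")
result = wer_counts(reference, hypothesis)
print(result["wer"])  # 2 edits / 20 reference words = 0.1
```

Keeping the three counts separate, not just the total, makes it easier to see whether a model tends to drop words, add them, or swap them.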
WER is necessary but not sufficient on its own. Three other metrics matter for production evaluation:
| Metric | What it measures | Key limitation |
| --- | --- | --- |
| WER | Word-level edit distance between hypothesis and reference | Doesn't distinguish high-impact errors from minor ones |
| Normalized Error Rate (NER) | WER after text normalization (lowercasing, punctuation removal, number expansion) | May obscure meaningful differences between systems |
| Semantic Accuracy | Whether the meaning of the utterance is preserved | A 5% WER transcript that drops "not" from "do not approve" has 0% semantic accuracy for that clause |
| RTF | Processing time divided by audio duration | RTF below 1.0 indicates faster-than-real-time processing. Gladia's processing time is approximately 60 seconds per 3,600 seconds (1 hour) of audio content |
Optimizing purely for WER doesn't always maximize meaning preservation. A transcript can look accurate by edit-distance while still failing the downstream system that depends on it.
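The normalization step behind the normalized error rate can be sketched as follows. This is a deliberately small illustration: the `normalize` function and the single-digit number map are assumptions made for the example, and production normalizers handle far more cases (multi-digit numbers, currencies, abbreviations).

```python
import re

# Hypothetical, intentionally tiny number map for illustration only.
_NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and expand single digits to words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation, keep apostrophes
    words = [_NUMBER_WORDS.get(word, word) for word in text.split()]
    return " ".join(words)


raw_reference = "Please call me at 5, OK?"
raw_hypothesis = "please call me at five ok"

# The raw strings differ on casing, punctuation, and number format,
# but they are identical once both sides are normalized.
print(normalize(raw_reference) == normalize(raw_hypothesis))  # True
```

Apply the same normalizer to every vendor's output and to the reference before scoring, so formatting differences are not counted as recognition errors.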
Four pillars of STT accuracy
Every transcription failure in production traces back to one of four root causes:
- Input audio: Sample rate, codec, and signal-to-noise ratio determine the acoustic information available to the model.
- Speaker traits: Accents, code-switching, and concurrent voices create patterns the model may not have learned during training.
- Domain vocabulary: Out-of-vocabulary (OOV) words and named entities cause substitution errors that break downstream NLP pipelines.
- Model architecture and training data: The breadth and diversity of training data sets the ceiling for how well the model handles the first three pillars.
Input audio: minimizing WER and latency
The acoustic signal reaching the model limits what's possible regardless of model quality. Internet connectivity drops and packet loss in VoIP calls further degrade the signal by forcing the model to reconstruct audio from incomplete data.
Fine-tuning sample rate for WER
Sample rate determines the maximum frequency the model receives: a signal sampled at a given rate can only represent frequencies up to half that rate (the Nyquist limit). At 8kHz (standard narrowband telephony), the model receives nothing above 4kHz. At 16kHz (wideband), the model gains access to the 4 to 8kHz band, where much of the energy of fricative consonants such as /s/ and /f/ sits, which can improve consonant recognition for modern ASR systems.
For any audio you control at the capture layer, 16kHz or higher is the correct starting point. Gladia accepts WAV, M4A, FLAC, AAC, and URL inputs, with recommended parameters documented per use case.
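A quick way to audit captured audio against that 16kHz guidance is to read the header of each file before sending it for transcription. The sketch below uses Python's standard `wave` module, which only reads uncompressed PCM WAV files. The function name `inspect_wav` and the returned keys are illustrative assumptions.

```python
import wave

MIN_SAMPLE_RATE_HZ = 16_000  # threshold taken from the guidance above


def inspect_wav(path: str) -> dict:
    """Read a PCM WAV header and report whether it meets the 16 kHz guidance."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
            "max_frequency_hz": rate / 2,   # Nyquist limit
            "meets_16khz": rate >= MIN_SAMPLE_RATE_HZ,
        }
```

Calling `inspect_wav("call.wav")` on a telephony recording (the path is hypothetical) would report a sample rate of 8000 and `meets_16khz` as false. Upsampling such a file afterwards does not restore the frequencies that were never captured.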
Choosing codecs for STT performance
Lossless codecs (FLAC, WAV) are generally preferred for STT as they preserve more of the original waveform. Lossy codecs (MP3, AAC, compressed VoIP) may introduce artifacts through compression. For any audio you control at the source, favor WAV or FLAC.
Noise: what's your accuracy cost?
Background noise degrades accuracy by masking speech when it occupies the same frequency ranges as the vocal signal. HVAC hum adds constant low-frequency energy. Street traffic adds intermittent broadband noise, and call center floor noise introduces overlapping conversations into the ambient mix.
When SNR drops far enough that background noise approaches or exceeds speech signal levels, deletion errors increase as words disappear below the noise floor and insertion errors increase as noise gets misclassified as speech.
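SNR is conventionally expressed in decibels as ten times the base-10 logarithm of the ratio of speech power to noise power. A minimal sketch, assuming you have separate sample arrays for a speech segment and a noise-only segment (for example, a stretch of silence from the same recording). The function names are illustrative.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels from speech and noise-only samples."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


speech = [1.0, -1.0] * 100     # synthetic stand-in for a speech segment
noise = [0.1, -0.1] * 100      # noise at one tenth of the speech amplitude
print(round(snr_db(speech, noise), 6))   # 20.0 dB
print(round(snr_db(speech, speech), 6))  # 0.0 dB: noise as loud as speech
```

At 0 dB the noise carries as much power as the speech, which is the regime the paragraph above describes.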
For Contact Center as a Service (CCaaS) environments where audio quality is outside your control, model robustness on noisy audio becomes the primary evaluation criterion. Test vendors explicitly on telephony-grade, noisy recordings before committing.
How speaker traits influence transcription
Models learn acoustic patterns from training data. When a speaker's phoneme patterns don't match the demographics of the training corpus, accuracy drops. Many legacy STT systems were built on datasets with limited demographic diversity, so speakers from underrepresented groups often experience higher error rates.
Regional dialects and STT performance
L1 (native language) phoneme patterns bleed into L2 (second language) speech in predictable ways. An L1-Hindi speaker producing the English phoneme /ð/ (as in "the") typically substitutes a dental /d/, producing "de" in place of "the." A model trained on American English encounters "de" and searches its vocabulary for phonetically similar common words, often returning "they" or "their" instead of the intended word.
The result is systematic substitution errors tied to the speaker's L1, not random noise. Products serving multilingual user bases accumulate these errors across every call, and they compound in downstream systems where a correctly sounding but wrong word corrupts CRM entries or coaching scores.
How code-switching affects STT
Code-switching describes mid-conversation language changes, from switching sentences to switching within a single utterance ("I need this done, c'est très urgent"). Single-language ASR models struggle at these boundaries: when the model identifies the first audio segment as English, it may apply English phoneme and language models to the French segment, producing errors that look like confident but wrong transcripts.
The contact center impact is direct: agents in Southeast Asia, South Asia, and Latin America routinely switch between English and local languages within a single call. Without native code-switching support, the model may produce errors in the switching segments that break downstream sentiment analysis and entity extraction.
Solaria-1 handles code-switching by assigning a language code per word in the transcript output. A typical response for a mid-sentence English-to-French switch looks like this:
{
  "transcription": [
    {
      "type": "word",
      "transcription": "I",
      "language": "en",
      "start_time": 0.0,
      "end_time": 0.2
    },
    {
      "type": "word",
      "transcription": "need",
      "language": "en",
      "start_time": 0.2,
      "end_time": 0.5
    },
    {
      "type": "word",
      "transcription": "c'est",
      "language": "fr",
      "start_time": 1.2,
      "end_time": 1.5
    }
  ]
}
The code-switching documentation covers configuration for both async and real-time modes. For a closer look at code-switching behavior across providers, the technical comparison for production speech-to-text examines specific implementation differences.
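Per-word language codes make it straightforward to locate switch points downstream. The sketch below groups consecutive words by language. It assumes the response shape shown in the example above (`transcription`, `type`, `language`, `start_time`, `end_time`), so check the field names against the current API documentation before relying on them. The `language_segments` helper is hypothetical.

```python
from itertools import groupby


def language_segments(response: dict) -> list:
    """Group consecutive words that share a language code into segments."""
    words = [w for w in response["transcription"] if w.get("type") == "word"]
    segments = []
    for language, group in groupby(words, key=lambda w: w["language"]):
        group = list(group)
        segments.append({
            "language": language,
            "text": " ".join(w["transcription"] for w in group),
            "start_time": group[0]["start_time"],
            "end_time": group[-1]["end_time"],
        })
    return segments


# Sample payload mirroring the example response above.
sample = {
    "transcription": [
        {"type": "word", "transcription": "I", "language": "en",
         "start_time": 0.0, "end_time": 0.2},
        {"type": "word", "transcription": "need", "language": "en",
         "start_time": 0.2, "end_time": 0.5},
        {"type": "word", "transcription": "c'est", "language": "fr",
         "start_time": 1.2, "end_time": 1.5},
    ]
}

segments = language_segments(sample)
switch_count = len(segments) - 1   # one English-to-French switch
print([(s["language"], s["text"]) for s in segments])
```

Segment boundaries are also useful for evaluation, since they let you score the words around each switch separately from the rest of the call.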
How overlapping speech inflates DER
Diarization Error Rate (DER) measures speaker attribution accuracy. When two speakers talk simultaneously, the model receives a mixed signal: a single audio frame containing two voices. This breaks the assumption that each frame belongs to one speaker, forcing the diarization system to make an attribution guess under ambiguous conditions.
DER is calculated from three components: false alarm (labeling non-speech audio as speech), missed detection (failing to detect speech that is present), and speaker confusion (assigning speech to the wrong speaker). Their combined duration is divided by the total duration of reference speech. Overlapping speech inflates all three. Missed detection increases when the overlapping frame is suppressed entirely. Speaker confusion increases when the model assigns the mixed frame to the wrong speaker. False alarms increase when ambient bleed from the overlapping voice is mistaken for a third speaker. In a two-speaker contact center call, a speaker confusion error can reassign agent utterances to the customer, corrupting sentiment scores and CRM ownership attribution in downstream workflows.
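In its conventional form, DER sums the durations of the three error types and divides by the total duration of reference speech. A minimal sketch of that arithmetic follows. The function name and the example durations are illustrative assumptions, not measurements.

```python
def diarization_error_rate(false_alarm_s: float, missed_s: float,
                           confusion_s: float, reference_speech_s: float) -> float:
    """DER = (false alarm + missed detection + confusion) / reference speech time."""
    if reference_speech_s <= 0:
        raise ValueError("reference speech duration must be positive")
    return (false_alarm_s + missed_s + confusion_s) / reference_speech_s


# Hypothetical call with 600 s of reference speech:
# 6 s false alarm, 12 s missed speech, 18 s attributed to the wrong speaker.
der = diarization_error_rate(6.0, 12.0, 18.0, 600.0)
print(f"{der:.1%}")  # 6.0%
```

Reporting the three components separately shows whether overlap is mostly costing you missed speech or speaker confusion, which point to different fixes.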
Gladia's async diarization is powered by pyannoteAI's Precision-2 model, which is designed to handle concurrent speech. Diarization is available in async workflows only. The speaker diarization documentation covers configuration options, including the maximum expected number of speakers per recording.
Speech speed and clarity's WER impact
Fast speech can blur word boundaries, and deletion errors may increase when the model's training data doesn't include enough examples of rapid, colloquial speech patterns from the speaker's demographic. Rate of speech varies across languages and regional dialects.
Domain vocabulary and transcription WER
Out-of-vocabulary words are a significant source of semantic errors. A generic model encountering an unfamiliar term may produce a phonetically similar substitution that is semantically wrong for your domain. Industry-specific audio often encounters this problem.
Jargon's impact on STT accuracy
Industry-specific terms fail because generic models have never encountered them. "Kubernetes" gets broken into familiar phoneme sequences and returns a plausible-sounding but wrong word. "Metformin" returns a common-word approximation. The model isn't hallucinating randomly; it's making phonetically similar guesses that corrupt the downstream system that relies on the transcript.
Achieving accurate entity transcription
Named entities and acronyms fail for the same reason: proper names sit outside the frequency distribution of general training data, so many company names, product names, and software tools aren't in the model's vocabulary. When those entity errors reach your CRM sync, they create orphaned records or misattributed contacts.
Acronym transcription fails in two directions: the model expands "AWS" into individual spoken letters or collapses spoken letters into a word that sounds similar. The solution is explicit vocabulary injection rather than post-processing correction.
STT accuracy with novel vocabulary
Custom vocabulary works by providing the model with phonetic hints for terms it would otherwise substitute. Gladia's custom vocabulary parameter accepts both simple strings and structured objects with pronunciation variants and intensity weighting. The intensity field controls how aggressively the model favors the provided term over its default hypothesis.
{
  "url": "YOUR_AUDIO_URL",
  "custom_vocabulary": [
    "Kubernetes",
    {
      "value": "Solaria-1",
      "intensity": 0.8,
      "pronunciations": ["Solar-ia", "Suh-lar-ee-uh"],
      "language": "en"
    }
  ]
}
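When the term list comes from several sources (a CRM export, a product catalog, a glossary), it helps to assemble the request body programmatically. The sketch below builds the same structure as the example above and drops case-insensitive duplicates. It uses only the field names shown in that example, `build_request_body` is a hypothetical helper, and no HTTP call is made.

```python
import json


def build_request_body(audio_url: str, terms: list) -> dict:
    """Assemble a request body, keeping the first occurrence of each term."""
    seen = set()
    vocabulary = []
    for term in terms:
        # Entries are either plain strings or objects with a "value" key.
        value = term if isinstance(term, str) else term["value"]
        key = value.casefold()
        if key in seen:
            continue
        seen.add(key)
        vocabulary.append(term)
    return {"url": audio_url, "custom_vocabulary": vocabulary}


body = build_request_body(
    "YOUR_AUDIO_URL",
    [
        "Kubernetes",
        {
            "value": "Solaria-1",
            "intensity": 0.8,
            "pronunciations": ["Solar-ia", "Suh-lar-ee-uh"],
            "language": "en",
        },
        "kubernetes",   # duplicate with different casing, dropped
    ],
)
print(json.dumps(body, indent=2))
```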
Optimizing model training for better WER
Modern ASR systems use advanced architectures that can process broader utterance context than earlier systems. Both architecture and training data diversity are critical factors in determining production accuracy.
Diverse training data determines production accuracy: A model trained primarily on clean studio recordings may struggle when encountering real-world noise, compression artifacts, and non-native phoneme patterns. Building diverse training data is a significant undertaking, as documented in STT API benchmark analysis.
Accent recognition performance: Solaria-1 supports 100+ languages, including 42 that no other API-level STT competitor covers. Automatic language detection is available. Language identification accuracy can degrade on heavily accented non-native speech, which is why evaluating on audio representative of your actual users matters more than vendor-reported averages.
Boost STT accuracy in target domains: Custom Language Models (CLMs) use domain-specific text corpora to improve contextual disambiguation in specialized domains. For teams processing high-frequency domain audio (earnings calls, medical dictation, legal proceedings), domain-adapted approaches may help reduce OOV substitution rates.
Model size: accuracy and cost trade-offs: Published WER benchmarks for open-source models commonly report LibriSpeech test-clean, a dataset of read speech from audiobooks, not production conversational audio, where the same models consistently show higher WER. Self-hosting requires GPU infrastructure and MLOps engineering overhead, which is why many teams move to managed APIs.
Dissecting common transcription errors
Once you understand the four pillars, you can categorize production errors systematically rather than treating every transcript failure as a model deficiency.
Speech-to-text in noisy environments
Test explicitly on audio with varied Signal-to-Noise Ratios. Models that maintain consistent performance across different SNR conditions are more robust than those that excel only in clean audio.
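One way to produce test audio at controlled SNR levels is to mix a clean recording with a noise recording, scaling the noise so the speech-to-noise power ratio hits a target. The sketch below works on plain lists of float samples. The `mix_at_snr` function and the synthetic signals are assumptions made for illustration, and real pipelines would typically also guard against clipping.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale noise to the target SNR relative to speech, then add the two."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise is silent")
    # Required noise power: P_speech / 10^(SNR / 10)
    target_noise_power = mean_power(speech) / (10 ** (target_snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone as "speech", uniform random noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]

# Build one noisy variant per SNR level to run through each candidate model.
variants = {snr: mix_at_snr(speech, noise, snr) for snr in (20, 10, 5, 0)}
```

Transcribing each variant and plotting WER against SNR shows how quickly a given model degrades as conditions worsen.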
Code-switching transcription quality
Code-switching failures often occur on the segments where language changes happen, because the model may apply incorrect language models to the audio. Measuring this requires a golden dataset that includes language-switching segments with manually verified reference transcripts in both languages. Standard WER tools like JiWER let you compute WER per file and compare results across the dataset.
Reducing WER for specialized terms
Several interventions can help reduce domain-specific WER:
- Custom vocabulary injection: Pass domain-specific terms via the API parameter on every request. This targets named entities and jargon directly.
- Post-processing correction: Route the transcript through a domain-aware correction pipeline to catch substitution patterns.
- Fine-tuned language model: For very high-volume specialized domains (medical transcription, legal proceedings), domain-adapted models may reduce OOV substitution at inference time.
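The post-processing option can be as simple as a lookup of known mis-transcriptions, as in the sketch below. The correction table is entirely hypothetical. In practice you would populate it from substitution patterns observed in your own transcripts, and a blunt string replacement like this can overcorrect if a pattern also occurs legitimately.

```python
import re

# Hypothetical table: observed mis-transcription -> canonical domain term.
CORRECTIONS = {
    "cooper netties": "Kubernetes",
    "met forman": "Metformin",
}


def build_corrector(corrections: dict):
    """Return a function that replaces known mis-transcriptions in a transcript."""
    # Longest phrases first so multi-word patterns win over shorter ones.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in corrections.items()}

    def correct(text: str) -> str:
        return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

    return correct


correct = build_corrector(CORRECTIONS)
print(correct("We deployed on Cooper Netties yesterday"))
# We deployed on Kubernetes yesterday
```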
Solaria-1: head-to-head WER data
Gladia's async benchmark evaluates Solaria-1 against 8 providers across 7 datasets and 74+ hours of audio with reproducible methodology.
Solaria-1 delivers on average 29% lower WER on conversational speech and 3x lower DER compared to alternatives across the evaluated conditions.
How to measure production STT accuracy
Vendor-reported benchmarks give you a starting point. Your own production audio gives you the answer.
Actual WER vs. reported benchmarks
Clean-audio benchmarks show modern systems achieving low WER under ideal conditions. Production conversational audio in real contact centers or meeting recordings often shows higher WER from the same models when tested on uncurated conditions. That gap is the production risk you must measure with your own audio.
Assessing STT for your specific use cases
Build a golden dataset from your actual production audio, not synthetic recordings. Pull recordings that represent your actual use case: noise levels, accents, domain jargon, and conversation types. Produce reference transcripts for evaluation.
Building reliable STT benchmarks
Step 1: Gather a representative golden dataset. Collect audio files that match your production conditions: noise profile, speaker accents, domain jargon, and audio quality. Include files that reflect your actual traffic patterns. Produce reference transcripts for each file.
Step 2: Transcribe across candidate vendors. Submit identical audio files to each candidate API using consistent settings. Record processing time, latency, and output format for each. Use consistent feature configurations for fair comparison.
Step 3: Calculate WER and DER using open-source tools. Tools like the JiWER Python library can compute WER per file and average across the dataset. Review outlier files with significantly higher WER than the mean to understand where each model struggles. Consider cost-per-hour alongside accuracy.
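Step 3 can be sketched as a small aggregation routine: score each file, average across the dataset, and flag files whose WER sits well above the mean. The example below includes a compact pure-Python WER so it runs without dependencies. The function names, the one-standard-deviation outlier rule, and the sample data are illustrative assumptions. A library scorer such as JiWER's `wer(reference, hypothesis)` could be passed in through the `score` argument instead.

```python
from statistics import mean, pstdev


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def summarize(results: dict, score=word_error_rate, threshold_sd: float = 1.0) -> dict:
    """results maps file id -> (reference, hypothesis)."""
    per_file = {f: score(ref, hyp) for f, (ref, hyp) in results.items()}
    avg = mean(per_file.values())
    sd = pstdev(per_file.values())
    # Flag files more than threshold_sd standard deviations above the mean.
    outliers = sorted(f for f, w in per_file.items() if w > avg + threshold_sd * sd)
    return {"per_file": per_file, "mean_wer": avg, "outliers": outliers}


# Hypothetical golden-dataset results for one vendor.
results = {
    "call_001": ("thanks for calling", "thanks for calling"),
    "call_002": ("your order has shipped", "your order has shipped"),
    "call_003": ("please hold the line", "please hold the line"),
    "call_004": ("reset the kubernetes cluster", "reset the cooper cluster net"),
}
summary = summarize(results)
print(summary["mean_wer"], summary["outliers"])
```

Running the same routine once per vendor gives comparable per-file numbers, and the outlier list is the shortlist of recordings worth listening to by hand.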
STT evaluation checklist:
- Golden dataset covers your actual noise levels, not studio audio
- Test files include accented speech representative of your users
- Reference transcripts produced for evaluation
- Domain-specific terms appear in test files
- Code-switching segments included if your product serves multilingual users
- DER measured for multi-speaker audio
- Cost modeled with all features enabled
Start with Gladia's free tier and run Solaria-1 against your own golden dataset. Compare the results against our published benchmark methodology to see how your audio conditions stack up.
FAQs
What is an acceptable WER in production?
For clean, single-speaker audio like podcasts, low single-digit WER is achievable. For noisy multi-speaker call center audio, higher WER is often acceptable depending on downstream LLM robustness.
Which factor impacts STT accuracy the most?
Speaker overlap, non-native accents, and code-switching are significant drivers of WER degradation in production because they introduce acoustic patterns underrepresented in most training datasets. Background noise compounds these effects.
How much does custom vocabulary improve accuracy for domain-specific terms?
Implementing a custom vocabulary parameter with targeted domain-specific terms can reduce domain-specific entity error rates in production deployments. Gladia's intensity and pronunciation fields let you tune aggressiveness per term to avoid overcorrection on ambiguous phoneme sequences.
How long does it take to validate an STT integration?
Building and testing a representative golden dataset requires careful planning and verification of reference transcripts. Integrating a production-ready STT API like Gladia typically requires minimal developer time from first API call to a working pipeline, based on reported integration timelines from customers across the meeting assistant and CCaaS segments.
Key terms glossary
Word Error Rate (WER): The number of word-level edits (substitutions, deletions, insertions) needed to turn a transcript hypothesis into the reference, divided by the number of reference words and usually reported as a percentage. The standard ASR benchmark metric.
Diarization Error Rate (DER): Measures speaker attribution accuracy as the combined duration of false alarms, missed detections, and speaker confusions divided by the total duration of reference speech. Critical for multi-speaker transcription quality.
Code-switching: Mid-conversation language changes within a single utterance or across sentences. Common in multilingual contact centers and requires native model support to avoid WER degradation on the switching segments.
Real-Time Factor (RTF): Processing time divided by audio duration. RTF below 1.0 indicates faster-than-real-time processing. Gladia's processing time is approximately 60 seconds per 3,600 seconds (1 hour) of audio content, giving an RTF of approximately 0.0167.
Out-of-vocabulary (OOV): Words absent from the model's training data that cause phonetically similar substitution errors. Domain jargon and proper names are common OOV sources.
Signal-to-Noise Ratio (SNR): Ratio of speech signal power to background noise power, expressed in decibels. Lower SNR degrades WER because noise masks phonemes that fall near or below the noise floor.