When your offshore agents say "AuraSync Pro TX-900" and your transcription engine returns "aura sync protects 900," every downstream system works with wrong data. Your QA scorecard marks non-compliance on product identification, your CRM logs the wrong SKU, and your coaching summary references a product that doesn't exist. This isn't an LLM problem or a prompt engineering problem. Custom vocabulary dictionaries solve it at the source by using phoneme-similarity matching to guide transcription toward the correct output before the transcript reaches any downstream system.
Defining custom vocabulary for call transcripts
Mechanism of custom speech dictionaries
Commercial STT (speech-to-text) APIs typically run a language model during decoding to predict which words most likely follow each other in context. When a speaker says a proprietary term, a generic model may not have encountered that token sequence in training data, so it often picks an acoustically similar common phrase and outputs that instead. That is an out-of-vocabulary (OOV) failure.
Our custom vocabulary implementation converts both the transcribed output and your registered vocabulary entries into phonemes, then scores them for similarity. When phoneme similarity between a transcribed word and a vocabulary entry clears the configured intensity threshold, the engine replaces the output with your registered term. The intensity parameter controls how aggressively replacements fire: higher values widen the phoneme match window, while lower values require a closer acoustic fit before substitution occurs.
This approach differs from post-transcription find-and-replace rules. A find-and-replace rule can only match text already close to the target string, so if the model transcribes "AuraSync" as "aurora sink," the string replacement never fires because neither word appears in the rule. Phoneme-based matching can catch that failure because "aurora sink" and "AuraSync" may share a high phoneme similarity score even though the character strings diverge completely.
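The difference between string matching and sound-based matching can be sketched in a few lines. This is an illustration only, not the engine's actual algorithm: `crude_sound_key` is a hypothetical stand-in for a real phoneme conversion, and the similarity score comes from Python's standard `difflib` rather than from a phoneme scorer.

```python
import re
from difflib import SequenceMatcher


def crude_sound_key(text: str) -> str:
    """Rough stand-in for a phoneme sequence (assumption, not a real G2P step):
    keep letters only and collapse a few spellings that sound alike."""
    key = re.sub(r"[^a-z]", "", text.lower())
    return key.replace("c", "k").replace("y", "i")


def sound_similarity(heard: str, term: str) -> float:
    """Similarity in [0, 1] between the crude sound keys of two strings."""
    return SequenceMatcher(None, crude_sound_key(heard), crude_sound_key(term)).ratio()


transcript = "the customer asked about aurora sink"

# A literal find-and-replace rule never fires: its trigger text is absent.
assert transcript.replace("aura sync", "AuraSync") == transcript

# Sound-based comparison still ranks the mishearing close to the real term.
print(round(sound_similarity("aurora sink", "AuraSync"), 2))
print(round(sound_similarity("payment plan", "AuraSync"), 2))
```

The mishearing scores far higher than an unrelated phrase, which is the property a threshold such as the intensity parameter can then act on.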
Transcription errors in domain terminology
Generic speech-to-text models typically encounter domain-specific terms that fall outside their training vocabulary. The failure modes are predictable: alphanumeric identifiers often get split or misheard, brand names with unusual phonetics map to phonetically similar common words, and acronyms either spell out phonetically or get misheard as plausible-sounding words entirely.
The damage compounds as transcripts move downstream. When an LLM (large language model) receives a transcript containing those errors to generate a QA scorecard, it cannot reconstruct what was actually said at the acoustic layer and confidently scores the call based on the incorrect text. Your QA team then manually overrides the score, invalidating the automation you paid to build. Our guide on call transcription accuracy benchmarks covers how to measure this systematically across your contact center operation, and the factors affecting transcription accuracy piece explains the specific audio conditions that amplify these errors.
Solving transcription errors with custom vocabulary
Capturing specific product and SKU data
Alphanumeric SKUs frequently produce OOV failures in contact center audio. A model trained on general speech has no prior probability for patterns like "TX-900" as a cohesive unit, so decoding treats each character group separately and picks the phonetically nearest common word. Custom vocabulary forces the engine to treat the SKU as a registered phonetic pattern. Register "TX-900" with a pronunciation hint like "tee-ex-nine-hundred" and the phoneme matcher catches the acoustic pattern even when accent or speaking pace causes the base model to diverge initially. For contact centers handling returns, warranty claims, or technical support, the SKU drives your CRM (customer relationship management) record accuracy because it's the primary entity your downstream systems need to populate correctly.
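As a rough sketch of how pronunciation hints for alphanumeric SKUs could be generated in bulk, the helper below spells each character out the way an agent might read it aloud. The letter-name spellings and the `sku_entry` helper are assumptions made for illustration; only the `value` and `pronunciations` field names come from the entry format described in this article.

```python
# Hypothetical spoken forms for letters and digits (illustrative spellings).
LETTERS = dict(zip(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    ("ay bee see dee ee ef gee aitch eye jay kay el em en oh pee cue ar "
     "ess tee you vee double-you ex why zee").split(),
))
DIGITS = dict(zip("0123456789", "zero one two three four five six seven eight nine".split()))
SPOKEN = {**LETTERS, **DIGITS}


def spell_out(sku: str) -> str:
    """Read a SKU character by character, skipping hyphens and other symbols."""
    return " ".join(SPOKEN[ch] for ch in sku.upper() if ch in SPOKEN)


def sku_entry(sku: str, extra_pronunciations=()) -> dict:
    """Build a vocabulary entry with a spelled-out hint plus any hand-written variants."""
    return {"value": sku, "pronunciations": [spell_out(sku), *extra_pronunciations]}


print(sku_entry("TX-900", ["tee-ex-nine-hundred"]))
```

Generated hints are a starting point. Variants such as "nine hundred" for "900" still need to be added by hand, as in the example call.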
Custom vocabulary for brand entities
Your product intelligence and competitive monitoring dashboards need sub-brand and competitor mentions to transcribe accurately when agents reference them on calls. When brand terms transcribe incorrectly, they often become difficult to track in your analytics, reducing visibility into how often agents mention them or how customers respond.
The table below shows common failure patterns and the downstream QA impact without custom vocabulary active:
| What was said | Generic STT output | Custom-tuned STT output | Downstream QA impact |
| --- | --- | --- | --- |
| "AuraSync Pro TX-900" | "aurora sync pro text 900" | "AuraSync Pro TX-900" | Incorrect SKU recorded |
| "ACH payment" | "A C H payment" | "ACH payment" | Payment type unclear |
| "HIPAA authorization" | "HIPPA authorization" | "HIPAA authorization" | Regulatory term misspelled |
| "Qbii Technologies" | "cubie technologies" | "Qbii Technologies" | Brand name incorrect |
| "FCR (first call resolution) score of 87" | "F C R score of 87" | "FCR score of 87" | Metric formatting inconsistent |
Resolving abbreviation transcription errors
Industry-specific acronyms can produce transcription inconsistencies: models may spell them out phonetically or mishear the letters entirely. Both patterns can break automated compliance checks that look for specific keyword patterns.
For regulated industries, the impact reaches beyond accuracy. A call requiring the agent to confirm "HIPAA authorization" produces a transcript reading "he papered authorization." The automated compliance check finds no match for "HIPAA" and flags the call as non-compliant. Your QA team manually reviews it, confirms the agent handled the call correctly, and overrides the score, producing a systematic false positive stream that erodes your entire automated QA investment.
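The false positive described above is easy to reproduce with a toy keyword check. The sketch below assumes a hypothetical rule set (`REQUIRED_PHRASES`) and is not any specific QA product's logic. It only shows that a pattern-based check has no way to recover a term the transcript never contains.

```python
import re

# Hypothetical compliance rule: the transcript must contain this phrase.
REQUIRED_PHRASES = {
    "hipaa_authorization": re.compile(r"\bHIPAA authorization\b", re.IGNORECASE),
}


def compliance_flags(transcript: str) -> dict:
    """Return, per rule, whether the required phrase was found in the transcript."""
    return {name: bool(pattern.search(transcript)) for name, pattern in REQUIRED_PHRASES.items()}


mistranscribed = "before we continue I need your he papered authorization"
corrected = "before we continue I need your HIPAA authorization"

print(compliance_flags(mistranscribed))  # rule not satisfied, call gets flagged
print(compliance_flags(corrected))       # rule satisfied
```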
Custom vocabulary for agent phrases
BPO (business process outsourcing) operations add another layer of complexity. Offshore agents in the Philippines, India, or Latin America handle calls in English, but their pronunciation patterns for brand-specific terms often diverge from the accent profiles dominant in generic training data. We built Solaria-1 to handle accented speech across 100+ supported languages, benchmarked against 8 providers across 7 datasets and 74+ hours of audio with open and reproducible methodology. Adding custom vocabulary on top of that baseline gives your BPO operation a transcription layer that is both accent-robust and domain-aware.
"Gladia's transcriptions cater well to multilingual requirements, thus significantly aiding our customer support in a complex multilingual setup." - Pratik S. on G2
How to build a custom vocabulary list for your contact center
Step 1: Audit existing call transcripts
Start with your current transcripts, not a blank list. Pull a recent batch of calls and run them through a word frequency analysis against your known product catalog and internal lexicon. Look specifically for calls where QA analysts have manually overridden automated scores, because those overrides mark exactly where transcription failed downstream logic. Calls involving returns, warranty claims, pricing disputes, and compliance confirmations often contain high densities of domain-specific terms and make valuable audit sources.
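One possible shape for this audit, assuming transcripts are available as plain text with a flag marking QA overrides (the `transcript` and `qa_overridden` field names are hypothetical): count how often each catalog term appears in overridden calls. Terms that never appear there, even though agents are known to say them, are likely being mistranscribed.

```python
from collections import Counter


def catalog_hits_in_overridden_calls(calls, catalog_terms):
    """Count, per catalog term, the overridden calls whose transcript contains it."""
    hits = Counter({term: 0 for term in catalog_terms})
    for call in calls:
        if not call["qa_overridden"]:
            continue
        text = call["transcript"].lower()
        for term in catalog_terms:
            if term.lower() in text:
                hits[term] += 1
    return hits


calls = [
    {"transcript": "customer returned the aurora sink unit", "qa_overridden": True},
    {"transcript": "customer asked about AuraSync pricing", "qa_overridden": False},
    {"transcript": "warranty claim on the TX-900", "qa_overridden": True},
]
catalog = ["AuraSync", "TX-900"]

hits = catalog_hits_in_overridden_calls(calls, catalog)
suspects = [term for term in catalog if hits[term] == 0]
print(suspects)  # candidate terms to review by ear and add to the vocabulary
```

Substring matching is deliberately simple here. A real audit would pair it with listening to a sample of the flagged calls.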
Step 2: Collect product catalogs and SKUs
Export your active product list, SKU registry, and any competitor terms your agents are trained to track from your CRM or ERP (enterprise resource planning). Filter to active products only, since deprecated SKUs may add noise and increase false positive risk. Include sub-brand names, product line groupings, and any internal codenames that appear in scripted agent language. Refresh this export whenever a product launches, is discontinued, or gets rebranded so the dictionary reflects your current catalog from the first call after any change.
Step 3: Building your vocabulary library
For each term, determine whether the written form and the spoken form diverge enough that the base model needs guidance. The API accepts vocabulary entries as simple strings for straightforward terms, or as objects with value, pronunciations, intensity, and language fields for terms with language-specific pronunciation patterns. A brand name like "Salesforce" that agents occasionally say as "sell force" or "sale forces" may benefit from multiple pronunciation variants, giving the phoneme matcher a wider target to catch the acoustic signal correctly.
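A small helper can keep the library consistent by emitting a bare string when no guidance is needed and an object otherwise. This is a sketch: `vocab_entry` is a hypothetical convenience function, and only the field names (`value`, `pronunciations`, `intensity`, `language`) are taken from the entry format described above.

```python
def vocab_entry(value, pronunciations=None, intensity=None, language=None):
    """Return a bare string for simple terms, or an entry object when any
    optional field is supplied."""
    optional = {
        "pronunciations": pronunciations,
        "intensity": intensity,
        "language": language,
    }
    extras = {key: val for key, val in optional.items() if val is not None}
    return {"value": value, **extras} if extras else value


library = [
    vocab_entry("AuraSync"),
    vocab_entry("Salesforce", pronunciations=["sell force", "sale forces"]),
]
print(library)
```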
Step 4: Prioritize high-frequency terms
Resist adding every term from your product catalog at once. Entries with very low occurrence rates in real call audio may cause the matcher to fire on acoustically similar common words, producing false positives that can be harder to debug than the original OOV errors. Prioritize terms that appear frequently across your call volume plus any term critical for compliance checking regardless of frequency. Start with a focused set of high-priority entries, measure false positive rate against a manual call sample, and expand the list incrementally based on what the data shows.
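The prioritization rule can be expressed as a short filter: compliance-critical terms always make the list, and the remaining slots go to the most frequent terms above a minimum count. The thresholds (`min_count`, `max_entries`) are placeholders to tune against your own call data, not recommended values.

```python
def prioritize_terms(term_counts, compliance_terms, min_count=20, max_entries=50):
    """Compliance terms first (regardless of frequency), then frequent terms
    in descending order of observed mentions."""
    must_have = sorted(compliance_terms)
    frequent = [
        term
        for term, count in sorted(term_counts.items(), key=lambda item: item[1], reverse=True)
        if count >= min_count and term not in compliance_terms
    ]
    room = max(0, max_entries - len(must_have))
    return must_have + frequent[:room]


term_counts = {"AuraSync": 120, "TX-900": 45, "Qbii Technologies": 3}
compliance_terms = {"HIPAA authorization"}

print(prioritize_terms(term_counts, compliance_terms, min_count=20, max_entries=3))
```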
Managing your custom vocabulary for ongoing accuracy
Sync transcription terms with product
Your product catalog is not static, and your vocabulary dictionary can't be either. Establish a formal handoff between your product team and your transcription operations whenever new SKUs launch or brands change. A practical approach is including the transcription vocabulary update as a line item in your product launch checklist, alongside agent scripts and IVR prompt updates, so the dictionary reflects current products from the first call after launch.
Audit transcription accuracy quarterly
Set a regular review cadence where your QA team samples calls containing custom vocabulary terms and manually checks that the correct forms appear in the output. Measure two things: recall rate (target terms spoken and correctly transcribed) and false positive rate (common words incorrectly replaced by vocabulary entries). If recall is low, consider raising the default_intensity parameter. If false positives are high, consider lowering it or refining the pronunciations list for the offending entries. The custom vocabulary documentation provides guidance on tuning these parameters based on observed performance.
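Both metrics can be computed from a manually reviewed sample. The sketch below assumes each reviewed occurrence is recorded as a `(spoken_term, transcribed_term)` pair, with `None` where no vocabulary term was spoken or emitted. It also assumes "false positive rate" means the share of emitted vocabulary terms that were not actually spoken. Adjust the definition if your QA team measures it differently.

```python
def vocabulary_metrics(reviewed):
    """Return (recall, false_positive_rate) for a list of
    (spoken_term, transcribed_term) review pairs."""
    spoken_total = sum(1 for spoken, _ in reviewed if spoken is not None)
    caught = sum(1 for spoken, out in reviewed if spoken is not None and out == spoken)
    emitted_total = sum(1 for _, out in reviewed if out is not None)
    wrong = sum(1 for spoken, out in reviewed if out is not None and out != spoken)
    recall = caught / spoken_total if spoken_total else 0.0
    false_positive_rate = wrong / emitted_total if emitted_total else 0.0
    return recall, false_positive_rate


reviewed = [
    ("TX-900", "TX-900"),      # spoken and transcribed correctly
    ("TX-900", None),          # spoken but missed
    (None, "AuraSync"),        # common word wrongly replaced
    ("AuraSync", "AuraSync"),  # spoken and transcribed correctly
]
recall, fp_rate = vocabulary_metrics(reviewed)
print(f"recall={recall:.2f} false_positive_rate={fp_rate:.2f}")
```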
Aligning QA and training on vocabulary
Your QA team and agent training team need to operate from the same vocabulary master list. When agents are trained to say "FCR score" instead of "first call resolution score," your vocabulary list should reflect that scripted form so automated QA checks work correctly against the transcribed output.
For compliance-sensitive operations, review your vocabulary list to ensure it aligns with your data handling policies and applicable jurisdictional requirements.
When to use custom dictionaries over spelling rules
Transcribing complex product names accurately
We provide two distinct correction mechanisms: custom vocabulary and custom spelling. Understanding which to use for a given failure prevents over-engineering your dictionary and reduces false positives.
Custom spelling handles cases where the base model transcribes the word correctly at the phoneme level but formats it wrong. If the model outputs "salesforce" in lowercase when your CRM requires "Salesforce," custom spelling handles that formatting correction without phoneme comparison. If the model outputs "sale forces" because it genuinely misheard the acoustic signal, custom spelling cannot catch that because "Salesforce" never appears in the transcript to be reformatted. That is a vocabulary problem requiring the phoneme-matching approach.
When to prioritize custom dictionaries
Use custom vocabulary when the failure is acoustic: the engine outputs a different word entirely because it doesn't recognize the phonetic pattern as a valid token. Use custom spelling when the failure is orthographic: the engine recognizes the word but formats or capitalizes it incorrectly.
For example, "sell force" for "Salesforce" is an acoustic failure requiring vocabulary. "salesforce" in lowercase when you need "Salesforce" capitalized is a formatting failure requiring spelling correction. Mapping failures to the right mechanism before building your configuration saves significant debugging time in production.
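When triaging a backlog of failures, a first-pass heuristic can sort them into the two buckets automatically. The sketch assumes a failure is orthographic when the expected and transcribed forms differ only in letter case, and acoustic otherwise. Borderline cases, such as spacing differences, still need a human decision.

```python
def classify_failure(expected: str, transcribed: str) -> str:
    """Suggest which correction mechanism a mismatch calls for."""
    if transcribed == expected:
        return "ok"
    if transcribed.casefold() == expected.casefold():
        return "custom_spelling"    # right word, wrong formatting
    return "custom_vocabulary"      # different word entirely


print(classify_failure("Salesforce", "salesforce"))
print(classify_failure("Salesforce", "sell force"))
```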
Combining methods for peak accuracy
The most reliable production setup uses both mechanisms together. Register domain-specific terms and their acoustic variants in your vocabulary list with intensity tuned to each term's collision risk with common words. Use custom spelling to handle formatting consistency for terms the base model recognizes correctly but presents in the wrong form. Together, these two layers cover the full range of transcription errors without requiring model retraining.
Impact of custom dictionaries on transcription WER
Baseline WER without custom vocabulary
Word error rate (WER) measures the proportion of words in a transcript that differ from the ground truth, calculated as the sum of substitutions, deletions, and insertions divided by total reference words. On general conversational audio, a well-configured base model produces a stable WER you can plan around. The problem for contact centers is that OOV failures do not distribute evenly across the transcript; they concentrate on exactly the terms your downstream systems depend on most: product names, compliance phrases, alphanumeric SKUs, and brand-specific acronyms. A transcript that is 97% accurate by word count can still have the product SKU wrong on every call, because those few high-value tokens are the ones the base model is least likely to have encountered in training. That concentration effect means your effective error rate on QA-critical content is significantly worse than your headline WER suggests. Before applying custom vocabulary, that gap is invisible in aggregate metrics and only surfaces when QA analysts start logging manual overrides on product identification and compliance phrase checks.
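The gap between headline WER and term-level accuracy can be measured directly. The sketch below computes WER with a standard word-level edit distance and, separately, checks whether a target term survived into the hypothesis. The sample sentences are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "i am calling about my AuraSync unit that stopped charging"
hypothesis = "i am calling about my aurora sink unit that stopped charging"

print(wer(reference, hypothesis))   # headline WER for this utterance
print("AuraSync" in hypothesis)     # did the one term QA depends on survive?
```

Here the two-word mishearing costs one substitution plus one insertion against ten reference words, yet the only token the QA scorecard needs is missing entirely.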
Reducing WER with specific term lists
WER improves when errors on domain-specific tokens are eliminated. Every time the base model outputs "aurora sink" instead of "AuraSync," your WER calculation picks up two errors: one substitution and one insertion, because two output words stand in for one reference word. Register "AuraSync" with an appropriate pronunciation hint and the phoneme matcher intercepts the error before it reaches your transcript. The net effect is that your error count on those high-value tokens can drop sharply, which pulls down your domain-specific WER even when your headline WER across general conversational words stays roughly constant. The practical lever is your vocabulary list's coverage and intensity calibration.
Coverage sets the ceiling: a list that covers 90% of the domain-term occurrences in your calls can correct at most about 90% of the OOV substitutions driving your QA override queue. Intensity tuning controls precision: too high and you introduce false positive substitutions that add new errors; too low and you miss acoustically distorted variants and leave substitutions uncorrected. Running a vocabulary configuration against a manually reviewed call batch and measuring per-term recall before and after gives you a direct read on WER delta per entry, which tells you exactly which terms are pulling weight and which need pronunciation refinement. A financial services customer running high-volume call processing reported 98.5% numerical accuracy in production after combining base-model transcription with domain-specific vocabulary configuration for numeric identifiers and product codes. Verified production performance context is available on our CCaaS use case page.
Improving QA scorecard accuracy
When your QA automation scores agent compliance on product identification and your transcripts contain the correct product terms, false non-compliance flags drop and your coaching interventions target actual agent behavior rather than transcription errors. Agents stop getting coached on problems they don't have, QA analysts spend less time on manual override reviews, and your cost-per-contact reflects the real cost of your operation rather than the overhead of correcting systematic transcription failures.
The ROI runs directly through transcript quality. Contact centers that implement accurate transcription infrastructure can shift QA teams from manually reviewing calls to validating AI findings, a structural shift in QA economics that isn't achievable when your transcription layer produces systematic domain errors.
Deployment guide: adding domain terms to Gladia
Step 1: Define key domain vocabulary
Format your vocabulary entries before making the API call. Start with simple strings for terms where the written and spoken forms are close enough for the phoneme matcher to catch at default settings (for example, default_intensity: 0.4). For terms with a significant gap between written and spoken form, build out full entry objects with pronunciations arrays. Consider using bare strings for common brand names and full objects for proprietary SKUs and compliance phrases where pronunciation varies from the written form.
Step 2: Programmatic custom dictionary setup
Pass the custom_vocabulary parameter in an async transcription request:
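import requests

url = "https://api.gladia.io/v2/pre-recorded"
headers = {
    "x-gladia-key": "YOUR_GLADIA_API_KEY",
    "Content-Type": "application/json",
}
payload = {
    "audio_url": "https://your-storage.example.com/call-recording.wav",
    "custom_vocabulary": True,
    "custom_vocabulary_config": {
        "vocabulary": [
            # Bare string: written and spoken forms are close enough.
            "AuraSync",
            # Object with pronunciation hints for an alphanumeric SKU.
            {"value": "TX-900", "pronunciations": ["tee-ex-nine-hundred", "text ninehundred"]},
            # Full object with a per-entry intensity and language.
            {
                "value": "Qbii Technologies",
                "pronunciations": ["Q-Bee Technologies", "Q bee technology"],
                "intensity": 0.5,
                "language": "en",
            },
        ],
        # Used by every entry that does not set its own intensity.
        "default_intensity": 0.4,
    },
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # fail loudly on HTTP errors instead of printing an error body
print(response.json())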
The default_intensity applies to all entries without a per-entry override. Per-entry intensity lets you tune aggressiveness at the term level, which matters when a SKU phonetically overlaps with common words and needs a tighter threshold than your general vocabulary. Full parameter details are in the custom vocabulary docs.
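Because the request is asynchronous, the transcript has to be fetched once processing finishes. The polling loop below is a sketch under stated assumptions. It assumes the job status is exposed as a `status` field with `done` and `error` values, and that the text lives under `result.transcription.full_transcript`. Confirm those names against the API reference before relying on them. The `fetch_status` callable is injected so the loop can be exercised without network access.

```python
import time


def wait_for_transcript(fetch_status, poll_seconds=2.0, max_polls=150, sleep=time.sleep):
    """Poll until the job reports completion, then return the transcript text.

    fetch_status: zero-argument callable returning the job's JSON body as a dict.
    Field names used below are assumptions about the response shape.
    """
    for _ in range(max_polls):
        body = fetch_status()
        status = body.get("status")
        if status == "done":
            return body["result"]["transcription"]["full_transcript"]
        if status == "error":
            raise RuntimeError(f"transcription failed: {body}")
        sleep(poll_seconds)
    raise TimeoutError("transcription did not finish within the polling budget")


# Offline demonstration with canned responses standing in for HTTP calls.
canned = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "done", "result": {"transcription": {"full_transcript": "AuraSync Pro TX-900"}}},
])
print(wait_for_transcript(lambda: next(canned), sleep=lambda seconds: None))
```

In production, `fetch_status` would wrap an authenticated GET against the result location returned by the initial request, for example `lambda: requests.get(result_url, headers=headers).json()`, where `result_url` is assumed to come from the POST response.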
Step 3: Test vocabulary against live calls
Before deploying to production, run your new vocabulary configuration against a test batch of calls you have already manually reviewed. Compare output transcripts against your baseline and look for improvements in target term accuracy as well as any new false positives where common words were incorrectly replaced. Review any entry showing a high false positive rate before scaling to full call volume.
Step 4: Live transcription integration
Once your test batch clears your quality threshold, deploy to production by adding the custom_vocabulary_config block to your standard async transcription request. Our infrastructure is designed to scale from test volumes to production volumes. Aircall processes over 1 million calls per week through Gladia, and multiple customers report sub-24-hour integration timelines from API connection to production deployment.
Test your custom vocabulary configuration on your own call audio to see how Solaria-1 handles your product names, BPO accents, and domain-specific terms, with async processing at $0.20/hr on Growth and volume pricing on Enterprise.
FAQs
What is an out-of-vocabulary (OOV) failure in contact center transcription?
An OOV failure occurs when a speech-to-text model encounters a word it has no trained probability for, so it typically outputs a phonetically similar common word instead. In contact centers, OOV failures commonly affect proprietary product names, brand-specific acronyms, and alphanumeric SKUs.
How many custom vocabulary entries does Gladia support?
We support custom vocabulary across plans. Contact sales for specific entry limits on Enterprise. Other providers vary: some cap vocabulary files or entries, while others allow thousands of phrases per job depending on language.
Does adding custom vocabulary affect Gladia's transcription latency?
It is designed not to. Our custom vocabulary implementation is built to minimize latency impact on both real-time and async processing throughput.
What is the difference between custom vocabulary and custom spelling in Gladia?
Custom vocabulary uses phoneme-similarity matching to catch terms the model acoustically misheard and output a different word entirely. Custom spelling handles orthographic corrections where the model recognized the word correctly but formatted or capitalized it incorrectly.
What intensity value should I start with for a new vocabulary entry?
Start with a moderate intensity value and adjust based on observed performance. If target terms are still missed, consider raising the value. If unrelated common words are being incorrectly replaced, consider lowering intensity for affected entries or refining the pronunciations array. The custom vocabulary documentation provides specific guidance on tuning.
Do Growth and Enterprise plans use customer audio to train Gladia's models?
No. On Growth and Enterprise plans, customer audio is never used for model training and no opt-out action is required. On the Starter plan, customer data can be used for model training by default.
Key terms glossary
Out-of-vocabulary (OOV) error: A transcription failure where the STT model substitutes a phonetically similar common word because it has no trained probability for the target token. OOV errors concentrate on proprietary product names, alphanumeric SKUs, and brand terms.
Word error rate (WER): The percentage of words in a transcript that differ from the ground truth, calculated as the sum of substitutions, deletions, and insertions divided by total reference words. Lower WER means more accurate transcription.
Phoneme: The smallest unit of sound in a language that distinguishes meaning. Custom vocabulary systems use phoneme comparison to match acoustically similar strings to registered vocabulary entries.
Custom vocabulary intensity: A parameter that controls how closely the phoneme pattern of a transcribed word must match a vocabulary entry before substitution fires. Higher values increase recall at the cost of precision, meaning more target terms get caught but false positive risk rises.
False positive (in vocabulary matching): An incorrect substitution where a common word is replaced by a vocabulary entry because their phoneme patterns exceed the intensity threshold. Controlled by lowering intensity or tightening the pronunciations array for the affected entry.