When a generic speech-to-text model mishears "Salesforce CPQ" as "sales for SCPQ" or drops a proprietary product code entirely, the damage compounds fast. The CRM entry is wrong, the automated QA scorecard misses the product mention, and the LLM generating the post-call summary classifies the interaction as unresolved because the product name never appeared in the transcript. None of this fails loudly, and by the time a QA analyst catches it during manual sampling, the error has already propagated through dozens of calls.
This guide covers the API setup, list curation strategy, and measurement approach that contact center teams need to make custom vocabulary work in production.
When to use custom vocabulary in production
Solving transcription gaps for technical terms
Generic speech-to-text models can struggle under contact center conditions. The typical failure is acoustic confusion: the model maps an unfamiliar sound sequence to the closest phoneme chain it knows and produces plausible but wrong output. When that happens, intent routing fails at the transcription layer, before the utterance ever reaches the natural language understanding (NLU) layer. QA automation is only as reliable as the transcription data feeding it, as call transcription accuracy benchmarks show. Custom vocabulary can help close the gap on domain-specific terms.
Mapping CRM entities for better accuracy
Custom vocabulary acts as a direct bridge between raw audio and structured CRM data. When an agent mentions a specific product tier, contract term, or internal team name, the STT layer needs to surface that exact string for downstream named entity recognition and LLM workflows to operate correctly. If the transcription misses it, your CRM gets populated with either a blank field or a hallucinated substitute.
Every downstream output is ceiling-bounded by STT accuracy. A wrong CRM entry, a wrong translation, and a wrong coaching score all trace back to a single transcription error at the first layer. For operations running automated QA on 100% of interactions, fixing errors at the vocabulary level is far cheaper than catching them downstream.
Vocabulary vs. spelling: choosing the right fix
Custom vocabulary and custom spelling solve different problems, and conflating them leads to misconfigured pipelines. Some providers use the terms interchangeably, but we treat them as distinct tools that work best in combination.
| Feature | Custom vocabulary (phoneme-based) | Custom spelling (string replacement) |
| --- | --- | --- |
| Primary mechanism | Adjusts decoding behavior to favor target terms during inference | Performs literal text string replacement on the final transcript |
| Best use case | Brand names, technical terms, acronyms | Correcting consistent misspellings, standardizing abbreviations |
| Coverage of phonetic variants | High (can catch acoustic variants of similar sounds) | Limited (requires explicit listing) |
| Provider support | Gladia (native), AWS Transcribe, Rev.ai | Gladia (native via custom_spelling_config), some others |
Custom spelling does not phonemize anything. It finds exact text matches and replaces them, which makes it safe from false positives but means it requires explicit listing of variants. Custom vocabulary can catch phonetic variants because similar sounds may map to the target output. For a contact center term like "XB-7 Pro," you want custom vocabulary handling phoneme matching and custom spelling normalizing the output format.
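To make the string-replacement side concrete, here is a minimal sketch of what literal replacement does to a finished transcript. The `apply_custom_spelling` helper and the variant list are hypothetical and only illustrate the concept; they are not Gladia's implementation of custom_spelling_config.

```python
import re

def apply_custom_spelling(transcript, spelling_map):
    """Replace each explicitly listed variant with its canonical spelling.

    Matching is whole-word and case-insensitive. Variants that are not
    listed are left untouched, which is the limitation described above.
    """
    for canonical, variants in spelling_map.items():
        for variant in variants:
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            transcript = re.sub(
                pattern, lambda _match: canonical, transcript, flags=re.IGNORECASE
            )
    return transcript

# Hypothetical variants observed in transcripts for one product name
spelling_map = {"XB-7 Pro": ["xb7 pro", "XB seven pro"]}

raw = "The customer upgraded to the xb7 pro, not the ex bee seven pro."
cleaned = apply_custom_spelling(raw, spelling_map)
print(cleaned)
```

The second mention survives unchanged because "ex bee seven pro" was never listed, which is the kind of phonetic variant that custom vocabulary is meant to catch at decoding time.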
For true homophones (words that are pronounced identically, like "write" and "right"), the challenge is semantic disambiguation rather than phonetic differentiation, since both forms share the same phoneme pattern. Custom vocabulary biases the decoder toward specific phoneme sequences but cannot resolve which spelling the speaker intended when multiple valid options exist. Add the target term to your vocabulary list and use custom spelling for any downstream orthographic cleanup needed.
API integration for custom term sets
Secure API access for custom terms
One operational constraint applies across all providers before you build your term list: as a best practice, avoid entering personally identifiable information (PII) or protected health information (PHI) into custom vocabulary files. Use anonymized tokens for any terms that might intersect with customer identifiers, then post-process results to reinsert sensitive values from a secure vault if needed.
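One way to enforce this before a list ever reaches the API is a pre-submission screen. The sketch below is a crude heuristic, assuming that email addresses and long digit runs are the identifiers most likely to leak into a term list; the `screen_terms` helper and both patterns are hypothetical and are not a substitute for a proper compliance review.

```python
import re

# Heuristic patterns (assumptions): email addresses and runs of 7+ digits,
# optionally separated by spaces, dots, or hyphens (phone or account numbers).
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
LONG_DIGITS = re.compile(r"(?:\d[\s.-]?){7,}")

def screen_terms(terms):
    """Split candidate terms into (safe, flagged) lists."""
    safe, flagged = [], []
    for term in terms:
        if EMAIL.search(term) or LONG_DIGITS.search(term):
            flagged.append(term)
        else:
            safe.append(term)
    return safe, flagged

candidates = [
    "Salesforce CPQ",
    "XB-7 Pro",
    "jane.doe@example.com",
    "Account 555-201-8890",
]
safe_terms, flagged_terms = screen_terms(candidates)
print(safe_terms)
print(flagged_terms)
```

Flagged entries would go to manual review or be swapped for anonymized tokens rather than being submitted as-is.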
Optimizing custom term lists for scale
The choice between runtime vocabulary lists and full model fine-tuning comes down to deployment speed, cost, and the nature of the vocabulary problem.
| Approach | Development effort | Accuracy lift on jargon | Cost | Best for |
| --- | --- | --- | --- | --- |
| Runtime lists (Gladia, AWS) | Fast to deploy | Can improve targeted term recognition | Included in base rate (Gladia) | Dynamic vocabularies that change with product releases |
| Model fine-tuning | Requires labeled domain audio | Can improve acoustic environment adaptation | Significant (infrastructure and engineering time) | Fixed-domain, stable vocabulary with unique acoustic conditions |
Runtime vocabulary injection biases the decoder's language model at inference time without retraining weights. This is why our async pipeline maintains fast processing speeds even with a custom term list loaded.
Teams already using OpenAI's stack frequently evaluate the Whisper API as a candidate for domain vocabulary work, so it is worth addressing directly. The Whisper API has no native word-boosting system; it offers only a prompt parameter limited to 224 tokens of spelling hints. That ceiling is insufficient for domain vocabularies of any meaningful size. Whisper can also hallucinate on telephony audio, a known failure mode on 8 kHz call recordings that Solaria-1 is specifically designed to avoid in production environments.
Boosting recall for specialized terminology
The custom_vocabulary parameter in Gladia's API accepts the following object structure for each term:
{
  "audio_url": "https://your-storage.example.com/call-recording.wav",
  "custom_vocabulary": [
    "XB-7 Pro",
    { "value": "Salesforce CPQ" },
    {
      "value": "Westeros",
      "pronunciations": ["Wes-teh-ros"],
      "intensity": 0.5,
      "language": "en"
    }
  ]
}
The value field holds the correctly spelled target term. The intensity property controls how strongly the decoder biases toward that term during the acoustic search, and the pronunciations array accepts phonetic spelling hints. Review the full parameter reference in our custom vocabulary docs before configuring intensity values at scale, as setting intensity too high can cause the model to surface a target term when the audio contains a phonetically similar but different word.
A Python implementation using the standard requests library follows this pattern:
import requests

# API key header and JSON content type for the request body
headers = {
    "x-gladia-key": "YOUR_API_KEY",
    "Content-Type": "application/json",
}

# Terms can be plain strings or objects carrying optional hints
payload = {
    "audio_url": "https://your-storage.example.com/call.wav",
    "diarization": True,
    "custom_vocabulary": [
        "XB-7 Pro",
        {"value": "Salesforce CPQ"},
        {
            "value": "TurboTax Advantage",
            "pronunciations": ["Turbo Tax Advantidge"],
            "intensity": 0.5,
        },
    ],
}

response = requests.post(
    "https://api.gladia.io/v2/transcription",
    headers=headers,
    json=payload,
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors instead of printing an error body
print(response.json())
Pronunciation hints for ambiguous terms
The pronunciations array handles terms where standard orthography does not reflect how a word sounds in conversational telephony audio. For a brand name like "Xiaomi" spoken by a bilingual agent, passing a phonetic approximation like "Shao-mee" helps the decoder align the incoming acoustic pattern to the target output. Stack multiple pronunciation variants in the array when the same term is used by agents with different accent backgrounds, which is common in BPO environments where a single queue serves agents across the Philippines, India, and Mexico. For multilingual term sets, review our code-switching documentation for handling mid-sentence language transitions, which compounds the pronunciation challenge.
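As a rough illustration of stacking variants, the sketch below builds one vocabulary entry with several pronunciation hints, using the object fields shown earlier (value, pronunciations, intensity, language). The `vocab_entry` helper is hypothetical, and the phonetic spellings other than "Shao-mee" are made-up examples rather than recommended hints.

```python
import json

def vocab_entry(value, pronunciations=(), intensity=None, language=None):
    """Build one custom_vocabulary entry, dropping duplicate hints and unset fields."""
    entry = {"value": value}
    # dict.fromkeys keeps first-seen order while removing duplicates
    hints = list(dict.fromkeys(p.strip() for p in pronunciations if p.strip()))
    if hints:
        entry["pronunciations"] = hints
    if intensity is not None:
        entry["intensity"] = intensity
    if language is not None:
        entry["language"] = language
    return entry

# One hint per accent group observed in the queue (illustrative values)
xiaomi = vocab_entry(
    "Xiaomi",
    pronunciations=["Shao-mee", "Zhao-mee", "Shao-mee", "Zai-oh-mee"],
    intensity=0.5,
    language="en",
)
print(json.dumps(xiaomi, indent=2))
```

Collecting hints from QA reviewers per region and deduplicating them in one place keeps the entry short while still covering each accent group.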
Curating custom lists from existing call logs
Automating lists from CRM and SKU data
The fastest way to build an initial vocabulary list is to pull it directly from your CRM's product catalog and SKU tables. Most CCaaS platforms expose a REST endpoint or database view of active product names, and a simple ETL job can extract, deduplicate, and format these as a JSON array. Include:
- Active product names and version numbers
- Compliance disclosure phrases that must be transcribed verbatim
- Frequently mentioned partner integrations (for example, Salesforce, Zendesk, HubSpot)
- Competitor names that agents are trained to mention by policy
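The extract, deduplicate, and format step can be sketched in a few lines. This assumes the catalog has already been fetched as a list of rows with `name` and `active` fields; those field names and the `build_vocabulary` helper are hypothetical and will differ per CRM.

```python
import json

def build_vocabulary(product_rows, extra_terms=()):
    """Collect active product names plus extra terms, normalized and deduplicated."""
    candidates = [row["name"] for row in product_rows if row.get("active")]
    candidates.extend(extra_terms)
    seen, terms = set(), []
    for raw in candidates:
        term = " ".join(raw.split())  # collapse stray whitespace from the export
        key = term.casefold()         # case-insensitive duplicate check
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms

catalog = [
    {"name": "XB-7 Pro", "active": True},
    {"name": "xb-7  pro", "active": True},       # duplicate with messy formatting
    {"name": "Legacy Plan Z", "active": False},  # retired SKU, skipped
    {"name": "Salesforce CPQ", "active": True},
]
vocabulary = build_vocabulary(catalog, extra_terms=["Zendesk", "HubSpot"])
print(json.dumps(vocabulary))
```

The first spelling encountered wins, so ordering the catalog with the canonical form first avoids shipping a lowercase variant to the API.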
How to isolate transcription errors
Mining existing transcripts for acoustic confusion patterns extends your vocabulary list beyond the initial CRM pull. The process follows three steps:
- Collect a sample: Pull a representative set of transcripts from recent calls covering your highest-volume product lines and key operational regions.
- Run a diff against ground truth: If you have human-verified transcripts from QA sampling, compute word error rate on domain-specific terms rather than the full transcript. Pay attention to substitution errors, where the model produced a real word instead of the target, since these are harder to catch in downstream analysis.
- Cluster by error pattern: Group errors by the target term being misrecognized. "SaaS" transcribed as "sass" or "sauce" is an example of phoneme-level confusion that custom vocabulary resolves in one entry. As our custom vocabulary vs. custom spelling guide notes, the teams producing the cleanest transcripts are the ones who diagnose each error before configuring anything, rather than dumping their entire product catalog into the API.
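The diff and cluster steps can be sketched with the standard library. This is a simplified version that assumes single-word target terms, whitespace tokenization, and transcripts without punctuation; the `cluster_confusions` helper and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def cluster_confusions(pairs, target_terms):
    """Group what the model produced in place of each single-word target term."""
    targets = {term.lower() for term in target_terms}
    clusters = defaultdict(Counter)
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "replace":
                continue  # keep substitutions only: a real word in place of the target
            heard_as = " ".join(hyp_words[j1:j2])
            for word in ref_words[i1:i2]:
                if word in targets:
                    clusters[word][heard_as] += 1
    return clusters

# (human-verified reference, model output) pairs from QA sampling
pairs = [
    ("our saas plan renews monthly", "our sass plan renews monthly"),
    ("the saas tier is cheaper", "the sauce tier is cheaper"),
    ("upgrade your saas seat", "upgrade your sass seat"),
]
clusters = cluster_confusions(pairs, ["SaaS"])
print(clusters["saas"].most_common())
```

Each cluster shows which wrong outputs a single vocabulary entry would need to displace, which is a useful input when deciding whether a pronunciation hint is needed.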
Selecting terms for maximum impact
Longer vocabulary lists are not better. Additional terms that do not appear in the audio can degrade word error rate because the decoder expands its search space unnecessarily. Keep lists short and targeted for optimal performance.
Apply a simple prioritization filter before submitting your list:
- Frequency: Prioritize terms that appear regularly in your calls over rare terms that have minimal impact on overall accuracy.
- Error rate on the base model: Test Solaria-1 against your sample set first. Remove any terms the model already transcribes correctly across your audio conditions.
- Downstream impact: Prioritize terms that feed CRM fields, QA scorecard categories, or compliance disclosure tracking. A missed compliance phrase in a regulated disclosure is a liability, not just an accuracy gap.
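The three filters above can be combined into one pass. The sketch below is one possible ordering, not a prescribed formula: the `TermStats` fields, the `min_mentions` threshold, and the `max_terms` cap are all assumptions to tune against your own data.

```python
from dataclasses import dataclass

@dataclass
class TermStats:
    term: str
    mentions: int           # occurrences in the reference transcripts
    base_errors: int        # occurrences the base model got wrong
    feeds_downstream: bool  # populates a CRM field, QA category, or compliance check

def prioritize(stats, min_mentions=5, max_terms=50):
    """Keep frequent terms the base model misses, downstream-critical ones first."""
    candidates = [
        s for s in stats
        if s.mentions >= min_mentions and s.base_errors > 0
    ]
    candidates.sort(key=lambda s: (s.feeds_downstream, s.base_errors), reverse=True)
    return [s.term for s in candidates[:max_terms]]

stats = [
    TermStats("Salesforce CPQ", mentions=120, base_errors=45, feeds_downstream=True),
    TermStats("HubSpot", mentions=300, base_errors=0, feeds_downstream=True),
    TermStats("XB-7 Pro", mentions=80, base_errors=60, feeds_downstream=False),
    TermStats("Legacy Plan Z", mentions=2, base_errors=2, feeds_downstream=True),
]
shortlist = prioritize(stats)
print(shortlist)
```

Here "HubSpot" drops out because the base model already handles it, and "Legacy Plan Z" drops out because it is too rare to move overall accuracy.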
Optimizing your vocabulary strategy for evolving CX needs
Version control for vocabulary lists
Product names change, SKUs retire, and new service tiers launch on quarterly cycles. Many teams manage vocabulary lists using version control and standard deployment practices to track changes and enable rollbacks if a new term set introduces a regression.
Every vocabulary update should be tested against a golden test set before deployment. The test set is a small corpus of annotated transcripts covering your highest-priority term categories. After adding new terms, run the updated list against the test set and compare keyword WER on existing terms before and after. If accuracy on existing terms drops, a new entry is likely phonetically similar to an established term and needs a pronunciation hint to distinguish it. Our async pipeline maintains fast processing speeds, so you can iterate through test set runs quickly without blocking a release.
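The before-and-after comparison reduces to a small check that can gate a deployment. The sketch assumes you already have a per-term error rate for the old and new lists on the same golden set; the `find_regressions` helper and the sample rates are illustrative only.

```python
def find_regressions(before, after, tolerance=0.0):
    """Return existing terms whose error rate rose by more than `tolerance`.

    `before` and `after` map each term to its error rate on the golden set.
    Every term in `before` must also be present in `after`.
    """
    regressions = {}
    for term, old_rate in before.items():
        new_rate = after[term]
        if new_rate - old_rate > tolerance:
            regressions[term] = (old_rate, new_rate)
    return regressions

# Illustrative error rates per existing term, before and after adding new entries
before = {"Salesforce CPQ": 0.10, "XB-7 Pro": 0.05, "HubSpot": 0.0}
after = {"Salesforce CPQ": 0.02, "XB-7 Pro": 0.20, "HubSpot": 0.0}

regressions = find_regressions(before, after)
if regressions:
    print("Blocked: existing terms regressed:", regressions)
```

A non-empty result points at the established terms to inspect for phonetic overlap with the new entries before the list ships.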
Eliminating redundant model terms
Solaria-1 handles a range of common technical terms correctly out of the box. Before adding a term to your custom vocabulary list, test the base model against your sample audio. Any term achieving 100% accuracy across your test conditions can be removed from the list, keeping your API payloads lean and reducing the probability of interference between adjacent vocabulary entries. This pruning habit is especially important for multilingual contact center operations where the same call queue handles speakers across 5-10 languages, each with a different baseline accuracy profile.
Measuring custom vocabulary impact on WER
Baseline WER on telephony calls without custom vocabulary
Establish a baseline before enabling custom vocabulary to quantify the improvement and build a credible ROI case for leadership. Test against your actual production environment: 8kHz telephony audio, realistic background noise, and a representative sample of agent accents from your BPO regions. Global WER captures aggregate performance but may mask domain-specific failures. A model achieving high global word accuracy can still transcribe product SKUs incorrectly if those SKUs are phonetically ambiguous. As documented in call transcription accuracy benchmarks, keyword-level accuracy on the specific entities your downstream systems depend on is the metric that matters for QA automation and CRM data quality.
Quantifying KER lift in production
Global WER gives you the aggregate picture; keyword error rate (KER) measures whether vocabulary configuration is actually working on the terms that matter. It is a direct, domain-specific metric rather than a diluted aggregate. The formula is:
KER = (F + M) / N
Where N is the number of target keywords in the reference data, F is falsely recognized keywords, and M is missed keywords. To compute your lift:
- Transcribe your baseline sample without custom vocabulary and calculate KER for your target term set.
- Enable custom vocabulary with your curated list, transcribe the same sample, and recalculate KER.
- Express the delta as a percentage reduction in KER.
A reduction in KER on product names translates directly to fewer manual QA corrections on CRM entries, fewer misclassified interactions in your sentiment pipeline, and improved confidence in automated coaching scores.
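The calculation can be sketched as follows. This version is a count-based approximation: it compares how often each keyword occurs in the reference and in the model output per call, without aligning positions. The helper names and sample transcripts are hypothetical.

```python
import re

def count_term(text, term):
    """Count whole-term, case-insensitive occurrences of `term` in `text`."""
    pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
    return len(re.findall(pattern, text.lower()))

def keyword_error_rate(pairs, keywords):
    """KER = (F + M) / N over (reference, hypothesis) transcript pairs."""
    n = f = m = 0
    for reference, hypothesis in pairs:
        for keyword in keywords:
            ref_count = count_term(reference, keyword)
            hyp_count = count_term(hypothesis, keyword)
            n += ref_count                      # N: keywords in the reference
            m += max(ref_count - hyp_count, 0)  # M: missed keywords
            f += max(hyp_count - ref_count, 0)  # F: falsely recognized keywords
    if n == 0:
        raise ValueError("no target keywords found in the reference transcripts")
    return (f + m) / n

keywords = ["Salesforce CPQ", "XB-7 Pro"]
references = [
    "I use Salesforce CPQ with the XB-7 Pro",
    "the XB-7 Pro bundle includes Salesforce CPQ",
]
baseline_output = [
    "I use sales for SCPQ with the XB-7 Pro",
    "the XB seven pro bundle includes Salesforce CPQ",
]
boosted_output = [
    "I use Salesforce CPQ with the XB-7 Pro",
    "the XB seven pro bundle includes Salesforce CPQ",
]

baseline_ker = keyword_error_rate(list(zip(references, baseline_output)), keywords)
boosted_ker = keyword_error_rate(list(zip(references, boosted_output)), keywords)
reduction_pct = (baseline_ker - boosted_ker) / baseline_ker * 100
print(baseline_ker, boosted_ker, reduction_pct)
```

In this toy sample the baseline misses two of four keyword mentions and the boosted run misses one, so KER falls from 0.5 to 0.25, a 50% relative reduction.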
Isolating domain terms for WER gains
B-WER measures accuracy on your biased entity set specifically, while U-WER measures accuracy on all other words. These metrics can help you understand whether your custom vocabulary is improving recognition of target terms without degrading performance on general vocabulary.
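A minimal sketch of the split follows, assuming whitespace tokenization, single-word biased entries, and one common attribution convention: substitutions and deletions are charged to the class of the reference word, and insertions to the class of the inserted word. The helper names are hypothetical, and other scoring tools may attribute errors differently.

```python
def align(ref, hyp):
    """Levenshtein-align two word lists; return (op, ref_word, hyp_word) tuples."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    ops, i, j = [], len(ref), len(hyp)
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

def biased_unbiased_wer(reference, hypothesis, biased_words):
    """Return (B-WER, U-WER) for one reference/hypothesis pair."""
    biased = {w.lower() for w in biased_words}
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    totals = {True: 0, False: 0}
    errors = {True: 0, False: 0}
    for word in ref:
        totals[word in biased] += 1
    for op, ref_word, hyp_word in align(ref, hyp):
        if op == "ok":
            continue
        word = hyp_word if op == "ins" else ref_word
        errors[word in biased] += 1
    b_wer = errors[True] / totals[True] if totals[True] else 0.0
    u_wer = errors[False] / totals[False] if totals[False] else 0.0
    return b_wer, u_wer

b_wer, u_wer = biased_unbiased_wer(
    "please add salesforce cpq to the quote",
    "please add sales force cpq to the quote",
    biased_words=["Salesforce", "CPQ"],
)
print(b_wer, u_wer)
```

Tracking both numbers across vocabulary updates shows whether a gain on target terms is being paid for with new errors on ordinary words.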
Reporting transcription accuracy lift
Translate KER improvements into operational outcomes your QA Director and VP of Operations already track: calculate aggregate QA hours saved from reduced manual correction time, track the drop in blank or mismatched CRM product fields, and measure improved compliance phrase detection rates for regulated disclosures. These metrics produce defensible data for your next vendor renewal or technology audit.
Preventing transcription errors in production
Managing vocabulary list constraints
Each provider imposes different technical limits on vocabulary configurations. Staying within those limits prevents silent degradation where the API accepts the list but truncates it at runtime.
| Provider | Max list size (English) | Max list size (other languages) | Max entry length |
| --- | --- | --- | --- |
| Gladia (Solaria-1) | Not explicitly capped; optimize for relevance | Not explicitly capped; optimize for relevance | Not explicitly stated; keep entries concise |
| AWS Transcribe | Consult provider documentation | Consult provider documentation | Consult provider documentation |
| Rev.ai | Consult provider documentation | Consult provider documentation | Consult provider documentation |
Check provider documentation for the latest limits and recommendations.
Calibrating your boost parameters
The intensity parameter controls how aggressively the decoder searches for a term. Setting intensity too high causes the model to surface the target term when the audio contains a phonetically similar but different word. Setting it too low reduces recall. The default intensity is 0.5. Test against your golden set and adjust based on whether terms are being missed or over-triggered.
Fixing errors in homophone transcription
For words with identical pronunciation but different spellings ("write" vs. "right"), the challenge is semantic disambiguation, since the phoneme patterns are the same. Custom vocabulary can bias the decoder toward a listed form, but it cannot determine which spelling the speaker intended when both are valid, so custom spelling handles any residual orthographic standardization on the output. Use the two tools in sequence: vocabulary correction for phoneme-level handling, spelling correction for orthographic cleanup.
Troubleshooting common vocabulary setup issues
Setting custom vocabulary size limits
If your initial list exceeds recommended limits, split it by logical domain: one list per major product family, per call queue, or per language group for multilingual operations. For multi-tenant CCaaS platforms, maintain a vocabulary registry keyed by client or queue ID and load the relevant list at API call time. This segmentation simplifies the pruning and update cycle since each segment corresponds to a discrete operational area with its own review cadence.
Impact of custom terms on latency
We inject custom vocabulary at the decoder level during inference without retraining model weights, so overhead is minimal. Solaria-1 delivers partial transcription in under 103ms, and processing remains fast even with custom vocabulary enabled. For 10-minute call recordings, the vocabulary payload adds negligible time to overall job duration.
Can I use multiple vocabulary lists per account?
Yes. You can pass a different custom_vocabulary array per API call by constructing the request payload dynamically. For multi-tenant CCaaS platforms, maintain a vocabulary registry keyed by client or queue ID, then load the relevant list at API call time. This is consistent with our CRM integration guide, where per-call configuration is assembled dynamically from a centralized rules store.
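A registry lookup at request time might look like the sketch below. The registry contents, queue IDs, and the `build_payload` helper are hypothetical; the payload fields follow the request structure shown earlier in this guide.

```python
# Hypothetical in-memory registry; in production this would live in a
# centralized rules store and be loaded per client or queue.
VOCAB_REGISTRY = {
    "billing-en": ["XB-7 Pro", {"value": "Salesforce CPQ"}],
    "devices-en": [
        {"value": "Xiaomi", "pronunciations": ["Shao-mee"], "intensity": 0.5},
    ],
}

def build_payload(audio_url, queue_id, registry=VOCAB_REGISTRY):
    """Assemble a request payload with the vocabulary list for one queue."""
    payload = {"audio_url": audio_url}
    vocabulary = registry.get(queue_id, [])
    if vocabulary:
        # copy so per-call tweaks never mutate the shared registry entry
        payload["custom_vocabulary"] = list(vocabulary)
    return payload

billing_payload = build_payload(
    "https://your-storage.example.com/call.wav", "billing-en"
)
unknown_payload = build_payload(
    "https://your-storage.example.com/call.wav", "unmapped-queue"
)
print(billing_payload)
print(unknown_payload)
```

Queues without a registry entry fall back to the base model with no vocabulary attached, which keeps an unmapped queue from inheriting another client's terms.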
What happens if a term appears in both custom vocabulary and spelling?
The safest practice is to avoid overlap: use custom vocabulary for phoneme-level correction and custom spelling for string normalization on the output. If a term requires both (common for branded acronyms), add it to custom vocabulary with a pronunciation hint and add expected misspellings to custom spelling separately. Test the combined configuration against your golden set to confirm the pipeline produces the intended output.
Pricing and feature availability
Custom vocabulary, diarization powered by pyannoteAI's Precision-2 model, translation across 100+ supported languages, and sentiment analysis are all included in our base rate: $0.61/hr for async on Starter and as low as $0.20/hr on Growth. No per-feature add-on fees appear after the fact, which matters when you are modeling cost-per-contact at scale. See our pricing page for the full tier breakdown.
FAQs
What is the difference between custom vocabulary and custom spelling?
Custom vocabulary adjusts the decoder's behavior during inference to favor target terms based on phoneme matching, making it effective for brand names, acronyms, and technical terms with unusual sound patterns. Custom spelling performs literal string replacement on the final transcript and requires explicit listing of each variant. For most contact center terms, use both in sequence: custom vocabulary for phoneme-level correction, custom spelling for orthographic standardization on the output.
How does the intensity parameter affect transcription accuracy?
The intensity parameter controls how aggressively the decoder searches for a target term during inference. Setting it too high causes the model to surface the target term when the audio contains a phonetically similar but different word, producing false positives. Setting it too low reduces recall on the terms you care about. The default is 0.5. Test against your golden set and adjust based on whether terms are being missed or over-triggered.
When does runtime vocabulary injection outperform model fine-tuning?
Runtime vocabulary injection is the better fit for most contact center teams: it deploys without retraining, updates instantly as products change, and is included in the base rate. Model fine-tuning makes sense when the vocabulary is stable, the domain has unique acoustic conditions not addressed by pronunciation hints, and the team has the labeled audio and engineering time to justify it. For dynamic vocabularies that change with quarterly product releases, runtime lists on Solaria-1 are the lower-cost and faster path.
Does enabling custom vocabulary affect transcription latency?
Custom vocabulary is injected at the decoder level during inference; model weights are not retrained. Processing speed remains consistent whether or not a custom vocabulary list is active.
How often should I update my custom vocabulary list?
Update the list on the same cadence as your product release cycle, and run a keyword error rate check against your golden test set after each update. Prune any terms that Solaria-1 already transcribes correctly at 100% accuracy, since redundant entries expand the decoder's search space without contributing accuracy gains.
What is word error rate (WER)?
Word error rate is the number of word-level substitutions, deletions, and insertions in a transcript divided by the number of words in the reference text, usually expressed as a percentage. For domain-specific QA, compute keyword WER on your target entity set rather than relying solely on global WER, which can mask critical failures on high-value terms.
Can Gladia's async diarization be combined with custom vocabulary?
Yes. Speaker diarization powered by pyannoteAI's Precision-2 model and custom vocabulary are both available in Gladia's async pipeline and can be enabled in the same request. Diarization is available in async workflows only, not in real-time transcription, so if speaker attribution is needed alongside vocabulary correction, the async workflow is the correct configuration.
Key terms glossary
Custom vocabulary: A configuration feature that adjusts a speech-to-text model's decoding behavior to improve recognition of specific technical terms, brand names, or acronyms. Some platforms support runtime injection while others require build-time configuration.
Custom spelling: A post-processing text normalization feature that performs literal string replacement on the final transcript to correct consistently misspelled or misformatted terms.
Acoustic confusion: A transcription error where the model maps an unfamiliar sound sequence to a phonetically similar but semantically incorrect word because the target term was not represented in its training data.
Keyword error rate (KER): A domain-specific accuracy metric calculated as (falsely recognized keywords + missed keywords) / total target keywords in the reference data, providing a targeted measure for contact center and compliance use cases.
Phoneme: The smallest unit of sound in a language that distinguishes one word from another. Phoneme-based vocabulary correction maps the target term's sound pattern into the decoder's search space to improve recognition.
Runtime injection: The capability to pass configuration parameters (such as a vocabulary list) to a model at inference time, enabling instant updates without redeployment cycles.