Custom vocabulary support in speech-to-text: how to teach the model your terms

Published on June 26, 2026

by Ani Ghazaryan

TL;DR: Runtime vocabulary lists suit dynamic, frequently changing term sets; model fine-tuning is justified only for stable, fixed-domain vocabularies with unique acoustic conditions. The critical constraint is list length: adding terms that don't appear in your audio expands the decoder's search space and can degrade accuracy on entries that were already transcribed correctly. Measure improvement using keyword error rate on your specific entity set, not a global accuracy score, which masks failures on high-value terms. Update your vocabulary list on the same cadence as your product release cycle and prune any entries Solaria-1 already handles correctly.

When a generic speech-to-text model mishears "Salesforce CPQ" as "sales for SCPQ" or drops a proprietary product code entirely, the damage compounds fast. The CRM entry is wrong, the automated QA scorecard misses the product mention, and the LLM generating the post-call summary classifies the interaction as unresolved because the product name never appeared in the transcript. None of this fails loudly, and by the time a QA analyst catches it during manual sampling, the error has already propagated through dozens of calls.

This guide covers the API setup, list curation strategy, and measurement approach that contact center teams need to make custom vocabulary work in production.

When to use custom vocabulary in production

Solving transcription gaps for technical terms

Generic speech-to-text models can struggle with domain-specific terms in contact center audio. The typical failure is acoustic confusion: the model maps an unfamiliar sound sequence to the closest phoneme chain it knows and produces plausible but wrong output. When that happens, intent routing fails at the transcription layer, before the audio ever reaches the natural language understanding (NLU) layer. QA automation is only as reliable as the transcription data feeding it, as call transcription accuracy benchmarks show. Custom vocabulary can help close the gap on domain-specific terms.

Mapping CRM entities for better accuracy

Custom vocabulary acts as a direct bridge between raw audio and structured CRM data. When an agent mentions a specific product tier, contract term, or internal team name, the STT layer needs to surface that exact string for downstream named entity recognition and LLM workflows to operate correctly. If the transcription misses it, your CRM gets populated with either a blank field or a hallucinated substitute.

Every downstream output is ceiling-bounded by STT accuracy. A wrong CRM entry, a wrong translation, and a wrong coaching score all trace back to a single transcription error at the first layer. For operations running automated QA on 100% of interactions, fixing errors at the vocabulary level is far cheaper than catching them downstream.

Vocabulary vs. spelling: choosing the right fix

Custom vocabulary and custom spelling solve different problems, and conflating them leads to misconfigured pipelines. Some providers use the terms interchangeably, but we treat them as distinct tools that work best in combination.

| Feature | Custom vocabulary (phoneme-based) | Custom spelling (string replacement) |
| --- | --- | --- |
| Primary mechanism | Adjusts decoding behavior to favor target terms during inference | Performs literal text string replacement on the final transcript |
| Best use case | Brand names, technical terms, acronyms | Correcting consistent misspellings, standardizing abbreviations |
| Coverage of phonetic variants | High (can catch acoustic variants of similar sounds) | Limited (requires explicit listing) |
| Provider support | Gladia (native), AWS Transcribe, Rev.ai | Gladia (native via `custom_spelling_config`), some others |

Custom spelling does not phonemize anything. It finds exact text matches and replaces them, which makes it safe from false positives but means it requires explicit listing of variants. Custom vocabulary can catch phonetic variants because similar sounds may map to the target output. For a contact center term like "XB-7 Pro," you want custom vocabulary handling phoneme matching and custom spelling normalizing the output format.

For true homophones (words that are pronounced identically, like "write" and "right"), the challenge is semantic disambiguation rather than phonetic differentiation, since both forms share the same phoneme pattern. Custom vocabulary biases the decoder toward specific phoneme sequences but cannot resolve which spelling the speaker intended when multiple valid options exist. Add the target term to your vocabulary list and use custom spelling for any downstream orthographic cleanup needed.
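To make the division of labor concrete, the sketch below mimics what a literal string-replacement pass does to a finished transcript. It is an illustration only, not Gladia's implementation: the function name `apply_custom_spelling` and the dictionary shape (target spelling mapped to a list of variants) are assumptions chosen for the example.

```python
import re

def apply_custom_spelling(transcript: str, spelling: dict[str, list[str]]) -> str:
    """Replace each listed variant with its target spelling (whole phrases, case-insensitive)."""
    for target, variants in spelling.items():
        # Try longer variants first so a long phrase is not partially rewritten by a shorter one.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            # A function replacement avoids any backslash handling in the target string.
            transcript = re.sub(pattern, lambda _m, t=target: t, transcript, flags=re.IGNORECASE)
    return transcript

# Hypothetical variants for the article's "XB-7 Pro" example.
spelling = {"XB-7 Pro": ["xb7 pro", "x b seven pro"]}
fixed = apply_custom_spelling("The customer upgraded to the x b seven pro.", spelling)
print(fixed)
```

A variant that is not listed, such as "xb 7 pro" here, passes through unchanged, which is the explicit-listing limitation described above.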

API integration for custom term sets

Secure API access for custom terms

One operational constraint applies across all providers before you build your term list: as a best practice, avoid entering personally identifiable information (PII) or protected health information (PHI) into custom vocabulary files. Use anonymized tokens for any terms that might intersect with customer identifiers, then post-process results to reinsert sensitive values from a secure vault if needed.

Optimizing custom term lists for scale

The choice between runtime vocabulary lists and full model fine-tuning comes down to deployment speed, cost, and the nature of the vocabulary problem.

| Approach | Development effort | Accuracy lift on jargon | Cost | Best for |
| --- | --- | --- | --- | --- |
| Runtime lists (Gladia, AWS) | Fast to deploy | Can improve targeted term recognition | Included in base rate (Gladia) | Dynamic vocabularies that change with product releases |
| Model fine-tuning | Requires labeled domain audio | Can improve acoustic environment adaptation | Significant (infrastructure and engineering time) | Fixed-domain, stable vocabulary with unique acoustic conditions |

Runtime vocabulary injection biases the decoder's language model at inference time without retraining weights. This is why our async pipeline maintains fast processing speeds even with a custom term list loaded.

Teams already using OpenAI's stack frequently evaluate the Whisper API as a candidate for domain vocabulary work, making it worth addressing directly: the Whisper API has no native word-boosting system, offering only a prompt parameter limited to 224 tokens of spelling hints. That ceiling is insufficient for domain vocabularies of any meaningful size. It can also hallucinate on telephony audio, a known failure mode on 8kHz call recordings that Solaria-1 is specifically designed to avoid in production environments.

Boosting recall for specialized terminology

The custom_vocabulary parameter in Gladia's API accepts the following object structure for each term:

```json
{
  "audio_url": "https://your-storage.example.com/call-recording.wav",
  "custom_vocabulary": [
    "XB-7 Pro",
    { "value": "Salesforce CPQ" },
    {
      "value": "Westeros",
      "pronunciations": ["Wes-teh-ros"],
      "intensity": 0.5,
      "language": "en"
    }
  ]
}
```

The value field holds the correctly spelled target term. The intensity property controls how strongly the decoder biases toward that term during the acoustic search, and the pronunciations array accepts phonetic spelling hints. Review the full parameter reference in our custom vocabulary docs before configuring intensity values at scale, as setting intensity too high can cause the model to surface a target term when the audio contains a phonetically similar but different word.

A Python implementation using the standard requests library follows this pattern:

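```python
import requests

# Authenticate with your Gladia API key and send a JSON body.
headers = {
    "x-gladia-key": "YOUR_API_KEY",
    "Content-Type": "application/json",
}

payload = {
    "audio_url": "https://your-storage.example.com/call.wav",
    "diarization": True,
    "custom_vocabulary": [
        "XB-7 Pro",  # plain string form
        {"value": "Salesforce CPQ"},  # object form without extra hints
        {
            # object form with a phonetic hint and an explicit intensity
            "value": "TurboTax Advantage",
            "pronunciations": ["Turbo Tax Advantidge"],
            "intensity": 0.5,
        },
    ],
}

response = requests.post(
    "https://api.gladia.io/v2/transcription",
    headers=headers,
    json=payload,
    timeout=30,  # fail fast instead of hanging on a network stall
)
response.raise_for_status()  # surface HTTP errors instead of printing an error body

print(response.json())
```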

Pronunciation hints for ambiguous terms

The pronunciations array handles terms where standard orthography does not reflect how a word sounds in conversational telephony audio. For a brand name like "Xiaomi" spoken by a bilingual agent, passing a phonetic approximation like "Shao-mee" helps the decoder align the incoming acoustic pattern to the target output. Stack multiple pronunciation variants in the array when the same term is used by agents with different accent backgrounds, which is common in BPO environments where a single queue serves agents across the Philippines, India, and Mexico. For multilingual term sets, review our code-switching documentation for handling mid-sentence language transitions, which compounds the pronunciation challenge.
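Stacking variants by hand gets error-prone once several regions contribute hints for the same term. The helper below is a minimal sketch of building one entry in the object shape shown earlier (`value`, `pronunciations`, `intensity`, `language`). The function name, the second hint "Zhao-mee", and the 0 to 1 intensity range check are assumptions for illustration, not documented behavior.

```python
def vocabulary_entry(value, pronunciations=None, intensity=None, language=None):
    """Build one custom_vocabulary entry, dropping duplicate pronunciation hints."""
    entry = {"value": value}
    if pronunciations:
        seen, unique = set(), []
        for hint in pronunciations:
            key = hint.strip().casefold()
            if key and key not in seen:  # skip blanks and case-insensitive repeats
                seen.add(key)
                unique.append(hint.strip())
        entry["pronunciations"] = unique
    if intensity is not None:
        # Assumption: intensity is a 0-1 scale, consistent with the 0.5 default.
        if not 0.0 <= intensity <= 1.0:
            raise ValueError("intensity is assumed to lie between 0 and 1")
        entry["intensity"] = intensity
    if language:
        entry["language"] = language
    return entry

# "Zhao-mee" is a hypothetical second variant contributed by another region.
xiaomi = vocabulary_entry("Xiaomi", ["Shao-mee", "shao-mee", "Zhao-mee"], intensity=0.5)
print(xiaomi)
```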

Curating custom lists from existing call logs

Automating lists from CRM and SKU data

The fastest way to build an initial vocabulary list is to pull it directly from your CRM's product catalog and SKU tables. Most CCaaS platforms expose a REST endpoint or database view of active product names, and a simple ETL job can extract, deduplicate, and format these as a JSON array. Include:

Active product names and version numbers
Compliance disclosure phrases that must be transcribed verbatim
Frequently mentioned partner integrations (for example, Salesforce, Zendesk, HubSpot)
Competitor names that agents are trained to mention by policy
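The extract, deduplicate, and format step can be sketched in a few lines. The row shape (`name`, `active`) and the sample catalog, including the retired "XB-5" product, are hypothetical; adapt the field names to whatever your CRM export actually returns.

```python
import json

def build_vocabulary(rows: list[dict]) -> list[str]:
    """Keep active product names, trimmed and deduplicated case-insensitively."""
    seen = set()
    terms = []
    for row in rows:
        name = (row.get("name") or "").strip()
        if not name or not row.get("active", False):
            continue  # skip blanks and retired products
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            terms.append(name)
    return terms

# Hypothetical catalog export.
catalog = [
    {"name": "XB-7 Pro", "active": True},
    {"name": " xb-7 pro ", "active": True},   # duplicate with different casing
    {"name": "XB-5", "active": False},        # retired SKU
    {"name": "Salesforce CPQ", "active": True},
]
vocabulary = build_vocabulary(catalog)
print(json.dumps({"custom_vocabulary": vocabulary}, indent=2))
```

The first spelling encountered wins, so sort the export by your preferred source of truth before running it.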

How to isolate transcription errors

Mining existing transcripts for acoustic confusion patterns extends your vocabulary list beyond the initial CRM pull. The process follows three steps:

1. Collect a sample: Pull a representative set of transcripts from recent calls covering your highest-volume product lines and key operational regions.
2. Run a diff against ground truth: If you have human-verified transcripts from QA sampling, compute word error rate on domain-specific terms rather than the full transcript. Pay attention to substitution errors, where the model produced a real word instead of the target, since these are harder to catch in downstream analysis.
3. Cluster by error pattern: Group errors by the target term being misrecognized. "SaaS" transcribed as "sass" or "sauce" is an example of phoneme-level confusion that custom vocabulary resolves in one entry. As the custom vocabulary vs. custom spelling guide notes, the teams producing the cleanest transcripts are the ones who diagnose each error before configuring anything, rather than dumping their entire product catalog into the API.
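The diff and cluster steps above can be sketched with the standard library. This version assumes single-word target terms and transcripts that are already stripped of punctuation; `confusion_clusters` and the sample sentences are illustrative, and `difflib` produces a reasonable word alignment rather than a guaranteed minimum edit distance.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def confusion_clusters(pairs, target_terms):
    """Group substitution errors by the target term the model got wrong.

    pairs: iterable of (reference, hypothesis) transcript strings.
    """
    targets = {t.casefold() for t in target_terms}
    clusters = defaultdict(Counter)
    for reference, hypothesis in pairs:
        ref = reference.casefold().split()
        hyp = hypothesis.casefold().split()
        matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "replace":
                continue  # only substitutions, the hard-to-spot case
            wrong = " ".join(hyp[j1:j2])
            for word in ref[i1:i2]:
                if word in targets:
                    clusters[word][wrong] += 1
    return clusters

# Made-up sample pairs based on the "SaaS" example above.
pairs = [
    ("our saas plan renews monthly", "our sass plan renews monthly"),
    ("the saas tier is cheaper", "the sauce tier is cheaper"),
    ("the saas tier is cheaper", "the sass tier is cheaper"),
]
clusters = confusion_clusters(pairs, ["SaaS"])
print(clusters["saas"].most_common())
```

Terms with a large, consistent cluster are strong vocabulary candidates; terms with scattered one-off errors usually point to audio quality instead.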

Selecting terms for maximum impact

Longer vocabulary lists are not better. Additional terms that do not appear in the audio can degrade word error rate because the decoder expands its search space unnecessarily. Keep lists short and targeted for optimal performance.

Apply a simple prioritization filter before submitting your list:

Frequency: Prioritize terms that appear regularly in your calls over rare terms that have minimal impact on overall accuracy.
Error rate on the base model: Test Solaria-1 against your sample set first. Remove any terms the model already transcribes correctly across your audio conditions.
Downstream impact: Prioritize terms that feed CRM fields, QA scorecard categories, or compliance disclosure tracking. A missed compliance phrase in a regulated disclosure is a liability, not just an accuracy gap.
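The filter above can be written as a small scoring function. The `Candidate` fields, the `min_calls` and `max_terms` thresholds, and the sample numbers are all assumptions for illustration; tune them against your own call data.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    term: str
    calls_with_term: int      # sampled calls that mention the term
    base_error_rate: float    # share of mentions the base model got wrong (0.0-1.0)
    feeds_downstream: bool    # populates a CRM field, QA category, or compliance check

def prioritize(candidates, min_calls=5, max_terms=50):
    """Drop terms the base model already gets right, then rank the rest."""
    kept = [
        c for c in candidates
        if c.base_error_rate > 0
        and (c.calls_with_term >= min_calls or c.feeds_downstream)
    ]
    # Downstream-critical terms first, then by expected number of errors.
    kept.sort(
        key=lambda c: (c.feeds_downstream, c.calls_with_term * c.base_error_rate),
        reverse=True,
    )
    return [c.term for c in kept[:max_terms]]

# Illustrative measurements, not real benchmark data.
shortlist = prioritize([
    Candidate("XB-7 Pro", 120, 0.4, True),
    Candidate("Salesforce CPQ", 80, 0.0, True),      # already correct: pruned
    Candidate("Westeros", 2, 0.9, False),            # too rare, no downstream use
    Candidate("TurboTax Advantage", 30, 0.2, False),
])
print(shortlist)
```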

Optimizing your vocabulary strategy for evolving CX needs

Version control for vocabulary lists

Product names change, SKUs retire, and new service tiers launch on quarterly cycles. Many teams manage vocabulary lists using version control and standard deployment practices to track changes and enable rollbacks if a new term set introduces a regression.

Every vocabulary update should be tested against a golden test set before deployment. The test set is a small corpus of annotated transcripts covering your highest-priority term categories. After adding new terms, run the updated list against the test set and compare keyword WER on existing terms before and after. If accuracy on existing terms drops, a new entry is likely phonetically similar to an established term and needs a pronunciation hint to distinguish it. Our async pipeline maintains fast processing speeds, so you can iterate through test set runs quickly without blocking a release.
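A before-and-after comparison like this is easy to automate as a release gate. The sketch assumes you already have per-term error rates from two golden-set runs; the function name, the `tolerance` parameter, and the sample values (including the new "XB-7 Max" entry) are hypothetical.

```python
def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.0) -> dict[str, tuple[float, float]]:
    """Return existing terms whose error rate rose by more than `tolerance`."""
    return {
        term: (before[term], after[term])
        for term in before
        if term in after and after[term] - before[term] > tolerance
    }

# Illustrative per-term error rates from two golden-set runs.
before = {"XB-7 Pro": 0.10, "Salesforce CPQ": 0.05}
after = {"XB-7 Pro": 0.08, "Salesforce CPQ": 0.12, "XB-7 Max": 0.20}  # "XB-7 Max" is the new entry

flagged = regressions(before, after)
print(flagged)
```

Newly added terms are ignored on purpose: the check only asks whether established terms got worse, which is the signal that a new entry is colliding with an old one.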

Eliminating redundant model terms

Solaria-1 handles a range of common technical terms correctly out of the box. Before adding a term to your custom vocabulary list, test the base model against your sample audio. Any term achieving 100% accuracy across your test conditions can be removed from the list, keeping your API payloads lean and reducing the probability of interference between adjacent vocabulary entries. This pruning habit is especially important for multilingual contact center operations where the same call queue handles speakers across 5-10 languages, each with a different baseline accuracy profile.

Measuring custom vocabulary impact on WER

Baseline WER on telephony calls without custom vocabulary

Establish a baseline before enabling custom vocabulary to quantify the improvement and build a credible ROI case for leadership. Test against your actual production environment: 8kHz telephony audio, realistic background noise, and a representative sample of agent accents from your BPO regions. Global WER captures aggregate performance but may mask domain-specific failures. A model achieving high global word accuracy can still transcribe product SKUs incorrectly if those SKUs are phonetically ambiguous. As documented in call transcription accuracy benchmarks, keyword-level accuracy on the specific entities your downstream systems depend on is the metric that matters for QA automation and CRM data quality.

Quantifying KER lift in production

Global WER gives you the aggregate picture, but keyword error rate (KER) is where you measure whether vocabulary configuration is actually working on the terms that matter. It is a direct, domain-specific metric rather than a diluted aggregate. The formula is:

KER = (F + M) / N

Where N is the number of target keywords in the reference data, F is falsely recognized keywords, and M is missed keywords. To compute your lift:

1. Transcribe your baseline sample without custom vocabulary and calculate KER for your target term set.
2. Enable custom vocabulary with your curated list, transcribe the same sample, and recalculate KER.
3. Express the delta as a percentage reduction in KER.

A reduction in KER on product names translates directly to fewer manual QA corrections on CRM entries, fewer misclassified interactions in your sentiment pipeline, and improved confidence in automated coaching scores.
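The calculation above can be sketched as follows. This version approximates F and M by comparing per-transcript keyword counts (a keyword that appears more often in the hypothesis than in the reference counts as falsely recognized, fewer counts as missed); that counting rule, the function names, and the sample transcripts are assumptions for illustration.

```python
import re
from collections import Counter

def keyword_counts(text, keywords):
    """Count whole-phrase, case-insensitive occurrences of each keyword."""
    lowered = text.casefold()
    counts = Counter()
    for kw in keywords:
        pattern = r"(?<!\w)" + re.escape(kw.casefold()) + r"(?!\w)"
        counts[kw] = len(re.findall(pattern, lowered))
    return counts

def keyword_error_rate(references, hypotheses, keywords):
    """KER = (F + M) / N over paired reference and hypothesis transcripts."""
    n = missed = false = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = keyword_counts(ref, keywords), keyword_counts(hyp, keywords)
        for kw in keywords:
            n += r[kw]
            missed += max(r[kw] - h[kw], 0)
            false += max(h[kw] - r[kw], 0)
    if n == 0:
        raise ValueError("no target keywords in the reference data")
    return (false + missed) / n

keywords = ["XB-7 Pro", "Salesforce CPQ"]
refs = ["I want to upgrade to XB-7 Pro with Salesforce CPQ", "Does XB-7 Pro ship today"]
baseline = ["I want to upgrade to x b seven pro with sales for SCPQ", "Does XB-7 Pro ship today"]
boosted = ["I want to upgrade to XB-7 Pro with sales force CPQ", "Does XB-7 Pro ship today"]

ker_before = keyword_error_rate(refs, baseline, keywords)
ker_after = keyword_error_rate(refs, boosted, keywords)
reduction = (ker_before - ker_after) / ker_before
print(f"KER {ker_before:.2f} -> {ker_after:.2f} ({reduction:.0%} reduction)")
```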

Isolating domain terms for WER gains

Biased WER (B-WER) measures the error rate on the words in your biased entity set, that is, the terms in your custom vocabulary list, while unbiased WER (U-WER) measures the error rate on all other words. Tracking both shows whether your custom vocabulary is improving recognition of target terms without degrading performance on general vocabulary.
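One way to compute the two numbers is to align each reference and hypothesis by edit distance and attribute every error to a bucket. The sketch below assumes single-word biased terms and punctuation-free text, and it assigns insertions by whether the inserted word is in the biased set; the function names and the sample sentence are illustrative.

```python
def align(ref, hyp):
    """Minimum edit-distance alignment as (ref_word | None, hyp_word | None) pairs."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    pairs, i, j = [], len(ref), len(hyp)
    while i > 0 or j > 0:  # walk back from the bottom-right corner
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # insertion
            j -= 1
    return pairs[::-1]

def biased_unbiased_wer(reference, hypothesis, biased_words):
    biased = {w.casefold() for w in biased_words}
    ref, hyp = reference.casefold().split(), hypothesis.casefold().split()
    errors, totals = {"b": 0, "u": 0}, {"b": 0, "u": 0}
    for r, h in align(ref, hyp):
        if r is not None:
            bucket = "b" if r in biased else "u"
            totals[bucket] += 1
            errors[bucket] += r != h  # substitution or deletion
        else:
            errors["b" if h in biased else "u"] += 1  # insertion
    rate = lambda e, t: e / t if t else None
    return {"b_wer": rate(errors["b"], totals["b"]), "u_wer": rate(errors["u"], totals["u"])}

result = biased_unbiased_wer(
    "please add saas billing to the account",
    "please add sass billing to account",
    ["SaaS"],
)
print(result)
```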

Reporting transcription accuracy lift

Translate KER improvements into operational outcomes your QA Director and VP of Operations already track: calculate aggregate QA hours saved from reduced manual correction time, track the drop in blank or mismatched CRM product fields, and measure improved compliance phrase detection rates for regulated disclosures. These metrics produce defensible data for your next vendor renewal or technology audit.

Preventing transcription errors in production

Managing vocabulary list constraints

Each provider imposes different technical limits on vocabulary configurations. Staying within those limits prevents silent degradation where the API accepts the list but truncates it at runtime.

| Provider | Max list size (English) | Max list size (other languages) | Max entry length |
| --- | --- | --- | --- |
| Gladia (Solaria-1) | Not explicitly capped; optimize for relevance | Not explicitly capped; optimize for relevance | Not explicitly stated; keep entries concise |
| AWS Transcribe | Consult provider documentation | Consult provider documentation | Consult provider documentation |
| Rev.ai | Consult provider documentation | Consult provider documentation | Consult provider documentation |

Check provider documentation for the latest limits and recommendations.

Calibrating your boost parameters

The intensity parameter controls how aggressively the decoder searches for a term. Setting intensity too high causes the model to surface the target term when the audio contains a phonetically similar but different word. Setting it too low reduces recall. The default intensity is 0.5. Test against your golden set and adjust based on whether terms are being missed or over-triggered.
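If you record missed terms and false triggers for a handful of intensity values on the golden set, picking a setting becomes a small selection problem. The sketch below is one possible rule, not a recommended procedure: it minimizes total errors and breaks ties toward the 0.5 default. The function name and every number in `results` are made up for illustration.

```python
def pick_intensity(results: dict[float, tuple[int, int]], default: float = 0.5) -> float:
    """Choose the intensity with the fewest combined errors on the golden set.

    results maps intensity -> (missed_terms, false_triggers).
    Ties go to the value closest to the default.
    """
    return min(results, key=lambda k: (sum(results[k]), abs(k - default)))

# Illustrative golden-set measurements: (missed, false triggers) per setting.
results = {
    0.3: (6, 0),   # too low: terms are missed
    0.5: (3, 1),
    0.7: (1, 3),
    0.9: (0, 9),   # too high: similar-sounding words over-trigger
}
best = pick_intensity(results)
print(best)
```

If a false trigger costs you more than a miss (for example, on compliance phrases), weight the two counts differently instead of summing them.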

Fixing errors in homophone transcription

For words with identical pronunciation but different spellings ("write" vs. "right"), the challenge is semantic disambiguation, since the phoneme patterns are the same. Custom vocabulary can bias the decoder toward the form on your list, but it cannot determine which spelling the speaker intended when both are valid, so custom spelling handles any residual orthographic standardization on the output. Use the two tools in sequence: vocabulary correction for phoneme-level handling, spelling correction for orthographic cleanup.

Troubleshooting common vocabulary setup issues

Setting custom vocabulary size limits

If your initial list exceeds recommended limits, split it by logical domain: one list per major product family, per call queue, or per language group for multilingual operations. For multi-tenant CCaaS platforms, maintain a vocabulary registry keyed by client or queue ID and load the relevant list at API call time. This segmentation simplifies the pruning and update cycle since each segment corresponds to a discrete operational area with its own review cadence.

Impact of custom terms on latency

We inject custom vocabulary at the decoder level during inference without retraining model weights, so overhead is minimal. Solaria-1 delivers partial transcription in under 103ms, and processing remains fast even with custom vocabulary enabled. For 10-minute call recordings, the vocabulary payload adds negligible time to overall job duration.

Can I use multiple vocabulary lists per account?

Yes. You can pass a different custom_vocabulary array per API call by constructing the request payload dynamically. For multi-tenant CCaaS platforms, maintain a vocabulary registry keyed by client or queue ID, then load the relevant list at API call time. This is consistent with our CRM integration guide, where per-call configuration is assembled dynamically from a centralized rules store.
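A registry lookup at request time can be as simple as the sketch below. The queue IDs, the in-memory dictionary, and `build_payload` are hypothetical; in production the registry would more likely live in a config service or database. The payload shape follows the request example earlier in this guide.

```python
# Hypothetical registry: one vocabulary list per call queue.
VOCABULARY_REGISTRY = {
    "queue-billing": ["Salesforce CPQ", {"value": "TurboTax Advantage", "intensity": 0.5}],
    "queue-hardware": ["XB-7 Pro"],
}

def build_payload(audio_url: str, queue_id: str, registry=VOCABULARY_REGISTRY) -> dict:
    """Assemble a per-call request body with the vocabulary for this queue."""
    payload = {"audio_url": audio_url}
    terms = registry.get(queue_id, [])
    if terms:
        # Copy so later edits to the payload never mutate the shared registry.
        payload["custom_vocabulary"] = list(terms)
    return payload

billing = build_payload("https://your-storage.example.com/call.wav", "queue-billing")
unknown = build_payload("https://your-storage.example.com/call.wav", "queue-unknown")
print(billing)
print(unknown)
```

Queues with no registered list fall back to the base model rather than inheriting another queue's terms, which keeps each list short and relevant.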

What happens if a term appears in both custom vocabulary and spelling?

The safest practice is to avoid overlap: use custom vocabulary for phoneme-level correction and custom spelling for string normalization on the output. If a term requires both (common for branded acronyms), add it to custom vocabulary with a pronunciation hint and add expected misspellings to custom spelling separately. Test the combined configuration against your golden set to confirm the pipeline produces the intended output.

Pricing and feature availability

Custom vocabulary, diarization powered by pyannoteAI's Precision-2 model, translation across 100+ supported languages, and sentiment analysis are all included in our base rate: $0.61/hr for async on Starter and as low as $0.20/hr on Growth. No per-feature add-on fees appear after the fact, which matters when you are modeling cost-per-contact at scale. See our pricing page for the full tier breakdown.

FAQs

What is the difference between custom vocabulary and custom spelling?

Custom vocabulary adjusts the decoder's behavior during inference to favor target terms based on phoneme matching, making it effective for brand names, acronyms, and technical terms with unusual sound patterns. Custom spelling performs literal string replacement on the final transcript and requires explicit listing of each variant. For most contact center terms, use both in sequence: custom vocabulary for phoneme-level correction, custom spelling for orthographic standardization on the output.

How does the intensity parameter affect transcription accuracy?

The intensity parameter controls how aggressively the decoder searches for a target term during inference. Setting it too high causes the model to surface the target term when the audio contains a phonetically similar but different word, producing false positives. Setting it too low reduces recall on the terms you care about. The default is 0.5. Test against your golden set and adjust based on whether terms are being missed or over-triggered.

When does runtime vocabulary injection outperform model fine-tuning?

Runtime vocabulary injection is the better fit for most contact center teams: it deploys without retraining, updates instantly as products change, and is included in the base rate. Model fine-tuning makes sense when the vocabulary is stable, the domain has unique acoustic conditions not addressed by pronunciation hints, and the team has the labeled audio and engineering time to justify it. For dynamic vocabularies that change with quarterly product releases, runtime lists on Solaria-1 are the lower-cost and faster path.

Does enabling custom vocabulary affect transcription latency?

Custom vocabulary is injected at the decoder level during inference; model weights are not retrained. Processing speed remains consistent whether or not a custom vocabulary list is active.

How often should I update my custom vocabulary list?

Update the list on the same cadence as your product release cycle, and run a keyword error rate check against your golden test set after each update. Prune any terms that Solaria-1 already transcribes correctly at 100% accuracy, since redundant entries expand the decoder's search space without contributing accuracy gains.

What is word error rate (WER)?

Word error rate is the number of substituted, deleted, and inserted words in a transcript divided by the number of words in the reference text, usually expressed as a percentage. For domain-specific QA, compute keyword WER on your target entity set rather than relying solely on global WER, which can mask critical failures on high-value terms.

Can Gladia's async diarization be combined with custom vocabulary?

Yes. Speaker diarization powered by pyannoteAI's Precision-2 model and custom vocabulary are both available in Gladia's async pipeline and can be enabled in the same request. Diarization is available in async workflows only, not in real-time transcription, so if speaker attribution is needed alongside vocabulary correction, the async workflow is the correct configuration.

Key terms glossary

Custom vocabulary: A configuration feature that adjusts a speech-to-text model's decoding behavior to improve recognition of specific technical terms, brand names, or acronyms. Some platforms support runtime injection while others require build-time configuration.

Custom spelling: A post-processing text normalization feature that performs literal string replacement on the final transcript to correct consistently misspelled or misformatted terms.

Acoustic confusion: A transcription error where the model maps an unfamiliar sound sequence to a phonetically similar but semantically incorrect word because the target term was not represented in its training data.

Keyword error rate (KER): A domain-specific accuracy metric calculated as (falsely recognized keywords + missed keywords) / total target keywords in the reference data, providing a targeted measure for contact center and compliance use cases.

Phoneme: The smallest unit of sound in a language that distinguishes one word from another. Phoneme-based vocabulary correction maps the target term's sound pattern into the decoder's search space to improve recognition.

Runtime injection: The capability to pass configuration parameters (such as a vocabulary list) to a model at inference time, enabling instant updates without redeployment cycles.
