Your LLM summarization prompt isn't the problem. The reason your meeting notes look sloppy is that your speech-to-text layer is guessing at industry jargon. A transcript that misreads technical terms or brand names hands your downstream LLM a corrupted input it can't recover from, regardless of how well your prompt is written. This guide breaks down why ASR fails on specialized terms and how to implement custom vocabulary at the infrastructure layer using Gladia's async API, covering what that means for the unit economics of your audio pipeline.
Preventing errors in AI's specialized transcripts
The cost of entity recognition errors
A transcription error on a common word is annoying. A transcription error on a named entity is a data integrity failure. When the ASR model misreads a client name or product acronym, that error propagates to every system the transcript feeds: the CRM entry is wrong, the AI summary attributes an action item to the wrong company, and the coaching scorecard scores the wrong outcome.
Our async benchmark methodology demonstrates that Solaria-1 produces 39% fewer errors on key entities compared to leading competitors, and that gap compounds across every meeting your product processes.
Why ASR fails on brand names
Model developers train general-purpose ASR systems on large corpora of conversational speech, news broadcasts, and public audio datasets. The statistical frequency of "Kubernetes," "adalimumab," or "EBITDA" in those datasets is negligible compared to everyday words, so the acoustic model's probability distribution strongly favors common alternatives when it encounters unfamiliar phoneme sequences.
Five concrete failure patterns span every major industry vertical:
- Healthcare: "Adalimumab" gets rendered as "add-a-lim-you-mab" or dropped entirely, forcing clinical teams to manually correct transcripts before they're usable.
- Finance: "CDO" gets transcribed letter by letter as "C-D-O" or interpreted as Chief Data Officer rather than collateralized debt obligation, depending on surrounding context.
- Technology: "OAuth" surfaces as "oh auth." "kubectl" becomes "kube control" or "cube cuddle" depending on the speaker's accent.
- Legal: Latin terms like "res judicata" or "habeas corpus" are phonetically distant from anything in a standard training corpus.
- Manufacturing: Process acronyms like "APQP" or "JIT" get split, abbreviated incorrectly, or omitted entirely.
The challenge isn't model quality in isolation. It's that out-of-vocabulary words require explicit lexicon injection to be recognized reliably, and without it, the acoustic model substitutes the closest phonetic match it knows.
Custom vocab's NLP quality boost
When you provide a custom vocabulary list, the ASR engine biases its language model toward those terms at inference time. An n-gram language model estimates the probability distribution over groups of consecutive words, and by modifying that distribution to favor domain-specific terms, the model predicts the injected vocabulary as more likely when the acoustic signal is ambiguous. Because the bias is applied at inference time rather than through retraining, there is no custom model setup cost and no waiting weeks for a fine-tuning job to complete.
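As a rough illustration of the biasing mechanism (not Gladia's internal implementation), here's a toy rescoring pass in Python where hypotheses containing injected terms receive a score boost proportional to their intensity:

```python
# Toy illustration of vocabulary biasing, not Gladia's actual decoder.
# Candidate transcriptions are rescored so hypotheses containing injected
# terms become more likely when the acoustic evidence is ambiguous.
custom_vocabulary = {"kubernetes": 0.8, "postgresql": 0.8}  # term -> intensity

def rescore(hypotheses):
    """hypotheses: list of (text, base_log_prob) pairs from the base model."""
    rescored = []
    for text, log_prob in hypotheses:
        boost = sum(
            intensity for term, intensity in custom_vocabulary.items()
            if term in text.lower()
        )
        rescored.append((text, log_prob + boost))  # bias toward injected terms
    return max(rescored, key=lambda pair: pair[1])

# Acoustically similar candidates; the biased score prefers the injected term.
print(rescore([("communities cluster", -4.1), ("kubernetes cluster", -4.6)]))
```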
For product teams, this translates directly to downstream data quality. Every named entity injected correctly increases the reliability of your CRM syncs, AI-generated summaries, and NER output. For a deeper look at how this integrates into an async pipeline, see our meeting assistant architecture guide.
Gladia's approach to custom vocabulary
Managing your custom AI vocabulary
We accept custom vocabulary as an array in the transcription request payload. Each item can be a plain string or an object with optional properties including value (the term), intensity (how strongly to bias toward it), pronunciations (alternate phonetic representations), and language. This gives you fine-grained control over how aggressively the model favors a given term.
Here's how to pass a custom vocabulary array to the Gladia async transcription API:
```json
{
  "audio_url": "YOUR_AUDIO_FILE_URL",
  "custom_vocabulary": [
    "Kubernetes",
    {
      "value": "PostgreSQL",
      "intensity": 0.8,
      "language": "en"
    },
    {
      "value": "OAuth",
      "pronunciations": ["oh-auth"],
      "intensity": 0.6
    },
    {
      "value": "Acme Corporation",
      "intensity": 0.7
    }
  ],
  "language_config": {
    "code_switching": true
  },
  "diarization": true
}
```
The full parameter reference is in the custom vocabulary documentation.
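For reference, here's a minimal Python sketch of submitting that payload. The endpoint path, header name, and response fields are assumptions based on Gladia's v2 docs at the time of writing, so verify them against the current API reference:

```python
# Minimal sketch of submitting the payload above to Gladia's async API.
# Endpoint, header, and response shape should be checked against the docs.
import requests

GLADIA_API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "audio_url": "YOUR_AUDIO_FILE_URL",
    "custom_vocabulary": [
        "Kubernetes",
        {"value": "PostgreSQL", "intensity": 0.8, "language": "en"},
        {"value": "OAuth", "pronunciations": ["oh-auth"], "intensity": 0.6},
        {"value": "Acme Corporation", "intensity": 0.7},
    ],
    "language_config": {"code_switching": True},
    "diarization": True,
}

response = requests.post(
    "https://api.gladia.io/v2/transcription",
    headers={"x-gladia-key": GLADIA_API_KEY},
    json=payload,
)
response.raise_for_status()
job = response.json()
print(job)  # typically includes a job id and a result URL to poll for the transcript
```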
Handling unique terms and custom glossaries
Solaria-1 applies custom vocabulary entries at inference time. The Solaria-1 architecture uses a multilingual model that handles language detection, which means the intensity bias applied to a custom term works across languages when the speaker switches mid-conversation.
This matters for technical product meetings where participants drop acronyms and brand names regardless of which language they're speaking at the time. You don't need separate vocabulary lists per language. One payload, one request, one billing event.
Integration with async transcription API
Custom vocabulary is part of Gladia's base feature set, not a metered add-on. On Starter ($0.61/hr async) and Growth plans, we bundle custom vocabulary alongside diarization, translation, NER, sentiment analysis, and summarization at the published per-hour rate. Whether you model your infrastructure costs at 1,000 or 10,000 hours per month, custom vocabulary incurs no additional charge. We publish the full cost breakdown in our async STT benchmark.
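As a quick sanity check on those volumes, the arithmetic is simple because custom vocabulary adds no per-hour surcharge:

```python
# Back-of-the-envelope cost model using the published Starter async rate;
# custom vocabulary adds nothing on top, so the rate is the whole cost.
STARTER_ASYNC_RATE = 0.61  # USD per audio hour (published Starter rate)

for hours_per_month in (1_000, 10_000):
    monthly_cost = hours_per_month * STARTER_ASYNC_RATE
    print(f"{hours_per_month:>6} hrs/month -> ${monthly_cost:,.2f}/month")
```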
Crafting custom vocabulary lists for AI accuracy
Defining your product's unique jargon
The most effective starting point is an audit of your support tickets and existing transcripts. Run a frequency analysis on words appearing in customer-facing conversations that your current transcription outputs are getting wrong. Look for three categories: client and partner names, product names and version strings, and internal acronyms your team uses in every meeting. Those are the same entities your note-taker is most likely failing on.
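A rough version of that audit might look like the sketch below; the entity list and transcript locations are placeholders you'd swap for your own CRM exports and transcript store:

```python
# Hypothetical audit sketch: count how often known entities from support
# tickets and CRM data fail to appear in your current ASR output.
from collections import Counter
from pathlib import Path

known_entities = ["Acme Corporation", "PostgreSQL", "OAuth", "Kubernetes"]
miss_counts = Counter()

for path in Path("transcripts").glob("*.txt"):  # existing transcription output
    text = path.read_text(encoding="utf-8").lower()
    for term in known_entities:
        if term.lower() not in text:
            miss_counts[term] += 1  # entity expected but never transcribed

# Highest-miss terms are the best candidates for the custom vocabulary list.
for term, misses in miss_counts.most_common():
    print(f"{term}: missing from {misses} transcripts")
```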
Selecting high-value AI glossary terms
Not every uncommon word belongs in your custom vocabulary list. Our best practice guidance is to submit only words the model currently fails to transcribe correctly, not common words it handles fine. Sending redundant entries inflates your list without improving accuracy and increases the risk of false positives, where the model incorrectly biases toward your injected term in contexts where a different word was actually spoken.
- Prioritize OOV terms: Proper nouns, trademarked names, and acronyms the model has never encountered in training data. Send individual keywords, not full phrases, because the model handles phrase context natively and individual token injection is more precise.
- Avoid inflation: Don't send duplicates or common words the model already handles correctly. Validate your list on representative audio before rolling it out to your full production pipeline.
Formatting brand names and acronyms
Submit the term exactly as you want it to appear in the transcript output. For acronyms, use the spelled-out form if speakers pronounce every letter ("O-Auth") and use the full word form if speakers pronounce it as a word ("OAuth" as "oh-auth" with a pronunciation hint).
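Illustrative entries for both cases might look like this; the pronunciation strings and intensity values are examples, not prescriptions from the API reference:

```python
# Sketch of the two acronym cases: letter-by-letter vs. pronounced as a word.
custom_vocabulary = [
    # Spoken letter by letter: keep the target spelling, hint each letter.
    {"value": "GDPR", "pronunciations": ["g d p r"], "intensity": 0.6},
    # Spoken as a word: keep the brand spelling, hint the spoken form.
    {"value": "OAuth", "pronunciations": ["oh-auth"], "intensity": 0.6},
]
```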
Managing code-switching for AI transcripts
Gladia's code-switching support handles mid-conversation language changes across all 100+ supported languages natively, with no separate configuration required. When you combine language_config: { code_switching: true } with a custom vocabulary list, the model applies your injected terms regardless of which language the surrounding speech is in. Our multilingual meeting transcription guide covers the accuracy benchmarks and implementation details for teams processing European and Asian language audio.
Prompt engineering for entity recognition
Customizing prompts for entity recognition
Once Solaria-1 delivers a custom-vocabulary-corrected transcript, the next step is structured entity extraction via Gladia's Audio-to-LLM pipeline. The accuracy of your injected terms in the transcript is what makes downstream extraction reliable. A correctly transcribed "PostgreSQL" gives the NER layer a clean token to work with. A hallucinated "Post Grass Q L" means the extraction step is operating on corrupted text, and even a well-tuned NER model will misclassify or miss it entirely.
Designing prompts for custom terms
Here is a concrete prompt pattern that uses the corrected Gladia transcript as input and extracts structured entities for CRM automation:
Extract all brand names, product acronyms, and technical terms from the
following transcript. Return a JSON object with arrays for: companies,
products, and technical_terms.
Company names to identify: [Acme Corporation, Oracle, AcmeCorp]
Product acronyms to flag: [OAuth, GDPR, EHR, SOX]
Technical terms: [Kubernetes, microservices, API gateway]
Transcript: "[GLADIA_TRANSCRIPT_OUTPUT]"
Return format:
{
"companies": [],
"products": [],
"technical_terms": []
}
The key principle is prompting the LLM to confirm entities you already know to look for, not discover arbitrary ones. This reduces hallucination risk at the extraction layer because the LLM does pattern matching against a known list rather than open-ended entity discovery.
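One way to enforce that principle in code is to filter the LLM's reply against the known lists before it touches your CRM. The helper below is a hypothetical post-processing step, not part of Gladia's API:

```python
# Hedged post-processing sketch: parse the LLM's JSON reply and keep only
# entities from the known lists, so extraction stays pattern matching rather
# than open-ended discovery. Names and sample data are illustrative.
import json

KNOWN = {
    "companies": {"Acme Corporation", "Oracle", "AcmeCorp"},
    "products": {"OAuth", "GDPR", "EHR", "SOX"},
    "technical_terms": {"Kubernetes", "microservices", "API gateway"},
}

def filter_entities(llm_reply: str) -> dict:
    """Keep only entities that appear in the known lists; drop anything else."""
    extracted = json.loads(llm_reply)
    return {
        field: sorted(set(extracted.get(field, [])) & allowed)
        for field, allowed in KNOWN.items()
    }

reply = '{"companies": ["Acme Corporation", "Initech"], "products": ["OAuth"], "technical_terms": []}'
print(filter_entities(reply))
# {'companies': ['Acme Corporation'], 'products': ['OAuth'], 'technical_terms': []}
```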
Achieving reliable AI transcript output
Refining transcripts for brand accuracy
The Before/After contrast for custom vocabulary is straightforward. Without it, a generic ASR output for a SaaS sales call might contain errors on technical terms and brand names. With Gladia custom vocabulary applied, the same audio produces accurate transcription of terms like "Kubernetes," "PostgreSQL," and company names.
The corrected transcript produces a valid CRM entry, a coherent action item, and an NER output the pipeline can act on, while the hallucinated one produces corrupt data that flows silently into every connected system.
Recognizing varied brand pronunciations
Speakers with different native languages pronounce the same brand name differently. Solaria-1 handles accented speech detection natively, and for terms with a consistent non-standard pronunciation across your user base, add that variant to the pronunciations array in your vocabulary payload.
Managing custom vocabulary glossaries
Custom vocabulary lists need maintenance. Review your list quarterly and cross-reference it against entity errors surfacing in your support tickets. For teams processing audio in new markets or launching new product features, update the list before the feature ships, not after the first customer complaint arrives.
Confirming domain-specific term quality
Precision WER for key entities
Standard word error rate measures errors across all words in the transcript. For meeting note-takers, entity-level WER on injected terms is more operationally relevant. Run this calculation by isolating segments of your evaluation audio that contain custom vocabulary terms and measuring the error rate specifically on those segments.
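A minimal version of that calculation, using illustrative segments, looks like this:

```python
# Sketch of entity-level WER: standard WER, computed only over evaluation
# segments whose reference text contains an injected term. Segments are
# illustrative; substitute pairs from your own evaluation set.
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via edit distance between word sequences."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / max(len(ref), 1)

vocab_terms = {"kubernetes", "postgresql"}
segments = [  # (reference, hypothesis) pairs from your evaluation audio
    ("deploy it on the Kubernetes cluster", "deploy it on the communities cluster"),
    ("thanks for joining today", "thanks for joining today"),
]

entity_segments = [
    (ref, hyp) for ref, hyp in segments
    if any(term in ref.lower() for term in vocab_terms)
]
rates = [wer(ref, hyp) for ref, hyp in entity_segments]
print(f"entity-level WER: {sum(rates) / max(len(rates), 1):.2%}")
```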
Teams using Gladia's custom vocabulary report meaningful WER reductions on entity-heavy segments. Claap is one example of a team processing multi-language audio through Gladia's pipeline.
Custom vocabulary A/B test setup
A structured proof of concept for stakeholders runs as follows:
- Select representative audio: Choose 20-50 calls or meetings containing the terms you intend to inject.
- Run two transcription passes on the same audio: one without custom vocabulary and one with your vocabulary list active.
- Score entity-level accuracy for each injected term across both outputs.
- Map errors to downstream impact: Calculate how many CRM fields, action items, or NER extractions would have failed based on transcript errors in each pass.
- Report results by entity category (client names, product acronyms, technical terms) to show where the improvement is concentrated.
This structure gives you a clear ROI case tied to data integrity metrics your engineering and business stakeholders both care about.
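A lightweight way to score the two passes is sketched below; the transcript text for both runs is illustrative, and you'd substitute the real output from your baseline and custom-vocabulary jobs:

```python
# Scoring sketch for the A/B comparison: baseline_runs and vocab_runs hold
# transcript text from the two passes (however you fetched them); the loop
# counts how often each injected term survives in each pass.
vocab = ["Kubernetes", "PostgreSQL", "Acme Corporation"]

baseline_runs = {  # illustrative output from the pass without custom vocabulary
    "call-001": "we deploy on the communities cluster behind post grass q l",
    "call-002": "acme corporation renewed their contract",
}
vocab_runs = {  # illustrative output from the pass with the vocabulary list active
    "call-001": "we deploy on the Kubernetes cluster behind PostgreSQL",
    "call-002": "Acme Corporation renewed their contract",
}

for term in vocab:
    base = sum(term.lower() in text.lower() for text in baseline_runs.values())
    boosted = sum(term.lower() in text.lower() for text in vocab_runs.values())
    print(f"{term}: {base}/{len(baseline_runs)} baseline -> "
          f"{boosted}/{len(vocab_runs)} with custom vocabulary")
```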
Spotting false brand name matches
A high intensity value on a common phoneme sequence can produce false positives where the model substitutes your injected term when a different word was actually spoken. Per the custom vocabulary documentation, start with a moderate intensity value and adjust based on results. Lower it if you see too many false positives in your evaluation audio. Validate on a representative sample before scaling to production.
Resolving AI's brand name and jargon errors
Contextual disambiguation for custom terms
When two custom terms have similar phoneme sequences, the model uses surrounding context to disambiguate. Async transcription benefits from broader conversational context during processing, which is particularly valuable for meeting note-taker use cases where the conversation's topic signals which technical terms are likely to appear.
How AI confuses brand names and common words
The most common confusion patterns follow predictable acoustic logic:
| Spoken term | Common ASR substitution | Root cause |
| --- | --- | --- |
| Kubernetes | Similar-sounding phrases | Unfamiliar phoneme cluster |
| PostgreSQL | Word-by-word parsing | Compound word split |
| OAuth | Phonetic variants | Acronym with ambiguous vowel |
| Acme Corporation | Phonetic substitutions | Short-vowel substitution |
| Adalimumab | Syllable-by-syllable parsing | Medical OOV term |
For all five patterns, adding the term to your custom vocabulary list with the correct spelling and a phonetic hint where pronunciation is ambiguous resolves the error reliably.
Ensuring accuracy with diverse accents
Solaria-1 supports accented speech across 100+ languages, including 42 languages not covered by other API-level STT providers. For meeting note-takers serving global user bases, this matters because custom vocabulary terms will be pronounced differently by speakers from different regions, and the model needs accent-robust acoustic recognition as the foundation before the vocabulary bias can work correctly.
Here's how Gladia compares to the two most common alternatives on the dimensions that matter most for custom vocabulary workflows:
| Capability | Gladia | Deepgram | AssemblyAI |
| --- | --- | --- | --- |
| Custom vocabulary inclusion | Bundled at base price | Keyword boosting available | Keyterms Prompting available |
| Language coverage | 100+ languages, 42 unique | 40+ languages (Nova-2), 60+ with Nova-3 | 99 languages |
| Code-switching | Native, 100+ languages | Supported on Nova-2 and Nova-3 | Multilingual support available |
| Data training policy (paid plans) | Growth/Enterprise: never used for training. Starter: can be used by default. | Data processing agreement available | Contact for data processing agreement |
| Async WER vs. competitors | Industry-leading performance on conversational speech | Baseline | Baseline |
If you're migrating from either provider, our Deepgram migration guide and AssemblyAI migration guide cover the implementation details.
Start with 10 free hours and test custom vocabulary on your own domain-specific audio. See how Solaria-1 handles brand names, acronyms, and accented speakers before committing to a plan.
FAQs
How often should you update your custom vocabulary glossary?
Review quarterly and cross-reference against entity errors in your support tickets. Update immediately when launching new product features or entering new markets where unfamiliar brand names or terminology will appear in audio.
Does custom vocabulary add latency to async transcription?
Custom vocabulary is applied at inference time within the async pipeline. Gladia's async transcription typically processes one hour of audio in approximately one minute.
What intensity value should you start with for new terms?
Per the custom vocabulary documentation, start with a moderate intensity value and adjust upward only if terms aren't being picked up reliably. Lower it if you see false positives, and validate on representative audio before deploying to production.
What does Solaria-1 output when it can't match a spoken word to anything in the vocabulary list?
When a spoken word doesn't match anything in the vocabulary list, the model produces the closest phonetic match from its training distribution. Terms not in the custom vocabulary list still appear in the transcript based on the model's standard recognition capabilities.
Key terms glossary
ASR (Automatic Speech Recognition): Technology that converts spoken audio into written text automatically, also called speech-to-text or voice recognition. Modern ASR systems use acoustic models, language models, and pronunciation lexicons working together.
NLP (Natural Language Processing): The field of AI focused on how computers understand and process human language. In ASR pipelines, NLP models handle tasks like entity extraction, sentiment analysis, and summarization on top of transcribed text.
WER (Word Error Rate): The standard metric for ASR accuracy, calculated as the total number of substitutions, deletions, and insertions divided by the total number of words in the reference transcript. Lower WER indicates higher accuracy.
Jargon: Domain-specific vocabulary used within a professional community, including acronyms, brand names, and technical terms that rarely appear in general training corpora and are therefore high-risk for ASR substitution errors.
Custom vocabulary: A list of domain-specific terms injected into the ASR engine at inference time to bias the language model's probability distribution toward those terms. In Gladia's API, this is passed as an array of strings or objects in the transcription request payload.
Contextual embeddings: Numerical representations of words that encode their meaning based on surrounding context. Modern language models use contextual embeddings to disambiguate terms with similar phoneme sequences by analyzing the full conversational context around each word.