Teams building CCaaS platforms, shipping meeting assistants, or running contact center operations often encounter unexpected challenges when evaluating transcription APIs on English accuracy benchmarks and then scaling to multilingual markets. Modern platforms vary in their handling of code-switching when bilingual support agents switch languages mid-call, and pricing structures differ across vendors for features like diarization, translation, and entity extraction. This guide compares Tier 1 full-stack platforms and Tier 2 API providers across language coverage, code-switching, accent handling, and pricing at realistic production volumes, so you can make a defensible architecture decision before your unit economics break at scale.
Key metrics for multilingual call analytics selection
Before comparing vendors, you need agreement on what you're actually measuring. English WER in a studio environment may not reliably predict production performance on calls with background noise, multiple speakers, or code-switching scenarios common in global contact centers.
Language-specific accuracy
A provider's headline language count is a marketing number, not a production metric. There are two distinct dimensions that actually matter: breadth and depth.
Breadth means how many languages a model supports at all. Depth means how accurately it transcribes a specific language under real conditions: specific accents, noisy audio, code-switching mid-sentence.
These are not the same thing, and optimizing for one does not guarantee the other. A model that nominally supports 100 languages may have been trained primarily on English and a handful of European languages, so WER on Tagalog, Bengali, or Urdu can be materially higher than headline accuracy figures suggest.
Gladia runs two models that address these dimensions differently:
Solaria-1 is the breadth model: 100+ languages, true mid-conversation code-switching, real-time streaming. Best for global multilingual products, rare languages, and contact centers with diverse caller demographics across Southeast Asia, South Asia, and Latin America.
Solaria-3 is the depth model: highest accuracy on real-world European business audio across English, French, German, Spanish, and Italian. Best for contact centers and CCaaS platforms serving European markets with noisy, accented, or conversational audio. Solaria-3 ranks #1 against AssemblyAI, ElevenLabs, Deepgram, Mistral, and Speechmatics on real customer recordings: 6.4% WER on Earnings22 financial calls (only model under 7%), 33.9% on Switchboard conversational speech (only model under 35%). Solaria-3 is async only.
The real question is what WER looks like on your specific language, accent, and audio condition. The only reliable answer comes from testing on your own production recordings.
Accuracy for diverse accents
Accent handling is a training data problem. Models trained primarily on specific accent groups may produce elevated WER on diverse accents even when the language is nominally "supported." Contact centers processing calls from regions with distinct accents need to test WER on audio from those regions specifically, not rely on aggregate benchmark numbers that may weight certain datasets heavily.
Latency and throughput trade-offs
The threshold that matters depends on the use case. If your workflow is post-call analytics, processing 10 minutes of audio quickly is the relevant metric. Live agent assist and real-time applications typically require final transcript latency under 300ms to avoid disrupting conversation flow. Gladia supports both: async processing and real-time transcription at approximately 300ms latency, with async as the primary strength for call analytics workflows.
Unit economics for AI platforms
The total cost of ownership calculation must include every feature you'll actually use in production. Base rate comparisons look competitive until diarization, translation, sentiment, and NER each carry a separate line item and the final per-hour cost is 2-3x the advertised headline. Model your costs at 1,000, 10,000, and 100,000 hours per month with all features enabled before committing.
Tier 1: Full-stack call analytics platforms
Full-stack platforms give you a pre-built analytics UI, workflow tools for agent coaching, and QA dashboards built for contact center operations teams who want insights without a development investment. The trade-off is pricing structures that may not flex well with audio volume, and language depth documentation that typically lags dedicated STT infrastructure.
CallMiner: unit costs and language support
CallMiner is an enterprise conversation analytics platform with modules for agent coaching and live guidance, with integration APIs for third-party telephony systems.
Where it fits: Large enterprise contact centers that need compliance-focused QA with human review workflows.
Where it falls short: CallMiner pricing is not listed on their public website as of this writing. Contact their sales team directly for cost modeling at your target volume.
Genesys: multilingual accuracy benchmarks
Genesys delivers conversation intelligence within Genesys Cloud for sentiment, empathy, and topic detection. The native integration between speech analytics and workforce engagement management means QA scoring and coaching feed directly from transcription without additional integration work.
Where it fits: Contact centers already on Genesys Cloud who need analytics tightly coupled to their routing and WEM workflows.
Where it falls short: Genesys analytics is an extension of the Genesys platform. If your call recording infrastructure sits outside Genesys, or you're building a custom CCaaS product, that coupling becomes a constraint.
Observe.AI: multilingual accent support and latency
Observe.AI is a conversation intelligence platform that automates QA scoring, enabling performance evaluation across interaction volumes that would overwhelm human QA teams.
Where it fits: Mid-market contact centers automating QA scoring and reducing sampling bias in performance reviews.
Where it falls short: Teams targeting South Asian or Southeast Asian language traffic should test multilingual performance and code-switching capability specifically before committing.
Why choose end-to-end AI analytics?
Full-stack platforms make sense when you're buying a solution for a contact center operations team, not building infrastructure for a product. When your goal is to ship a custom CCaaS feature, a meeting assistant, or a post-call analytics pipeline as part of your own product, a full-stack platform becomes a dependency that limits both your flexibility and your unit economics at scale.
API accuracy for multilingual call transcription
API-first providers give you the transcription and audio intelligence layer, and you build the product logic on top. This adds development work, but you control the pipeline, own the cost structure, and can optimize for the specific audio conditions your users generate.
| Platform | Target audience | Pricing model | Code-switching support |
| --- | --- | --- | --- |
| Gladia | CCaaS builders, meeting assistants, voice agents | Starter $0.61/hr async, Growth as low as $0.20/hr, all features at base rate | 100+ languages, automatic |
| AssemblyAI | Developer teams | Base rate plus per-feature add-ons | Universal-Streaming: 6 languages (EN, ES, FR, DE, IT, PT), intra-utterance |
| Deepgram | Voice pipeline developers | Per-minute; diarization billed separately (~$0.0020/min) | Flux Multilingual: 10 languages, native mid-call |
Multilingual accuracy via Gladia's code-switching
Solaria-1 covers 100+ languages, with automatic code-switching detection working across the full language set in both async and real-time modes. When a speaker shifts from English to Tagalog mid-sentence, Solaria-1 detects the change automatically without a language parameter reset. The same applies to Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi, which are primary traffic languages for South Asian and Southeast Asian BPO operations, not edge cases.
Claap, a video collaboration platform for product teams, reports 1-3% WER in production on Gladia's infrastructure.
Here's what a typical Gladia API response looks like when Solaria-1 detects a language switch in a call recording:
{
  "utterances": [
    {
      "speaker": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "So the issue is with your last invoice,",
      "language": "en",
      "confidence": 0.97
    },
    {
      "speaker": 1,
      "start": 3.6,
      "end": 7.2,
      "text": "si, el total no es correcto para este mes",
      "language": "es",
      "confidence": 0.95
    }
  ]
}
Each utterance carries a language label alongside speaker attribution from pyannoteAI's Precision-2 diarization model.
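Because each utterance carries its own language label, a post-call pipeline can flag where the language changes and how much talk time each language accounts for. The following is a minimal sketch that assumes the response shape shown in the example above; the helper names `find_language_switches` and `seconds_per_language` are hypothetical and not part of any SDK, and a real payload may carry additional fields.

```python
import json

# Sample shaped like the example response above; a real payload may differ.
SAMPLE_RESPONSE = """
{"utterances": [
  {"speaker": 0, "start": 0.0, "end": 3.5, "language": "en",
   "text": "So the issue is with your last invoice,", "confidence": 0.97},
  {"speaker": 1, "start": 3.6, "end": 7.2, "language": "es",
   "text": "si, el total no es correcto para este mes", "confidence": 0.95}
]}
"""


def find_language_switches(utterances):
    """List every point where consecutive utterances differ in language."""
    ordered = sorted(utterances, key=lambda u: u["start"])
    switches = []
    for previous, current in zip(ordered, ordered[1:]):
        if previous["language"] != current["language"]:
            switches.append({
                "at": current["start"],
                "from": previous["language"],
                "to": current["language"],
                "speaker": current["speaker"],
            })
    return switches


def seconds_per_language(utterances):
    """Sum utterance durations per language label."""
    totals = {}
    for u in utterances:
        duration = u["end"] - u["start"]
        totals[u["language"]] = totals.get(u["language"], 0.0) + duration
    return totals


utterances = json.loads(SAMPLE_RESPONSE)["utterances"]
switches = find_language_switches(utterances)
talk_time = seconds_per_language(utterances)
print(switches)
print(talk_time)
```

This only sees switches at utterance boundaries. A switch inside a single utterance would be visible here only if the API splits that utterance or exposes finer-grained language labels, which is worth checking on your own recordings.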
AssemblyAI: multilingual WER benchmarks
AssemblyAI offers speech-to-text models with automatic language detection, and its Universal-Streaming model handles six languages (English, Spanish, French, German, Italian, and Portuguese) with intra-utterance code-switching. Teams moving an existing AssemblyAI integration to Gladia can map the API and parameter differences using the AssemblyAI-to-Gladia migration guide.
Pricing reality: AssemblyAI offers a base transcription rate with separate feature add-ons. Check current pricing documentation for the full feature set your production pipeline requires.
Data policy: Review AssemblyAI's data usage terms for model training policies and opt-out options.
Deepgram: low-latency live transcription
Deepgram's Nova-3 delivers sub-300ms streaming latency via WebSocket per its documentation, which makes it a candidate for voice agent pipelines where STT output feeds directly into an LLM response loop. Its Flux Multilingual model adds native mid-call code-switching across 10 languages. Per Deepgram's public pricing, speaker diarization is billed as a separate add-on (about $0.0020/min, roughly $0.12/hour) on top of the Nova-3 base rate. Teams evaluating both providers should weigh that add-on cost alongside the API differences covered in the Deepgram-to-Gladia migration guide.
Data policy: Review Deepgram's data usage terms for model training policies and the per-request opt-out parameter (mip_opt_out=true).
Code-switching: native feature vs. afterthought
Code-switching support on most ASR platforms was added as an afterthought to a model architecture designed for monolingual audio. Understanding the technical distinction between genuine code-switching detection and simple multi-language transcription is one of the most consequential decisions a product team makes during vendor selection.
Intra-sentence code-switching support
The hardest version of code-switching is intra-sentence: a speaker completes half a thought in one language and finishes it in another. This is common across South Asia, Southeast Asia, and Latin America, and it's where most transcription APIs fail. A model requiring a language parameter at session initiation cannot handle this at all. A model detecting language per utterance will segment the switch but may lose context at the boundary. Solaria-1's code-switching detection works automatically across all supported languages, which matters because code-switching failures in contact centers carry direct downstream costs for global CCaaS builders.
The only reliable validation test is your own production audio. Take 10-20 hours of real call recordings from your highest-traffic non-English markets, run them through each candidate API with all features enabled, and compute WER against a human-verified transcript. Focus specifically on entity accuracy: wrong names, phone numbers, and product codes produce silent failures downstream in CRM writes and coaching scores.
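The WER computation in that test can be sketched in a few lines. This is a minimal illustration rather than a reference scorer: it uses word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, and the normalization step (lowercasing, stripping punctuation) is an assumption you should align with your own transcription style guide. The function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using two rolling rows."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER for one recording: edits / reference word count."""
    ref_words = normalize(reference)
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, normalize(hypothesis)) / len(ref_words)


def corpus_wer(pairs):
    """WER over many (reference, hypothesis) pairs, weighted by length."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = normalize(reference)
        total_edits += edit_distance(ref_words, normalize(hypothesis))
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("no reference words in corpus")
    return total_edits / total_words


reference = "si, el total no es correcto para este mes"
hypothesis = "si el total es correcto para este mes"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In the sample pair the hypothesis drops a single word ("no") out of nine, which reverses the meaning of the sentence while moving WER by only about 11 points. That is one reason to inspect errors by type and not rely on the aggregate figure alone.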
Improving code-switching accuracy
Code-switching compounds entity recognition errors because the model must simultaneously handle the language transition and resolve domain-specific terms such as product names, brand names, and industry jargon that may appear in either language, often with non-standard spellings or romanizations. When a speaker shifts from English to Tagalog mid-sentence while naming an insurance product, the model is managing both the language boundary and an out-of-vocabulary term at the same moment.
Custom vocabulary lets you inject product names, brand terms, and industry-specific phrases that the base model may not have encountered in training data. For contact centers in insurance, financial services, or healthcare, custom vocabulary improves entity accuracy directly. Custom spelling handles brand names with non-standard romanizations. Both features are included in Gladia's Starter and Growth plans without additional fees.
Global language and accent coverage
A language count is a marketing number. What matters is which languages are supported at production-grade accuracy, and whether those languages include the specific ones your users actually speak.
| Platform | Code-switching languages | Code-switching type | Unique languages | Data retraining default |
| --- | --- | --- | --- | --- |
| Gladia (Solaria-1) | 100+ languages supported | Automatic, async and real-time | 42 exclusive languages (see full list) | Not used on Growth/Enterprise (no opt-out required) |
| Gladia (Solaria-3) | EN, FR, DE, ES, IT | Async only | EN, FR, DE, ES, IT (depth model, not breadth) | Not used on Growth/Enterprise (no opt-out required) |
| AssemblyAI (Universal-Streaming) | 6 (EN, ES, FR, DE, IT, PT) | Intra-utterance, single model | Not published | Trains by default, opt-out on paid plans only |
| Deepgram (Flux Multilingual) | 10 | Native mid-call | Not published | Model Improvement Program by default; opt-out via mip_opt_out=true per request |
| CallMiner | Not published | Not published | Not published | Not published |
| Genesys | Not published | Within Genesys Cloud | Not published | Review vendor terms |
| Observe.AI | Not published | Not published | Not published | Review vendor terms |
The 42 languages exclusive to Gladia include Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, Marathi, Haitian Creole, Maori, and Javanese. For contact centers running BPO operations in the Philippines, India, or Indonesia, these are primary traffic languages, not edge cases.
Automatic language detection in Solaria-1 is designed to identify language correctly despite strong accents, addressing a common failure point in multilingual transcription systems.
Understanding the real cost: base rates and add-ons
The table below shows base rates and how diarization is billed, since a production call analytics pipeline typically needs diarization plus sentiment and summarization. Adding entity detection to AssemblyAI pushes the effective rate higher.
API per-hour costs and add-on structure
| Provider | Base rate (async) | Diarization | Notes |
| --- | --- | --- | --- |
| Gladia Starter | $0.61/hr | Included | All audio intelligence at base rate |
| Gladia Growth | As low as $0.20/hr | Included | All features included, volume commitment |
| AssemblyAI | Base rate plus per-feature add-ons | Billed separately | Sentiment and summarization also add-ons |
| Deepgram Nova-3 | ~$0.46/hr monolingual, ~$0.55/hr multilingual ($0.0077–$0.0092/min) | +~$0.12/hr ($0.0020/min) | Per Deepgram's public pricing |
When comparing pricing, model your costs at realistic production volumes with all features enabled. Some platforms charge separately for diarization, sentiment, and summarization, while others bundle these features at the base rate.
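A rough way to run that model is to add every per-hour add-on to the base rate and multiply by monthly volume. The sketch below uses only the rates quoted in this guide and makes several simplifying assumptions: rates are flat (no volume tiers), Gladia Growth is priced at its quoted floor of $0.20/hr, and Deepgram's figure covers multilingual Nova-3 plus diarization only, since this guide does not list its other add-on prices. AssemblyAI is omitted because no numeric rates are given here. The `Plan` class is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Plan:
    name: str
    base_per_hour: float
    # Per-hour price of each add-on billed on top of the base rate.
    addons_per_hour: dict = field(default_factory=dict)

    def effective_rate(self):
        """Base rate plus every add-on, in dollars per audio hour."""
        return self.base_per_hour + sum(self.addons_per_hour.values())

    def monthly_cost(self, hours):
        """Flat-rate estimate; ignores volume tiers and commitments."""
        return hours * self.effective_rate()


# Rates as quoted in this guide; confirm against current vendor pricing pages.
GLADIA_STARTER = Plan("Gladia Starter", 0.61)
GLADIA_GROWTH_FLOOR = Plan("Gladia Growth (floor rate)", 0.20)
DEEPGRAM_MULTI = Plan("Deepgram Nova-3 multilingual", 0.55, {"diarization": 0.12})

PLANS = [GLADIA_STARTER, GLADIA_GROWTH_FLOOR, DEEPGRAM_MULTI]
VOLUMES = [1_000, 10_000, 100_000]  # audio hours per month

for plan in PLANS:
    for hours in VOLUMES:
        cost = plan.monthly_cost(hours)
        print(f"{plan.name:<30}{hours:>9,} h  ${cost:>12,.2f}")
```

To extend the comparison, add each further feature you need (sentiment, summarization, entity detection) to `addons_per_hour` using the vendor's current price list, so the effective rate reflects the full production pipeline.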
Accuracy benchmarks from independent tests
Vendor-published accuracy numbers usually come from curated studio recordings, which is why they rarely predict production performance. An open, reproducible benchmark on realistic audio is a better starting point for evaluation.
Gladia's open-source async benchmark evaluates Solaria-1 against eight providers across seven datasets and 74+ hours of audio, using realistic production conditions (background noise, accented speech, multi-speaker calls) rather than controlled datasets. On that basis, Solaria-1 delivers on average 29% lower WER and 3x lower DER on conversational speech than alternatives.
Key considerations when evaluating multilingual call analytics platforms
What WER should I expect on noisy call center audio?
WER on noisy contact center audio is consistently higher than clean-studio benchmarks suggest. As a reference point: on the Earnings22 financial calls dataset (real business audio), Solaria-3 achieves 6.4% WER, the only model under 7% in its benchmark cohort. On Switchboard conversational speech, Solaria-3 reaches 33.9% WER, the only model under 35%. For real customer English recordings, Solaria-3 reaches 9.6% WER vs 9.9% for ElevenLabs, 10.0% for AssemblyAI, and 10.7% for Deepgram.
In production, Claap reports 1–3% WER on meeting audio through Gladia. The acceptable threshold depends on the downstream use case: CRM data population requires lower WER than summarization, because errors in named entities and numbers compound into data quality failures. For numerical accuracy on policy numbers and financial data, your test set must include real telephony recordings from your production queues, because clean-audio benchmarks do not predict this.
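Aggregate WER can hide exactly these entity errors, so it helps to score them separately. The following sketch assumes you have a hand-annotated list of critical entities (names, policy numbers, phone numbers) for each test call and checks whether each one survives in the transcript. The `entity_recall` helper and the sample entities are hypothetical. Matching is done on text with punctuation and spacing removed, which tolerates formatting differences such as "PX-4471" versus "PX 4471" but can produce false matches for very short entities.

```python
import re


def squash(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[\W_]+", "", text.lower())


def entity_recall(expected_entities, hypothesis):
    """Return (recall, missed) for annotated entities in one transcript."""
    if not expected_entities:
        raise ValueError("no annotated entities for this recording")
    haystack = squash(hypothesis)
    missed = [e for e in expected_entities if squash(e) not in haystack]
    recall = (len(expected_entities) - len(missed)) / len(expected_entities)
    return recall, missed


# Hypothetical annotation and transcript for one call.
expected = ["Maria Santos", "PX-4471", "555-0142"]
transcript = (
    "Thanks Maria Santos, policy PX 4471 is active, "
    "we will call you back on 555 0124."
)

recall, missed = entity_recall(expected, transcript)
print(f"entity recall: {recall:.2f}, missed: {missed}")
```

In the sample, two transposed digits in the phone number change only one word of the transcript, yet the callback number written to the CRM would be wrong. Tracking entity recall next to WER makes that class of failure visible.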
API integration timeframes
Production integrations move quickly on the Gladia Python and JavaScript SDKs, with native support for Pipecat, LiveKit, Twilio, and Retell documented in the API reference. Direct Slack access to Gladia engineers compresses evaluation cycles from weeks to days.
"The team is very reactive and helpful. The product works great." - Robin L. on G2
Data usage policy
Default behavior matters more than opt-out availability.
- Gladia: On the Starter plan, customer data can be used for model training by default. On Growth and Enterprise plans, customer data is never used for model training, and no opt-out action is required.
- AssemblyAI: Customer data is licensed for model training by default. Opt-out is available on paid plans only, must be actively configured, and applies going forward.
- Deepgram: Audio is enrolled in the Model Improvement Program by default. Opting out requires adding mip_opt_out=true to every API request, which forgoes the standard pricing discount.
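Because a per-request opt-out is easy to forget in one code path, it can help to build request URLs in a single place and verify the flag before sending. The sketch below only constructs and checks a URL; it makes no network call. The `mip_opt_out=true` parameter comes from the text above, while the endpoint path and the `model` and `diarize` parameters reflect my understanding of Deepgram's public API and should be confirmed against its current reference. The helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed pre-recorded endpoint; verify against Deepgram's API reference.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_listen_url(model="nova-3", diarize=True, opt_out_of_training=True):
    """Build a request URL with the training opt-out applied by default."""
    params = {"model": model, "diarize": str(diarize).lower()}
    if opt_out_of_training:
        params["mip_opt_out"] = "true"
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


def is_opted_out(url):
    """Check that a request URL carries mip_opt_out=true."""
    query = parse_qs(urlsplit(url).query)
    return query.get("mip_opt_out") == ["true"]


url = build_listen_url()
print(url)
print(is_opted_out(url))
```

Calling `is_opted_out` in a unit test or a request wrapper gives a cheap guard against a new endpoint or retry path silently enrolling audio in the default program.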
How do Tagalog and Hindi transcription perform?
Gladia's Solaria-1 covers Tagalog, Hindi, Bengali, Punjabi, Tamil, Urdu, and Marathi, most of which are among the languages Gladia lists as unavailable from any other API-level STT provider. For contact centers with Philippines or India BPO operations, the practical risk is that a competitor nominally "supporting" one of these languages may have insufficient training data depth, producing elevated WER on regional accents and colloquial speech. The only reliable validation is testing on production recordings from those regions. Run 5–10 hours of real calls through each candidate API and measure WER against a human-verified transcript, with specific focus on entity accuracy (names, numbers, product codes) rather than aggregate WER alone. The full language list is in the supported languages documentation.
Start with 10 free hours and test Gladia on your own multilingual audio, with Gladia engineers on Slack for edge cases.
FAQs
What is the difference between code-switching and multilingual transcription?
Multilingual transcription means a model can process audio in more than one language when you declare the language at session start. Code-switching means the model detects and handles mid-conversation language changes automatically. The distinction matters because real bilingual call center interactions require automatic detection, not just multi-language support.
Does Gladia use customer audio to train its models?
On Growth and Enterprise plans, customer audio is never used for model training and no opt-out action is required. On the Starter plan, customer data can be used for model training by default.
What is the WER difference between Gladia and competitors on conversational speech?
In Gladia's open, reproducible benchmark, Solaria-1 delivers on average 29% lower WER and 3x lower DER than alternatives on conversational speech.
Which languages in Gladia's set are not available elsewhere?
Gladia covers languages not found in any other API-level STT provider, including Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, Marathi, Haitian Creole, Maori, and Javanese. The full supported language list is in Gladia's public API reference.
Is speaker diarization available in real-time mode?
No. Gladia's diarization feature, powered by pyannoteAI's Precision-2 model, is available in async workflows only. The diarization documentation and the diarization deep-dive cover the technical configuration in detail.
Key terms glossary
Word error rate (WER): A metric measuring transcription accuracy by comparing the transcript to a reference transcription. Lower WER indicates higher transcription accuracy.
Diarization error rate (DER): A metric measuring the accuracy of speaker attribution in an audio recording, including detection of speech segments and assignment to speakers.
Code-switching: The practice of alternating between languages within a conversation or sentence. Automatic code-switching detection handles this without requiring manual language selection.
Speaker diarization: The process of partitioning an audio recording into segments by speaker, identifying who spoke when in a conversation.
Total cost of ownership (TCO): The full cost of a vendor integration including base transcription rate, all feature add-ons, data egress fees, and operational overhead such as integration maintenance and cost monitoring at scale.
Data residency: The requirement that data be stored and processed within a specific geographic boundary for regulatory compliance. Gladia offers EU-west and US-west region options with on-premises deployment for strict residency requirements.