API Comparison Table



AI call analytics platforms vs. STT APIs: which is right for multilingual transcription?

Published on June 19, 2026

by Ani Ghazaryan

TL;DR: Multilingual transcription at scale reveals complexities beyond English benchmarks: while some modern platforms handle code-switching, many models still treat language switches as error states, producing high WER and transcript degradation. Some platforms charge separately for diarization, sentiment, and summarization features. Gladia's Growth plans bundle those features at the base rate. Gladia's Solaria-1 model is benchmarked on conversational speech, with native code-switching across 100+ languages.

Teams building CCaaS platforms, shipping meeting assistants, or running contact center operations often encounter unexpected challenges when evaluating transcription APIs on English accuracy benchmarks and then scaling to multilingual markets. Modern platforms vary in their handling of code-switching when bilingual support agents switch languages mid-call, and pricing structures differ across vendors for features like diarization, translation, and entity extraction. This guide compares Tier 1 full-stack platforms and Tier 2 API providers across language coverage, code-switching, accent handling, and pricing at realistic production volumes, so you can make a defensible architecture decision before your unit economics break at scale.

Key metrics for multilingual call analytics selection

Before comparing vendors, you need agreement on what you're actually measuring. English WER in a studio environment may not reliably predict production performance on calls with background noise, multiple speakers, or code-switching scenarios common in global contact centers.

Language-specific accuracy

A provider's headline language count is a marketing number, not a production metric. There are two distinct dimensions that actually matter: breadth and depth.

Breadth means how many languages a model supports at all. Depth means how accurately it transcribes a specific language under real conditions: specific accents, noisy audio, code-switching mid-sentence.

These are not the same thing, and optimising for one does not guarantee the other. A model that nominally supports 100 languages may have trained primarily on English and a handful of European languages. WER on Tagalog, Bengali, or Urdu can be materially higher than headline accuracy figures suggest.

Gladia runs two models that address these dimensions differently:

Solaria-1 is the breadth model: 100+ languages, true mid-conversation code-switching, real-time streaming. Best for global multilingual products, rare languages, and contact centers with diverse caller demographics across Southeast Asia, South Asia, and Latin America.

Solaria-3 is the depth model: highest accuracy on real-world European business audio across English, French, German, Spanish, and Italian. Best for contact centers and CCaaS platforms serving European markets with noisy, accented, or conversational audio. Solaria-3 ranks #1 against AssemblyAI, ElevenLabs, Deepgram, Mistral, and Speechmatics on real customer recordings: 6.4% WER on Earnings22 financial calls (only model under 7%), 33.9% on Switchboard conversational speech (only model under 35%). Solaria-3 is async only.

The real question is what WER looks like on your specific language, accent, and audio condition. The only reliable answer comes from testing on your own production recordings.

Accuracy for diverse accents

Accent handling is a training data problem. Models trained primarily on specific accent groups may produce elevated WER on diverse accents even when the language is nominally "supported." Contact centers processing calls from regions with distinct accents need to test WER on audio from those regions specifically, not rely on aggregate benchmark numbers that may weight certain datasets heavily.

Latency and throughput trade-offs

The threshold that matters depends on the use case. If your workflow is post-call analytics, the relevant metric is throughput: how long it takes to process a 10-minute recording. Live agent assist and real-time applications typically require final transcript latency under 300ms to avoid disrupting conversation flow. Gladia supports both: async processing and real-time transcription at approximately 300ms latency, with async as the primary strength for call analytics workflows.

Unit economics for AI platforms

The total cost of ownership calculation must include every feature you'll actually use in production. Base rate comparisons look competitive until diarization, translation, sentiment, and NER each carry a separate line item and the final per-hour cost is 2-3x the advertised headline. Model your costs at 1,000, 10,000, and 100,000 hours per month with all features enabled before committing.

Tier 1: Full-stack call analytics platforms

Full-stack platforms give you a pre-built analytics UI, workflow tools for agent coaching, and QA dashboards built for contact center operations teams who want insights without a development investment. The trade-off is pricing structures that may not flex well with audio volume, and language depth documentation that typically lags dedicated STT infrastructure.

CallMiner: unit costs and language support

CallMiner is an enterprise conversation analytics platform with modules for agent coaching and live guidance, with integration APIs for third-party telephony systems.

Where it fits: Large enterprise contact centers that need compliance-focused QA with human review workflows.

Where it falls short: CallMiner pricing is not listed on their public website as of this writing. Contact their sales team directly for cost modeling at your target volume.

Genesys: multilingual accuracy benchmarks

Genesys delivers conversation intelligence within Genesys Cloud for sentiment, empathy, and topic detection. The native integration between speech analytics and workforce engagement management means QA scoring and coaching feed directly from transcription without additional integration work.

Where it fits: Contact centers already on Genesys Cloud who need analytics tightly coupled to their routing and WEM workflows.

Where it falls short: Genesys analytics is an extension of the Genesys platform. If your call recording infrastructure sits outside Genesys, or you're building a custom CCaaS product, that coupling becomes a constraint.

Observe.AI: multilingual accent support and latency

Observe.AI is a conversation intelligence platform that automates QA scoring, enabling performance evaluation across interaction volumes that would overwhelm human QA teams.

Where it fits: Mid-market contact centers automating QA scoring and reducing sampling bias in performance reviews.

Where it falls short: Teams targeting South Asian or Southeast Asian language traffic should test multilingual performance and code-switching capability specifically before committing.

Why choose end-to-end AI analytics?

Full-stack platforms make sense when you're buying a solution for a contact center operations team, not building infrastructure for a product. When your goal is to ship a custom CCaaS feature, a meeting assistant, or a post-call analytics pipeline as part of your own product, a full-stack platform becomes a dependency that limits both your flexibility and your unit economics at scale.

API accuracy for multilingual call transcription

API-first providers give you the transcription and audio intelligence layer, and you build the product logic on top. This adds development work, but you control the pipeline, own the cost structure, and can optimize for the specific audio conditions your users generate.

Platform	Target audience	Pricing model	Code-switching support
Gladia	CCaaS builders, meeting assistants, voice agents	Starter $0.61/hr async, Growth as low as $0.20/hr, all features at base rate	100+ languages, automatic
AssemblyAI	Developer teams	Base rate plus per-feature add-ons	Universal-Streaming: 6 languages (EN, ES, FR, DE, IT, PT), intra-utterance
Deepgram	Voice pipeline developers	Per-minute; diarization billed separately (~$0.0020/min)	Flux Multilingual: 10 languages, native mid-call


Multilingual accuracy via Gladia's code-switching

Solaria-1 covers 100+ languages, with automatic code-switching detection working across the full language set in both async and real-time modes. When a speaker shifts from English to Tagalog mid-sentence, Solaria-1 detects the change automatically without a language parameter reset. The same applies to Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi, which are primary traffic languages for South Asian and Southeast Asian BPO operations, not edge cases.

Claap, a video collaboration platform for product teams, reports 1-3% WER in production on Gladia's infrastructure.

Here's what a typical Gladia API response looks like when Solaria-1 detects a language switch in a call recording:

{
  "utterances": [
    {
      "speaker": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "So the issue is with your last invoice,",
      "language": "en",
      "confidence": 0.97
    },
    {
      "speaker": 1,
      "start": 3.6,
      "end": 7.2,
      "text": "si, el total no es correcto para este mes",
      "language": "es",
      "confidence": 0.95
    }
  ]
}

Each utterance carries a language label alongside speaker attribution from pyannoteAI's Precision-2 diarization model.
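As a minimal sketch of how a pipeline might consume those per-utterance labels, the snippet below finds the points where the language changes and totals talk time per language. It assumes the response has exactly the shape of the sample above (a top-level `utterances` list with `start`, `end`, and `language` fields); the helper names `language_switches` and `seconds_by_language` are hypothetical and not part of any SDK.

```python
import json
from collections import defaultdict

# Sample payload with the same shape as the response shown above (assumed shape).
RESPONSE = """
{
  "utterances": [
    {"speaker": 0, "start": 0.0, "end": 3.5,
     "text": "So the issue is with your last invoice,",
     "language": "en", "confidence": 0.97},
    {"speaker": 1, "start": 3.6, "end": 7.2,
     "text": "si, el total no es correcto para este mes",
     "language": "es", "confidence": 0.95}
  ]
}
"""


def language_switches(utterances):
    """Return (start_time, from_lang, to_lang) for each change of language
    between consecutive utterances."""
    switches = []
    for prev, curr in zip(utterances, utterances[1:]):
        if prev["language"] != curr["language"]:
            switches.append((curr["start"], prev["language"], curr["language"]))
    return switches


def seconds_by_language(utterances):
    """Sum utterance durations per language label."""
    totals = defaultdict(float)
    for utt in utterances:
        totals[utt["language"]] += utt["end"] - utt["start"]
    return dict(totals)


utterances = json.loads(RESPONSE)["utterances"]
print(language_switches(utterances))
print(seconds_by_language(utterances))
```

Because this works at utterance granularity, it only sees switches that fall on utterance boundaries; a switch inside a single utterance would need word-level language labels, which the sample above does not show.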

AssemblyAI: multilingual WER benchmarks

AssemblyAI offers speech-to-text models with automatic language detection, and its Universal-Streaming model handles six languages (English, Spanish, French, German, Italian, and Portuguese) with intra-utterance code-switching. Teams moving an existing AssemblyAI integration to Gladia can map the API and parameter differences using the AssemblyAI-to-Gladia migration guide.

Pricing reality: AssemblyAI offers a base transcription rate with separate feature add-ons. Check current pricing documentation for the full feature set your production pipeline requires.

Data policy: Review AssemblyAI's data usage terms for model training policies and opt-out options.

Deepgram: low-latency live transcription

Deepgram's Nova-3 delivers sub-300ms streaming latency via WebSocket per its documentation, which makes it a candidate for voice agent pipelines where STT output feeds directly into an LLM response loop. Its Flux Multilingual model adds native mid-call code-switching across 10 languages. Per Deepgram's public pricing, speaker diarization is billed as a separate add-on (about $0.0020/min, roughly $0.12/hour) on top of the Nova-3 base rate, so teams evaluating both should weigh that cost alongside the API differences when migrating from Deepgram to Gladia.

Data policy: Review Deepgram's data usage terms for model training policies. Opting out of its Model Improvement Program requires passing mip_opt_out=true on each request.

Code-switching: native feature vs. afterthought

Code-switching support on most ASR platforms was added as an afterthought to a model architecture designed for monolingual audio. Understanding the technical distinction between genuine code-switching detection and simple multi-language transcription is one of the most consequential decisions a product team makes during vendor selection.

Intra-sentence code-switching support

The hardest version of code-switching is intra-sentence: a speaker completes half a thought in one language and finishes it in another. This is common across South Asia, Southeast Asia, and Latin America, and it's where most transcription APIs fail. A model requiring a language parameter at session initiation cannot handle this at all. A model detecting language per utterance will segment the switch but may lose context at the boundary. Solaria-1's code-switching detection works automatically across all supported languages, which matters because code-switching failures in contact centers carry a direct downstream cost for global CCaaS builders.

The only reliable validation test is your own production audio. Take 10-20 hours of real call recordings from your highest-traffic non-English markets, run them through each candidate API with all features enabled, and compute WER against a human-verified transcript. Focus specifically on entity accuracy: wrong names, phone numbers, and product codes produce silent failures downstream in CRM writes and coaching scores.
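The WER computation in that test can be sketched in a few lines. The version below is a minimal illustration, not a benchmark-grade scorer: it lowercases and splits on whitespace only, whereas a real evaluation would normally also normalize punctuation, numerals, and spelling variants before comparing. The function name `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + deletions + insertions)
    divided by the number of words in the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("the total is not correct", "the total is correct"))
```

Note that the dropped word in the usage line flips the meaning of the sentence while costing only one error, which is why aggregate WER alone can hide the failures that matter downstream.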

Improving code-switching accuracy

Code-switching compounds entity recognition errors because the model must simultaneously handle the language transition and resolve domain-specific terms such as product names, brand names, and industry jargon that may appear in either language, often with non-standard spellings or romanizations. When a speaker shifts from English to Tagalog mid-sentence while naming an insurance product, the model is managing both the language boundary and an out-of-vocabulary term at the same moment.

Custom vocabulary lets you inject product names, brand terms, and industry-specific phrases that the base model may not have encountered in training data. For contact centers in insurance, financial services, or healthcare, custom vocabulary improves entity accuracy directly. Custom spelling handles brand names with non-standard romanizations. Both features are included in Gladia's Starter and Growth plans without additional fees.

Global language and accent coverage

A language count is a marketing number. What matters is which languages are supported at production-grade accuracy, and whether those languages include the specific ones your users actually speak.

Platform	Code-switching languages	Code-switching type	Unique languages	Data retraining default
Gladia (Solaria-1)	100+ languages supported	Automatic, async and real-time	42 exclusive languages (see full list)	Not used on Growth/Enterprise (no opt-out required)
Gladia (Solaria-3)	EN, FR, DE, ES, IT	Async only	EN, FR, DE, ES, IT (depth model, not breadth)	Not used on Growth/Enterprise (no opt-out required)
AssemblyAI (Universal-Streaming)	6 (EN, ES, FR, DE, IT, PT)	Intra-utterance, single model	Not published	Trains by default, opt-out on paid plans only
Deepgram (Flux Multilingual)	10	Native mid-call	Not published	Model Improvement Program by default; opt-out via `mip_opt_out=true` per request
CallMiner	Not published	Not published	Not published	Not published
Genesys	Not published	Within Genesys Cloud	Not published	Review vendor terms
Observe.AI	Not published	Not published	Not published	Review vendor terms


The 42 languages exclusive to Gladia include Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, Marathi, Haitian Creole, Maori, and Javanese. For contact centers running BPO operations in the Philippines, India, or Indonesia, these are primary traffic languages, not edge cases.

Automatic language detection in Solaria-1 is designed to identify language correctly despite strong accents, addressing a common failure point in multilingual transcription systems.

Understanding the real cost: base rates and add-ons

The table below shows base rates and how diarization is billed, since a production call analytics pipeline typically needs diarization plus sentiment and summarization. Adding entity detection to AssemblyAI pushes the effective rate higher.

API per-hour costs and add-on structure

Provider	Base rate (async)	Diarization	Notes
Gladia Starter	$0.61/hr	Included	All audio intelligence at base rate
Gladia Growth	As low as $0.20/hr	Included	All features included, volume commitment
AssemblyAI	Base rate plus per-feature add-ons	Billed separately	Sentiment and summarization also add-ons
Deepgram Nova-3	~$0.46/hr monolingual, ~$0.55/hr multilingual ($0.0077–$0.0092/min)	+~$0.12/hr ($0.0020/min)	Per Deepgram's public pricing


When comparing pricing, model your costs at realistic production volumes with all features enabled. Some platforms charge separately for diarization, sentiment, and summarization, while others bundle these features at the base rate.
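A rough way to run that model is sketched below, using only the per-hour figures from the table above. Several simplifications are assumptions: the Gladia Growth entry uses the "as low as" floor rate, which depends on a volume commitment; the Deepgram entry adds diarization only, since the table gives no figures for other add-ons; and AssemblyAI is omitted because the table lists no numeric rate. The `Plan` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    base_per_hour: float          # USD per audio hour
    addons_per_hour: float = 0.0  # USD per audio hour for separately billed features

    def monthly_cost(self, hours: int) -> float:
        """Total monthly cost in USD at the given audio volume."""
        return round((self.base_per_hour + self.addons_per_hour) * hours, 2)


# Rates taken from the table above; see the stated assumptions.
PLANS = [
    Plan("Gladia Starter", 0.61),
    Plan("Gladia Growth (floor rate)", 0.20),
    Plan("Deepgram Nova-3 multilingual + diarization", 0.55, 0.12),
]

for hours in (1_000, 10_000, 100_000):
    for plan in PLANS:
        print(f"{hours:>7} h  {plan.name:<44} ${plan.monthly_cost(hours):>12,.2f}")
```

Replace the rates with the quotes you actually receive and add one `addons_per_hour` component for each feature your pipeline enables; the point of the exercise is to compare fully loaded rates rather than headline ones.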

Accuracy benchmarks from independent tests

Vendor-published accuracy numbers usually come from curated studio recordings, which is why they rarely predict production performance. An open, reproducible benchmark on realistic audio is a better starting point for evaluation.

Gladia's open-source async benchmark evaluates Solaria-1 against eight providers across seven datasets and 74+ hours of audio, using realistic production conditions (background noise, accented speech, multi-speaker calls) rather than controlled datasets. On that basis, Solaria-1 delivers on average 29% lower WER and 3x lower DER on conversational speech than alternatives.

Key considerations when evaluating multilingual call analytics platforms

What WER should I expect on noisy call center audio?

WER on noisy contact center audio is consistently higher than clean-studio benchmarks suggest. As a reference point: on the Earnings22 financial calls dataset (real business audio), Solaria-3 achieves 6.4% WER, the only model under 7% in its benchmark cohort. On Switchboard conversational speech, Solaria-3 reaches 33.9% WER, the only model under 35%. For real customer English recordings, Solaria-3 reaches 9.6% WER vs 9.9% for ElevenLabs, 10.0% for AssemblyAI, and 10.7% for Deepgram.

In production, Claap reports 1–3% WER on meeting audio through Gladia. The acceptable threshold depends on the downstream use case: CRM data population requires lower WER than summarization, because errors in named entities and numbers compound into data quality failures. For numerical accuracy on policy numbers and financial data, your test set must include real telephony recordings from your production queues, because clean-audio benchmarks do not predict this.
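One way to measure entity-level accuracy separately from aggregate WER is to check, for each call, whether the entities a human annotator marked in the reference appear in the machine transcript. The sketch below assumes you already have such a per-call list; the function names are hypothetical, and the matching is deliberately crude (it strips everything except letters and digits, so "4471-209" matches "4471 209" but could also match across unrelated word boundaries).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def entity_recall(reference_entities: list[str], hypothesis: str) -> tuple[float, list[str]]:
    """Return (fraction of reference entities found, list of missed entities)."""
    if not reference_entities:
        raise ValueError("no reference entities supplied")
    haystack = normalize(hypothesis)
    missed = [e for e in reference_entities if normalize(e) not in haystack]
    found = len(reference_entities) - len(missed)
    return found / len(reference_entities), missed


# Hypothetical annotated call: one digit of the policy number is wrong.
entities = ["4471-209", "Maria Santos"]
transcript = "policy number 4471 208 for Maria Santos"
print(entity_recall(entities, transcript))
```

In the usage example a single wrong digit barely moves WER, but it halves entity recall and would write the wrong policy number to a CRM.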

API integration timeframes

Production integrations move quickly on the Gladia Python and JavaScript SDKs, with native support for Pipecat, LiveKit, Twilio, and Retell documented in the API reference. Direct Slack access to Gladia engineers compresses evaluation cycles from weeks to days.

"The team is very reactive and helpful. The product works great." - Robin L. on G2

Data usage policy

Default behavior matters more than opt-out availability.

Gladia Starter: Customer data can be used for model training by default. Gladia Growth and Enterprise: Customer data is never used for model training, and no opt-out action is required.
AssemblyAI: Customer data is licensed for model training by default. Opt-out is available on paid plans only, must be actively configured, and applies going forward.
Deepgram: Audio is enrolled in the Model Improvement Program by default. Opting out requires adding mip_opt_out=true to every API request, which forgoes the standard pricing discount.
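Because a per-request opt-out is easy to forget on one code path, one defensive pattern is to add the flag in a single place that every request URL passes through. The sketch below shows that idea using only the standard library. The `mip_opt_out=true` parameter comes from the text above; the endpoint URL and `model` value in the usage line are illustrative assumptions to check against Deepgram's current documentation, and `with_training_opt_out` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_training_opt_out(url: str) -> str:
    """Return the URL with mip_opt_out=true set exactly once,
    preserving any other query parameters in their original order."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "mip_opt_out"  # drop any existing value so the flag is not duplicated
    ]
    pairs.append(("mip_opt_out", "true"))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Illustrative URL only; verify the endpoint and parameters before use.
print(with_training_opt_out("https://api.deepgram.com/v1/listen?model=nova-3"))
```

Routing every outbound request through one helper like this also gives you a single place to log or assert that the flag is present, which is useful evidence in a compliance review.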

How do Tagalog and Hindi transcription perform?

Gladia's Solaria-1 covers Tagalog, Hindi, Bengali, Punjabi, Tamil, Urdu, and Marathi, languages not available in any other API-level STT provider. For contact centers with Philippines or India BPO operations, this matters because competitors that nominally "support" these languages often have insufficient training data depth, producing elevated WER on regional accents and colloquial speech. The only reliable validation is testing on production recordings from those regions. Run 5–10 hours of real calls through each candidate API and measure WER against a human-verified transcript, with specific focus on entity accuracy (names, numbers, product codes) rather than aggregate WER alone. The full language list is in the supported languages documentation.

Start with 10 free hours and test Gladia on your own multilingual audio, with Gladia engineers on Slack for edge cases.

FAQs

What is the difference between code-switching and multilingual transcription?

Multilingual transcription means a model can process audio in more than one language when you declare the language at session start. Code-switching means the model detects and handles mid-conversation language changes automatically. The distinction matters because real bilingual call center interactions require automatic detection, not just multi-language support.

Does Gladia use customer audio to train its models?

On Growth and Enterprise plans, customer audio is never used for model training and no opt-out action is required. On the Starter plan, customer data can be used for model training by default.

What is the WER difference between Gladia and competitors on conversational speech?

On Gladia's open, reproducible benchmark, Solaria-1 delivers on average 29% lower WER and 3x lower DER than alternatives on conversational speech.

Which languages in Gladia's set are not available elsewhere?

Gladia covers languages not found in any other API-level STT provider, including Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, Marathi, Haitian Creole, Maori, and Javanese. The full supported language list is in Gladia's public API reference.

Is speaker diarization available in real-time mode?

No. Gladia's diarization feature, powered by pyannoteAI's Precision-2 model, is available in async workflows only. The diarization documentation and the diarization deep-dive cover the technical configuration in detail.

Key terms glossary

Word error rate (WER): A metric measuring transcription accuracy by comparing the transcript to a reference transcription. Lower WER indicates higher transcription accuracy.

Diarization error rate (DER): A metric measuring the accuracy of speaker attribution in an audio recording, including detection of speech segments and assignment to speakers.

Code-switching: The practice of alternating between languages within a conversation or sentence. Automatic code-switching detection handles this without requiring manual language selection.

Speaker diarization: The process of partitioning an audio recording into segments by speaker, identifying who spoke when in a conversation.

Total cost of ownership (TCO): The full cost of a vendor integration including base transcription rate, all feature add-ons, data egress fees, and operational overhead such as integration maintenance and cost monitoring at scale.

Data residency: The requirement that data be stored and processed within a specific geographic boundary for regulatory compliance. Gladia offers EU-west and US-west region options with on-premises deployment for strict residency requirements.
