Real-time vs async transcription for contact centers: When streaming is worth the cost

Published on June 19, 2026

by Ani Ghazaryan

TL;DR: The decision between real-time and asynchronous transcription is not a latency question; it is an architectural fit question. Treating async batch processing as a slower version of streaming misunderstands how both modes work and which workflows each actually serves. Most contact center workloads (post-call QA scoring, conversation intelligence, CRM enrichment, and compliance archiving) belong on async batch transcription, which accesses full conversational context, delivers lower Word Error Rates, and costs 20% less per hour than streaming. Reserve real-time WebSocket streaming for the narrow set of live-call use cases where sub-300ms latency is a functional requirement: live agent assist, IVR routing, and voice agents. Both modes are available through the same platform, so the choice is about fit and cost, not vendor switching.

Many Contact Center as a Service (CCaaS) platforms default to real-time streaming transcription across all workloads, reasoning that faster is always better. That assumption carries a roughly 25% per-hour price premium, based on industry pricing standards, and degrades the accuracy of every QA scorecard, CRM record, and coaching output that flows from those transcripts.

Treating async batch processing as a slower version of streaming is a fundamental misunderstanding of how the two modes work and which contact center workflows each actually serves. Getting this right affects transcript accuracy, which directly impacts every downstream system from QA scorecards to CRM enrichment.

Optimizing transcription costs through async workflows

Async transcription processes a recorded audio file via a REST API call and returns a complete, structured transcript once analysis is finished. For post-call workloads, this is entirely sufficient because no agent or system is waiting on the transcript while the call is live.
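The submit-then-wait pattern described above can be sketched as a small polling loop. This is a minimal illustration, not Gladia's client: `fetch_job` is a hypothetical callable standing in for an HTTP GET on a job status URL, and the `"status"` values are assumed names rather than a documented schema.

```python
import time
from typing import Any, Callable, Dict


def wait_for_transcript(
    fetch_job: Callable[[], Dict[str, Any]],
    poll_interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a hypothetical async transcription job until it finishes.

    `fetch_job` must return a dict with a "status" key. The values used
    here ("done", "error", anything else meaning "still running") are
    illustrative assumptions.
    """
    waited = 0.0
    while True:
        job = fetch_job()
        if job["status"] == "done":
            return job  # complete, structured transcript
        if job["status"] == "error":
            raise RuntimeError(f"transcription failed: {job.get('error', 'unknown')}")
        if waited >= timeout_s:
            raise TimeoutError(f"job not finished after {timeout_s:.0f}s")
        sleep(poll_interval_s)
        waited += poll_interval_s
```

Because nothing is waiting on the transcript while the call is live, a coarse polling interval (or a webhook, where the provider offers one) is usually enough for post-call work.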

Our Growth tier async transcription is as low as $0.20 per hour, with real-time streaming priced higher. At high volumes, this difference compounds each month. Our Growth and Enterprise pricing includes speaker diarization, translation, sentiment analysis, named entity recognition, and summarization in the base rate. Competitors like AssemblyAI charge separately: using AssemblyAI Universal-2 with diarization, sentiment, entity detection, and summarization can stack additional costs. Deepgram prices Audio Intelligence features per token rather than per minute, making cost projection for high-volume call centers inherently difficult.

Where batch transcription outperforms streaming

Batch processing submits the entire audio file before any output is generated. The model processes the full recording before committing to a transcript, which enables it to resolve phoneme ambiguities using fuller context. A wrong product code or agent name captured in the first pass silently corrupts every downstream record it touches: the CRM entry, the coaching score, the compliance log.

Real-time models operate under latency constraints that limit how much surrounding audio they can use per processing step. That constraint reduces their ability to resolve ambiguous speech segments, which is why batch processing often delivers lower WER than streaming. For post-call analytics, QA scoring, and CRM enrichment, that accuracy difference is not marginal: it is the difference between automated scoring you can trust and transcripts that require manual review to be usable.

Batch turnaround is not a bottleneck. At the typical rate of under one minute of processing per hour of audio (Table 1), a 10-minute call finishes in seconds. Aircall cut transcription time by 95% (from 30 minutes to 1.5 minutes per call) and now processes over 1M calls per week through our async pipeline.

Why batching outperforms real-time for diarization

Speaker diarization is powered by pyannoteAI's Precision-2 model, which requires the complete audio recording to produce accurate speaker attribution. The model needs access to the full conversation to reconcile overlapping speech or speaker turns that span the entire recording.

Our async benchmark covers 74+ hours of audio across 8 providers and shows Solaria-1 achieves on average 3x lower DER than alternatives. For European business and contact-center audio where both WER and DER matter, Solaria-3 delivers strong overall accuracy on async workflows.

Use cases requiring live transcription

Real-time assist for live calls

Agent assist is a key use case for real-time streaming. When a customer asks a billing question or raises a complaint, a live transcript fed to a knowledge-base retrieval layer can surface the correct response as an agent-assist card before the agent has to place the caller on hold or search manually, which is the condition that drives AHT up on complex call types. The output needs to arrive while the call is still active and the agent can act on it, and that is what makes WebSocket streaming architecturally justified for this workflow.

IVR routing and intent detection

AI-powered IVR can use the customer's spoken intent at the start of a call to route it to the correct queue or agent. Natural language understanding can extract intent from transcribed speech and match it to a destination, reducing transfers and shortening AHT. This is a real-time use case because the routing decision must happen before the call reaches an agent.

Industry benchmarking data puts average FCR at 70% across industries, with each 1% improvement in FCR correlating with approximately 1.4 points of NPS gain.

Defining real-time voice use cases

Real-time streaming is justified for specific CCaaS use cases. Everything else belongs on async.

Agent assist: Live knowledge-base retrieval and compliance prompts surfaced during the call.
IVR routing and intent detection: Sub-300ms classification to route the call before queue entry.
Voice agents: Conversational AI where the STT output feeds an LLM response within the same conversational turn.
Live captions and accessibility: Real-time captions for hearing-impaired agents or supervisors monitoring a call floor.

If your workflow is not on this list, defaulting to async will produce better transcripts at lower cost. Our guide on choosing the right automation use case provides a broader framework.

"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors... Clean API, easy to integrate and deploy to production." - Yassine R. on G2

Accuracy differences: choosing between Solaria-3 and Solaria-1

Solaria-3 benchmarks for post-call accuracy

Solaria-3 is our model optimized for real-world European business audio across English, French, German, Spanish, and Italian, designed for async workflows. It is the right choice for post-call QA scoring, conversation intelligence, and CRM enrichment where you need the lowest possible WER on the audio your contact center actually handles.

Benchmarks on contact-center-specific audio:

9.6% WER on real customer English audio, ranking #1 against AssemblyAI, ElevenLabs, Deepgram, Mistral Voxtral, and Speechmatics.
6.4% WER on Earnings22 financial call audio (the only model under 7% on this dataset).
33.9% WER on Switchboard conversational audio (the only model under 35% on this dataset).
26% improvement in WER over Solaria-1 on real English production audio.

Solaria-1 latency benchmarks for live calls

Solaria-1 is our recommended model for real-time streaming use cases. It delivers final transcripts at approximately 300ms latency, with partial updates returned in under 103ms, which keeps agent-assist cards and IVR routing within the latency budget for conversational AI.

Solaria-1 covers 100+ languages, including 42 that no other API-level STT provider supports: Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, Marathi, and others that matter specifically for BPO operations in Southeast Asia and South Asia. It also supports native code-switching, detecting mid-conversation language changes in real time without producing garbled output or a broken session.

Cost comparison: async vs real-time at contact center scale

Table 1: Real-time vs asynchronous STT comparison

Feature	Real-time streaming	Asynchronous batch
Architecture	WebSocket (stateful)	REST API (stateless)
Latency	~300ms final, <103ms partials	Under 1 min of processing per hour of audio (typical)
Processing method	Incremental output	Complete audio file before output
Speaker diarization	Limited availability	pyannoteAI Precision-2 (async-only)
Primary use cases	Agent assist, IVR, voice agents	QA scoring, CRM enrichment, compliance
Primary model	Solaria-1	Solaria-3 (EU business audio), Solaria-1 (broader language coverage)

Table 2: Unit economics cost projection (Growth plan, all-inclusive)

Monthly volume	Gladia async (Growth tier)	Gladia real-time (Growth tier)	Competitor bundled est.
1,000 hours	$200	$250	Higher
5,000 hours	$1,000	$1,250	Significantly higher
10,000+ hours	$2,000+	$2,500+	Contact for quote

Competitor pricing structures vary, with some providers charging separately for features like diarization, sentiment, entity detection, and summarization, per pricing analyses. Our figures on Growth tier include all those features at the base rate.
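The Growth tier arithmetic behind Table 2 can be reproduced with a short sketch. The rates are the $0.20 (async) and $0.25 (real-time) per-hour figures quoted in this article; the function names are illustrative, and actual invoices may differ with plan and volume.

```python
ASYNC_RATE_USD_PER_HOUR = 0.20     # Growth tier async rate quoted in this article
REALTIME_RATE_USD_PER_HOUR = 0.25  # Growth tier real-time rate quoted in this article


def monthly_cost(hours: float, rate_usd_per_hour: float) -> float:
    """Monthly spend in USD for a given audio volume and hourly rate."""
    return round(hours * rate_usd_per_hour, 2)


def streaming_vs_async(hours: float) -> dict:
    """Compare real-time and async spend for the same monthly volume."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    async_usd = monthly_cost(hours, ASYNC_RATE_USD_PER_HOUR)
    realtime_usd = monthly_cost(hours, REALTIME_RATE_USD_PER_HOUR)
    extra_usd = round(realtime_usd - async_usd, 2)
    return {
        "async_usd": async_usd,
        "realtime_usd": realtime_usd,
        "extra_usd": extra_usd,
        # Same gap, two viewpoints: premium relative to async,
        # discount relative to real-time.
        "streaming_premium_pct": round(100 * extra_usd / async_usd, 1),
        "async_discount_pct": round(100 * extra_usd / realtime_usd, 1),
    }
```

Running `streaming_vs_async(10_000)` gives $2,000 for async against $2,500 for real-time, a $500 monthly gap. The same gap reads as a 25% premium for streaming or a 20% discount for async, depending on which rate is the baseline.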

WebSocket infrastructure carries operational considerations that REST APIs do not: connections are persistent and stateful, so scaling works differently from stateless REST. For post-call workflows specifically, async batch processing via REST delivers the lowest WER on your post-call audio (9.6% on real English recordings for Solaria-3) alongside simpler infrastructure, making it the appropriate choice for QA and CRM enrichment where real-time latency is not required.

On data governance: our Growth and Enterprise plans never use customer data for model training, with no opt-out action required. On the Starter plan, data may be used for training by default. For regulated contact centers handling healthcare or financial services audio, paid plans provide data protection guarantees. Our compliance stack covers GDPR, HIPAA, SOC 2 Type II, and ISO 27001.

How to map transcription modes to CX workflows

Prioritize async for QA and CRM data

Automated QA scoring depends on transcript accuracy. When the transcript contains wrong names, misattributed speakers, or missed compliance phrases, QA scores can mislead leadership rather than guide coaching. Our guide to business call transcript analysis techniques covers the downstream data requirements in detail.

WER directly determines which downstream systems you can trust. A wrong name silently corrupts a CRM entry. A missed disclosure creates a compliance gap that surfaces during audit, not during QA sampling. Reliably capturing what a support call should include requires transcript accuracy that async workflows on Solaria-3 deliver and real-time streaming cannot match for this audio type. Our async pipeline produces structured JSON output covering entity extraction, speaker attribution, sentiment, and summaries in a single API call.

Limit streaming to agent assist needs

Keep WebSocket connections reserved for interactions where an agent is live on the call and can act on the transcript in real time. For everything that happens after the call ends, REST is the right architecture for simplicity and cost-effectiveness. This is the configuration that modernized contact center architectures use at production scale.

Should you choose real-time or async transcription?

The architectural question resolves to a single test: does the output need to arrive while the call is live and an agent or system can act on it? If yes, real-time streaming on Solaria-1. If the output feeds a QA scorecard, a CRM record, a coaching dashboard, or a compliance archive, async batch on Solaria-3.
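The single test above can be written down as a tiny routing helper. This is only a sketch of the decision rule: the workflow labels are hypothetical identifiers chosen for illustration, and the mode, transport, and model pairings are the ones this article recommends.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    mode: str
    transport: str
    model: str


REALTIME = Recommendation("real-time streaming", "WebSocket", "Solaria-1")
# Solaria-3 targets English, French, German, Spanish, and Italian;
# the article recommends Solaria-1 for other languages.
ASYNC = Recommendation("async batch", "REST", "Solaria-3")

# Hypothetical labels for the live-call use cases listed in this article.
LIVE_CALL_WORKFLOWS = {"agent_assist", "ivr_routing", "voice_agent", "live_captions"}


def recommend(output_needed_during_call: bool) -> Recommendation:
    """Apply the single test: must the output arrive while the call is live?"""
    return REALTIME if output_needed_during_call else ASYNC


def recommend_for_workflow(workflow: str) -> Recommendation:
    """Anything not on the live-call list defaults to async."""
    return recommend(workflow in LIVE_CALL_WORKFLOWS)
```

For example, `recommend_for_workflow("qa_scoring")` returns the async recommendation, and so does any label the helper does not recognize, which mirrors the default-to-async guidance.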

Contact center operations that default to streaming for all workloads pay a price premium for post-call audio, may accept higher WER than their QA workflows can tolerate, and maintain WebSocket infrastructure where simpler async REST would suffice for the majority of their call volume. The greater risk is trusting downstream systems built on lower-accuracy transcripts: a wrong agent name corrupts the CRM record, a missed compliance phrase creates an audit gap, a misattributed speaker inverts a coaching score. Accurate transcription is foundational to service consistency in contact centers.

For teams currently self-hosting or using legacy providers, migration paths to a unified REST and WebSocket API are available, with documentation for providers like Deepgram and AssemblyAI. Multiple customers report sub-24-hour integration times, and our engineers are available directly, not via a ticket queue.

Start with 10 free hours on our Starter plan to test both modes against your own contact center audio. Then test Solaria-3 on your noisiest, most accented BPO recordings and compare WER against what your current provider returns.

FAQs

Does Gladia support real-time speaker diarization?

No. High-quality speaker diarization powered by pyannoteAI's Precision-2 model is only available in our asynchronous batch workflows. Real-time streaming uses different approaches optimized for low-latency output, and for applications requiring accurate speaker attribution, async batch processing is recommended.

What is the latency of Gladia's real-time transcription?

Our real-time streaming runs on Solaria-1 and delivers a final transcript latency of approximately 300ms, with partial transcript updates returned in under 103ms.

Are customer audio files used to train Gladia's models?

On our Growth and Enterprise plans, customer data is never used for model training, and no opt-out action is required. On our Starter plan, data may be used for training by default.

Which Gladia model should I use for European contact center audio?

Use Solaria-3 for async post-call workflows covering English, French, German, Spanish, and Italian, where it ranks #1 on real customer English audio at 9.6% WER and achieves 6.4% WER on Earnings22 financial call audio. Use Solaria-1 for real-time streaming and for languages outside the five Solaria-3 covers.

What is the cost difference between async and real-time at 10,000 hours per month?

On our Growth tier, async can run as low as $0.20 per hour and real-time at $0.25 per hour. At 10,000 hours per month with these rates, the difference would be $500 monthly. Compared to competitor bundled pricing, our Growth tier pricing can deliver substantial savings at high volumes.

Can I run both async and real-time transcription through a single Gladia integration?

Yes. Our platform covers both modes through a unified integration approach, with async calls using REST and real-time calls using WebSocket, sharing the same authentication and the same structured output format.

Key terms glossary

Word Error Rate (WER): The standard metric for speech-to-text accuracy, calculated by dividing the sum of insertions, deletions, and substitutions by the total number of words in the reference transcript.
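As a worked illustration of the WER definition, the sketch below computes WER = (S + D + I) / N with a word-level edit distance, where N is the number of words in the reference transcript. It assumes plain whitespace tokenization and applies no text normalization (casing, punctuation, number formatting), which real benchmarks usually apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "please update the billing address today" and the hypothesis "please update the building address", there is one substitution and one deletion against six reference words, so the function returns 2/6, or about 33% WER.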

Diarization Error Rate (DER): The metric for speaker diarization performance, measured as the fraction of time not attributed correctly to a speaker or non-speech. DER includes speaker error, false alarms, and missed detections.
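The three DER components combine as a simple ratio. The sketch below assumes the common scoring convention in which the denominator is the total duration of reference speech; the parameter names are illustrative, and scoring tools differ on details such as forgiveness collars and overlap handling.

```python
def diarization_error_rate(
    false_alarm_s: float,
    missed_s: float,
    speaker_error_s: float,
    total_speech_s: float,
) -> float:
    """DER = (false alarm + missed detection + speaker error) / total speech time.

    All arguments are durations in seconds. `total_speech_s` is assumed to be
    the total reference speech duration.
    """
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (false_alarm_s + missed_s + speaker_error_s) / total_speech_s
```

For a 10-minute call with 600 seconds of reference speech, 6 seconds of false alarms, 9 seconds of missed speech, and 15 seconds attributed to the wrong speaker give a DER of 30/600, or 5%.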

Code-switching: Alternating between two or more languages within a single conversation, which commonly occurs in multilingual contact centers.

First Call Resolution (FCR): A contact center KPI measuring the percentage of customer issues resolved during their initial interaction, directly impacted by the accuracy of agent-assist tools and CRM data.

Interactive Voice Response (IVR): An automated telephony system that routes calls based on spoken or keypad input, enabling caller self-service or intelligent queue assignment before reaching a human agent.

Speech-to-Text (STT): The process of converting spoken audio into written text transcripts, forming the foundation for all downstream audio intelligence workflows.

Average Handle Time (AHT): The average duration of a single customer call transaction, including talk time, hold time, and post-call administrative work.

WebSocket: A stateful, bidirectional communication protocol that maintains a persistent TCP connection between client and server, enabling low-latency data transfer for real-time transcription.

REST API: A stateless request-response API style over HTTP where the client submits a request and the server returns a response without holding session state between calls, used for asynchronous batch transcription.
