Many Contact Center as a Service (CCaaS) platforms default to real-time streaming transcription across all workloads, reasoning that faster is always better. That assumption costs a 25% pricing premium per hour based on industry pricing standards and degrades the accuracy of every QA scorecard, CRM record, and coaching output that flows from those transcripts.
Treating async batch processing as a slower version of streaming is a fundamental misunderstanding of how the two modes work and which contact center workflows each actually serves. Getting this right affects transcript accuracy, which directly impacts every downstream system from QA scorecards to CRM enrichment.
Optimizing transcription costs through async workflows
Async transcription processes a recorded audio file via a REST API call and returns a complete, structured transcript once analysis is finished. For post-call workloads, this is entirely sufficient because no agent or system is waiting on the transcript while the call is live.
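The submit-then-wait flow described above can be sketched as a small polling helper. This is a minimal sketch assuming a generic job-based REST API: the `submit` and `fetch` callables, the `status` and `result` field names, and the status values are hypothetical placeholders rather than a documented API, so check the provider's reference for the real endpoints and response shape.

```python
import time
from typing import Any, Callable, Dict


def transcribe_async(
    submit: Callable[[str], str],
    fetch: Callable[[str], Dict[str, Any]],
    audio_url: str,
    poll_interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Submit a recording, then poll until the complete transcript is ready.

    `submit` and `fetch` wrap the HTTP calls (both hypothetical): `submit`
    returns a job id, and `fetch` returns a dict with a "status" key plus a
    "result" key once the job is done.
    """
    job_id = submit(audio_url)
    waited = 0.0  # counts time spent sleeping between polls, not wall-clock time
    while waited <= timeout_s:
        job = fetch(job_id)
        if job["status"] == "done":
            return job["result"]
        if job["status"] == "error":
            raise RuntimeError(f"transcription job {job_id} failed")
        sleep(poll_interval_s)
        waited += poll_interval_s
    raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
```

Injecting `submit`, `fetch`, and `sleep` keeps the polling logic independent of any HTTP client and testable without network access. Because nothing is waiting on the transcript during the call, a polling interval of a few seconds is usually acceptable for post-call work.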
Our Growth tier async transcription is as low as $0.20 per hour, with real-time streaming priced higher. At high volumes, this difference compounds each month. Our Growth and Enterprise pricing includes speaker diarization, translation, sentiment analysis, named entity recognition, and summarization in the base rate. Competitors like AssemblyAI charge separately: using AssemblyAI Universal-2 with diarization, sentiment, entity detection, and summarization can stack additional costs. Deepgram prices Audio Intelligence features per token rather than per minute, making cost projection for high-volume call centers inherently difficult.
Where batch transcription outperforms streaming
Batch processing submits the entire audio file before any output is generated. The model processes the full recording before committing to a transcript, which enables it to resolve phoneme ambiguities using fuller context. A wrong product code or agent name captured in the first pass silently corrupts every downstream record it touches: the CRM entry, the coaching score, the compliance log.
Real-time models operate under latency constraints that limit how much surrounding audio they can use per processing step. That constraint reduces their ability to resolve ambiguous speech segments, which is why batch processing often delivers lower WER than streaming. For post-call analytics, QA scoring, and CRM enrichment, that accuracy difference is not marginal: it separates automated scoring you can trust from transcripts that require manual review to be usable.
A 10-minute call processes rapidly in batch mode. This is not a bottleneck. Aircall cut transcription time by 95% (from 30 minutes to 1.5 minutes per call) and now processes over 1M calls per week through our async pipeline.
Why batching outperforms real-time for diarization
Speaker diarization is powered by pyannoteAI's Precision-2 model, which requires the complete audio recording to produce accurate speaker attribution. The model needs access to the full conversation to reconcile overlapping speech or speaker turns that span the entire recording.
Our async benchmark covers 74+ hours of audio across 8 providers and shows Solaria-1 achieves on average 3x lower DER than alternatives. For European business and contact-center audio where both WER and DER matter, Solaria-3 delivers strong overall accuracy on async workflows.
Use cases requiring live transcription
Real-time assist for live calls
Agent assist is a key use case for real-time streaming. When a customer asks a billing question or raises a complaint, a live transcript fed to a knowledge-base retrieval layer can surface the correct response quickly, before the agent has to place the caller on hold or search manually, which is the condition that drives AHT up on complex call types. The output needs to arrive while the call is still active and the agent can act on it, and that is what makes WebSocket streaming architecturally justified for this workflow.
IVR routing and intent detection
AI-powered IVR can use the customer's spoken intent at the start of a call to route it to the correct queue or agent. Natural language understanding can extract intent from transcribed speech and match it to a destination, reducing transfers and shortening AHT. This is a real-time use case because the routing decision must happen before the call reaches an agent.
Per industry benchmarking data, FCR averages 70% across industries, and a 1% improvement in FCR correlates with approximately 1.4 points of NPS gain.
Defining real-time voice use cases
Real-time streaming is justified for specific CCaaS use cases. Everything else belongs on async.
- Agent assist: Live knowledge-base retrieval and compliance prompts surfaced during the call.
- IVR routing and intent detection: Sub-300ms classification to route the call before queue entry.
- Voice agents: Conversational AI where the STT output feeds an LLM response within the same conversational turn.
- Live captions and accessibility: Real-time captions for hearing-impaired agents or supervisors monitoring a call floor.
If your workflow is not on this list, defaulting to async will produce better transcripts at lower cost. Our guide on choosing the right automation use case provides a broader framework.
"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors... Clean API, easy to integrate and deploy to production." - Yassine R. on G2
Accuracy differences: choosing between Solaria-3 and Solaria-1
Solaria-3 benchmarks for post-call accuracy
Solaria-3 is our model optimized for real-world European business audio across English, French, German, Spanish, and Italian, designed for async workflows. It is the right choice for post-call QA scoring, conversation intelligence, and CRM enrichment where you need the lowest possible WER on the audio your contact center actually handles.
Benchmarks on contact-center-specific audio:
- 9.6% WER on real customer English audio, ranking #1 against AssemblyAI, ElevenLabs, Deepgram, Mistral Voxtral, and Speechmatics.
- 6.4% WER on Earnings22 financial call audio (the only model under 7% on this dataset).
- 33.9% WER on Switchboard conversational audio (the only model under 35% on this dataset).
- 26% improvement in WER over Solaria-1 on real English production audio.
Solaria-1 latency benchmarks for live calls
Solaria-1 is our recommended model for real-time streaming use cases. It delivers final transcripts at approximately 300ms latency, with partial updates returned in under 103ms, which keeps agent-assist cards and IVR routing within the latency budget for conversational AI.
Solaria-1 covers 100+ languages, including 42 that no other API-level STT provider supports: Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, Marathi, and others that matter specifically for BPO operations in Southeast Asia and South Asia. It also supports native code-switching, detecting mid-conversation language changes in real time without producing garbled output or a broken session.
Cost comparison: async vs real-time at contact center scale
Table 1: Real-time vs asynchronous STT comparison
| Feature | Real-time streaming | Asynchronous batch |
| --- | --- | --- |
| Architecture | WebSocket (stateful) | REST API (stateless) |
| Latency | ~300ms final, <103ms partials | Processes rapidly (under 1 min per hour typical) |
| Processing method | Incremental output | Complete audio file before output |
| Speaker diarization | Limited availability | pyannoteAI Precision-2 (async-only) |
| Primary use cases | Agent assist, IVR, voice agents | QA scoring, CRM enrichment, compliance |
| Primary model | Solaria-1 | Solaria-3 (EU business audio), Solaria-1 (broader language coverage) |
Table 2: Unit economics cost projection (Growth plan, all-inclusive)
| Monthly volume | Gladia async (Growth tier) | Gladia real-time (Growth tier) | Competitor bundled est. |
| --- | --- | --- | --- |
| 1,000 hours | $200 | $250 | Higher |
| 5,000 hours | $1,000 | $1,250 | Significantly higher |
| 10,000+ hours | $2,000+ | $2,500+ | Contact for quote |
Competitor pricing structures vary, with some providers charging separately for features like diarization, sentiment, entity detection, and summarization, per pricing analyses. Our figures on Growth tier include all those features at the base rate.
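The Table 2 figures follow directly from the per-hour rates, which a short calculation makes explicit. This is a sketch that assumes the $0.20 async and $0.25 real-time Growth tier rates quoted in this article apply flat at every volume; actual invoices may differ, and the function names are illustrative.

```python
from decimal import Decimal

# Per-hour Growth tier rates as quoted in this article (assumed flat).
ASYNC_RATE = Decimal("0.20")
REALTIME_RATE = Decimal("0.25")


def monthly_cost(hours: int, rate: Decimal) -> Decimal:
    """Monthly spend in USD for a given volume at a flat hourly rate."""
    return rate * hours


def streaming_premium(hours: int) -> Decimal:
    """Extra monthly spend from sending post-call audio through real-time."""
    return monthly_cost(hours, REALTIME_RATE) - monthly_cost(hours, ASYNC_RATE)


def premium_ratio() -> Decimal:
    """Real-time premium relative to the async rate (0.25 means 25%)."""
    return (REALTIME_RATE - ASYNC_RATE) / ASYNC_RATE


if __name__ == "__main__":
    for hours in (1_000, 5_000, 10_000):
        async_cost = monthly_cost(hours, ASYNC_RATE)
        realtime_cost = monthly_cost(hours, REALTIME_RATE)
        print(f"{hours} h: async ${async_cost:.2f}, real-time ${realtime_cost:.2f}, "
              f"premium ${streaming_premium(hours):.2f}")
```

`Decimal` is used instead of floats so the currency arithmetic stays exact.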
WebSocket connections are persistent and stateful, so running and scaling them carries operational considerations that stateless REST does not. For post-call workflows specifically, async batch processing via REST delivers the lowest WER on your post-call audio (9.6% on real English recordings for Solaria-3) alongside simpler infrastructure, making it the appropriate choice for QA and CRM enrichment where real-time latency is not required.
On data governance: our Growth and Enterprise plans never use customer data for model training, with no opt-out action required. On the Starter plan, data may be used for training by default. For regulated contact centers handling healthcare or financial services audio, paid plans provide data protection guarantees. Our compliance stack covers GDPR, HIPAA, SOC 2 Type II, and ISO 27001.
How to map transcription modes to CX workflows
Prioritize async for QA and CRM data
Automated QA scoring depends on transcript accuracy. When the transcript contains wrong names, misattributed speakers, or missed compliance phrases, QA scores can mislead leadership rather than guide coaching. Our analysis of business call transcript analysis techniques covers the downstream data requirements in detail.
WER directly determines which downstream systems you can trust. A wrong name silently corrupts a CRM entry. A missed disclosure creates a compliance gap that surfaces during audit, not during QA sampling. Reliably capturing what a support call should include requires transcript accuracy that async workflows on Solaria-3 deliver and real-time streaming cannot match for this audio type. Our async pipeline produces structured JSON output covering entity extraction, speaker attribution, sentiment, and summaries in a single API call.
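To illustrate why both word accuracy and speaker attribution matter for compliance, here is a minimal sketch of a disclosure check over a diarized transcript. The utterance shape (a list of dicts with `speaker` and `text` keys) is a hypothetical simplification for illustration, not the actual response schema, and the exact-phrase match is deliberately naive.

```python
import re
from typing import Any, Dict, Iterable, List


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def missing_disclosures(
    utterances: Iterable[Dict[str, Any]],
    agent_speaker: Any,
    required_phrases: List[str],
) -> List[str]:
    """Return the required phrases that the agent never said.

    Only utterances attributed to `agent_speaker` count, so a misattributed
    speaker or a mis-transcribed word both produce a reported gap.
    """
    agent_text = " ".join(
        _normalize(u["text"]) for u in utterances if u["speaker"] == agent_speaker
    )
    # Pad with spaces so phrases match on whole-word boundaries only.
    padded = f" {agent_text} "
    return [p for p in required_phrases if f" {_normalize(p)} " not in padded]
```

For example, if the customer rather than the agent says "terms and conditions", that phrase is still reported as missing for the agent. A production check would likely need fuzzy or semantic matching rather than exact phrases.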
Limit streaming to agent assist needs
Keep WebSocket connections reserved for interactions where an agent is live on the call and can act on the transcript in real time. For everything that happens after the call ends, REST is the right architecture for simplicity and cost-effectiveness. This is the configuration that modernized contact center architectures use at production scale.
Should you choose real-time or async transcription?
The architectural question resolves to a single test: does the output need to arrive while the call is live and an agent or system can act on it? If yes, real-time streaming on Solaria-1. If the output feeds a QA scorecard, a CRM record, a coaching dashboard, or a compliance archive, async batch on Solaria-3.
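That single test can be written down as a small routing function. This is a sketch under stated assumptions: the workflow labels and the `TranscriptionPlan` type are hypothetical names chosen for illustration, the model strings are plain labels rather than API identifiers, and the language rule follows this article's guidance that Solaria-3 covers English, French, German, Spanish, and Italian.

```python
from dataclasses import dataclass

# Workflows where the output must arrive while the call is live (hypothetical labels).
LIVE_WORKFLOWS = {"agent_assist", "ivr_routing", "voice_agent", "live_captions"}

# Languages this article lists for Solaria-3, as ISO 639-1 codes.
SOLARIA_3_LANGUAGES = {"en", "fr", "de", "es", "it"}


@dataclass(frozen=True)
class TranscriptionPlan:
    mode: str       # "real-time" or "async"
    transport: str  # "websocket" or "rest"
    model: str      # model label, not an API identifier


def plan_for(workflow: str, language: str = "en") -> TranscriptionPlan:
    """Pick mode, transport, and model for a workflow and primary language."""
    if workflow in LIVE_WORKFLOWS:
        return TranscriptionPlan("real-time", "websocket", "Solaria-1")
    # Everything post-call goes async; the model depends on language coverage.
    model = "Solaria-3" if language in SOLARIA_3_LANGUAGES else "Solaria-1"
    return TranscriptionPlan("async", "rest", model)
```

Defaulting unknown workflows to async mirrors the rule that anything not needing a live response belongs on the batch path.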
Contact center operations that default to streaming for all workloads pay a price premium for post-call audio, may accept lower WER than their QA workflows need, and maintain WebSocket infrastructure where simpler async REST would suffice for the majority of their call volume. The greater risk is trusting downstream systems built on lower-accuracy transcripts: a wrong agent name corrupts the CRM record, a missed compliance phrase creates an audit gap, a misattributed speaker inverts a coaching score. Accurate transcription is foundational to service consistency in contact centers.
For teams currently self-hosting or using legacy providers, migration paths to a unified REST and WebSocket API are available, with documentation for providers like Deepgram and AssemblyAI. Multiple customers report sub-24-hour integration times, and our engineers are available directly, not via a ticket queue.
Start with 10 free hours on our Starter plan to test both modes against your own contact center audio. Then test Solaria-3 on your noisiest, most accented BPO recordings and compare WER against what your current provider returns.
FAQs
Does Gladia support real-time speaker diarization?
No. High-quality speaker diarization powered by pyannoteAI's Precision-2 model is only available in our asynchronous batch workflows. Real-time streaming uses different approaches optimized for low-latency output, and for applications requiring accurate speaker attribution, async batch processing is recommended.
What is the latency of Gladia's real-time transcription?
Our real-time streaming runs on Solaria-1 and delivers a final transcript latency of approximately 300ms, with partial transcript updates returned in under 103ms.
Are customer audio files used to train Gladia's models?
On our Growth and Enterprise plans, customer data is never used for model training, and no opt-out action is required. On our Starter plan, data may be used for training by default.
Which Gladia model should I use for European contact center audio?
Use Solaria-3 for async post-call workflows covering English, French, German, Spanish, and Italian, where it ranks #1 on real customer English audio at 9.6% WER and achieves 6.4% WER on Earnings22 financial call audio. Use Solaria-1 for real-time streaming and for languages outside the five Solaria-3 covers.
What is the cost difference between async and real-time at 10,000 hours per month?
On our Growth tier, async can run as low as $0.20 per hour and real-time at $0.25 per hour. At 10,000 hours per month, that works out to $2,000 for async versus $2,500 for real-time, a difference of $500 monthly. Compared to competitor bundled pricing, our Growth tier pricing can deliver substantial savings at high volumes.
Can I run both async and real-time transcription through a single Gladia integration?
Yes. Our platform covers both modes through a unified integration approach, with async calls using REST and real-time calls using WebSocket, sharing the same authentication and the same structured output format.
Key terms glossary
Word Error Rate (WER): The standard metric for speech-to-text accuracy, calculated by dividing the sum of insertions, deletions, and substitutions by the total number of words spoken.
Diarization Error Rate (DER): The metric for speaker diarization performance, measured as the fraction of time not attributed correctly to a speaker or non-speech. DER includes speaker error, false alarms, and missed detections.
Code-switching: Alternating between two or more languages within a single conversation, which commonly occurs in multilingual contact centers.
First Call Resolution (FCR): A contact center KPI measuring the percentage of customer issues resolved during their initial interaction, directly impacted by the accuracy of agent-assist tools and CRM data.
Interactive Voice Response (IVR): An automated telephony system that routes calls based on spoken or keypad input, enabling caller self-service or intelligent queue assignment before reaching a human agent.
Speech-to-Text (STT): The process of converting spoken audio into written text transcripts, forming the foundation for all downstream audio intelligence workflows.
Average Handle Time (AHT): The average duration of a single customer call transaction, including talk time, hold time, and post-call administrative work.
WebSocket: A stateful, bidirectional communication protocol that maintains a persistent TCP connection between client and server, enabling low-latency data transfer for real-time transcription.
REST API: A stateless request-response protocol where the client submits a request, the server processes it, and the connection closes after the response is returned, used for asynchronous batch transcription.
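The WER and DER definitions in this glossary can be made concrete with a short sketch. The WER function uses a standard word-level edit distance against a reference transcript with plain whitespace tokenization; the DER function assumes the conventional denominator of total reference speech time, which the definition above does not spell out. Neither function applies the text normalization that published benchmarks typically use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def diarization_error_rate(
    false_alarm_s: float,
    missed_s: float,
    speaker_error_s: float,
    total_speech_s: float,
) -> float:
    """(false alarm + missed detection + speaker error) / total speech time."""
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (false_alarm_s + missed_s + speaker_error_s) / total_speech_s
```

A single wrong product code in a five-word reference, for instance "A17" transcribed as "A70", is one substitution out of five words.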