Gladia | Speech-to-text for contact centers: a complete guide

Customer expectations are shifting, and the contact center industry — projected to reach $45.19 billion by 2035 at a 20% CAGR — is under immense pressure. Businesses face a dual mandate: reduce operational costs while improving customer experience (CX) and agent productivity. The platforms that will win are those moving beyond manual QA sampling, fragmented call analytics, and slow post-call workflows, embracing async AI automation at full call coverage.

This guide covers how async-first, telephony-native speech-to-text is transforming contact center operations, the core challenges of processing voice data at scale, and the architectural frameworks needed to build the next generation of CCaaS platforms.

The evolution of CCaaS and the async speech AI imperative

Historically, contact centers have operated on siloed data, reactive customer service, and manual quality assurance. Modern CCaaS platforms are rebuilding around a different operating model: every call transcribed, every transcript scored, every interaction enriched into the CRM, automatically, at 100% coverage. This is the async speech AI imperative. And it is the foundation of every upsell motion that follows, including real-time agent assist and voice bots.

The industry's center of gravity sits in async workflows (post-call transcription, summarization, QA scoring, etc.) because these use cases unlock the most immediate operational ROI and serve as the entry point for platforms before they expand into real-time AI.

The platforms that will define the next decade of CX are not simply transcribing calls faster; they are rebuilding their AI stacks around async-native speech infrastructure that can process millions of minutes of audio with low WER, accurate diarization, and near-instant turnaround. As the technology matures, that same foundation will extend into emerging real-time use cases (voice agents, live agent assist, etc.), where async becomes the launchpad for synchronous experiences rather than a separate stack to maintain. CCaaS leaders like Aircall are already piloting real-time agents and agent assist features, signaling where the category is headed.

💡 Key insight: The foundation of every automated contact center workflow (QA scoring, call summarization, sentiment analytics, CRM enrichment, compliance monitoring) is a highly accurate, fast-processing async speech-to-text engine built specifically for telephony audio.

Core challenges in modern voice infrastructure

Building async speech AI for contact centers is fundamentally different from transcribing clean, studio-recorded audio. Engineering and product leaders face several challenges when scaling their voice infrastructure to 100% call coverage.

Telephony audio quality

Contact centers handle messy, 8kHz audio streams across VoIP and SIP pipelines — including platforms like Twilio, Vonage, and Genesys. Background noise, poor connections, and telephone artifacts easily break generic transcription models that were trained on high-fidelity audio. A model that performs well in a lab will degrade significantly on real-world telephony data. At 100% coverage, that degradation compounds across millions of calls.

VoIP (Voice over Internet Protocol): A technology that transmits voice calls over the internet rather than traditional phone lines, often resulting in compressed, lower-quality audio.
SIP (Session Initiation Protocol): The signaling protocol used to initiate and manage VoIP calls; a common standard in enterprise telephony infrastructure.

Processing time at scale

Async speech AI is “async” only in the sense that it does not need to return text during a live conversation; processing time still matters. Legacy providers often take 10-20 minutes per hour of audio to return a transcript. At contact center volumes, that delay breaks every downstream workflow: agents finish their shift before summaries are generated, QA teams can't score calls the same day, and CRM enrichment falls out of sync with sales outreach. Modern async STT should return results in minutes, not hours.
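
To make "minutes per hour of audio" concrete, turnaround can be normalized against audio length and checked against the window in which a downstream workflow needs the transcript. The sketch below is illustrative only: the function names and the sample figures are assumptions, not measurements from any vendor.

```python
def minutes_per_audio_hour(processing_seconds: float, audio_seconds: float) -> float:
    """Processing minutes required per hour of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (processing_seconds / audio_seconds) * 60.0


def meets_window(processing_seconds: float, window_minutes: float) -> bool:
    """True if the transcript arrives inside the downstream workflow's window."""
    return processing_seconds <= window_minutes * 60.0


# Hypothetical example: a 6-minute call (360 s) transcribed in 30 s.
rate = minutes_per_audio_hour(processing_seconds=30, audio_seconds=360)
print(f"{rate:.1f} processing minutes per audio hour")
print("within a 2-minute summary window:", meets_window(30, window_minutes=2))
```

Tracking this ratio per call, rather than an average, makes it easier to spot the long-tail recordings that miss the window.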

Diarization on mono-channels

Many legacy contact center infrastructures record audio on a single (mono) channel, where both the agent and the customer are mixed into one track. Distinguishing the agent from the customer in these mono-channel recordings is notoriously difficult. Without reliable diarization, sentiment attribution breaks, QA scoring breaks, and call intelligence breaks. You can't score an agent's script adherence if you can't tell when the agent was speaking.
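
As a rough illustration of why speaker labels matter downstream, the sketch below computes talk time per speaker from diarized segments. The `Segment` shape and the "agent"/"customer" labels are assumptions for the example, not any particular API's output format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str   # hypothetical label, e.g. "agent" or "customer"
    start: float   # seconds from call start
    end: float     # seconds from call start


def talk_time_by_speaker(segments: list[Segment]) -> dict[str, float]:
    """Sum segment durations per speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        if seg.end < seg.start:
            raise ValueError(f"segment ends before it starts: {seg}")
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def talk_ratio(segments: list[Segment], speaker: str) -> float:
    """Share of total speech time attributed to one speaker."""
    totals = talk_time_by_speaker(segments)
    total = sum(totals.values())
    return totals.get(speaker, 0.0) / total if total else 0.0


call = [
    Segment("agent", 0.0, 6.0),
    Segment("customer", 6.0, 10.0),
    Segment("agent", 10.0, 16.0),
]
print(talk_time_by_speaker(call))
print(talk_ratio(call, "agent"))
```

If diarization swaps the labels on even one segment, both numbers shift, and every metric built on them shifts with it.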

Precision on critical entities

General transcription accuracy is not sufficient for contact centers. If an async model mistranscribes a specialized piece of jargon, an alphanumeric account code, a customer name, or a postal address, downstream CRM routing fails completely. Likewise, every QA rule that depends on entity recognition fires incorrectly. Precision on domain-specific entities is the accuracy metric that matters in production.
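
One way to see how a good global score can hide entity failures is to compute word error rate alongside a simple entity recall on the same transcript. This is a minimal sketch: `entity_recall` is a hypothetical helper that uses naive substring matching, and the sample sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def entity_recall(entities: list[str], hypothesis: str) -> float:
    """Fraction of reference entities found verbatim in the hypothesis."""
    if not entities:
        raise ValueError("entities must not be empty")
    hyp = hypothesis.lower()
    return sum(1 for e in entities if e.lower() in hyp) / len(entities)


reference = "account number is ab1234 and the name is dana lee"
hypothesis = "account number is ab1224 and the name is dana lee"
print(word_error_rate(reference, hypothesis))            # one wrong word in ten
print(entity_recall(["ab1234", "dana lee"], hypothesis))  # half the entities lost
```

A single substituted token keeps WER low while the account code, the one field routing depends on, is wrong. A production metric would need normalization and stricter matching than substring search.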

Multilingual coverage and code-switching

Contact centers operating across EMEA, LATAM, and APAC handle calls that move between languages mid-sentence. Generic ASR models trained on one language at a time fail on these calls silently: they transcribe what they can and drop the rest. For async QA and compliance use cases, those silent failures become silent compliance gaps.
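
A simple guard against silent failures is to flag calls where the detected language changes between utterances, so they can be routed for closer review. The sketch below assumes each utterance already carries a language code; that per-utterance label and the function names are assumptions for illustration.

```python
def count_language_switches(languages: list[str]) -> int:
    """Number of adjacent utterance pairs whose language label differs."""
    return sum(1 for a, b in zip(languages, languages[1:]) if a != b)


def is_code_switched(languages: list[str], min_switches: int = 1) -> bool:
    """Flag a call once it reaches the chosen number of language switches."""
    return count_language_switches(languages) >= min_switches


# Hypothetical per-utterance language labels for one call.
labels = ["es", "es", "en", "es"]
print(count_language_switches(labels))
print(is_code_switched(labels))
```

This only catches switches between utterances; mid-sentence switches need word-level or segment-level language labels.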

Key pillars of async contact center AI

To achieve operational scale and competitive differentiation, CCaaS platforms must integrate several interconnected async AI capabilities. The following sections break down the core components of modern contact center AI and how they work together.

1. Post-call transcription and summarization

This is the foundational async use case in modern CCaaS, and typically the entry point for platforms before they expand into downstream AI workflows. The operational logic is straightforward: until every call is reliably transcribed, you cannot score calls, enrich CRM records, detect sentiment, or build any of the analytics layers above.

The gap between legacy and modern async transcription is stark. Legacy AWS-based pipelines often take 10-20 minutes to return a transcript for a one-hour call — meaning agents complete their shift before the transcript is available. Modern async STT returns transcripts in a fraction of the time, unlocking same-shift summaries and near-real-time CRM enrichment.

A well-designed post-call pipeline delivers:

Searchable, diarized transcripts available within minutes of call end
Structured summaries pushed to agents to eliminate manual ACW (after-call work)
CRM enrichment with customer intent, resolution status, and next steps
Multilingual support for international contact center operations
Proper punctuation, casing, and formatting suitable for CRM display

After-call work (ACW) refers to the administrative tasks — notes, CRM updates, call categorization — that agents complete after a customer interaction ends. Reducing ACW through automated post-call summarization is one of the highest-leverage efficiency levers in contact center operations.

2. Transforming call center QA and compliance at 100% coverage

AI-powered QA lets contact centers automatically monitor 100% of customer interactions, compared with the 1–3% that traditional manual QA teams are able to review. It does so by feeding async transcripts into automated scoring engines that evaluate script adherence, rebuttals, and key phrase coverage.

By pairing async STT with Large Language Models (LLMs) and rule-based scorecards, call centers can now:

Automatically score 100% of interactions based on intent, sentiment, and script adherence
Generate objective performance evaluations the same day — not weeks later
Flag compliance violations across the full call population, not just sampled calls
Identify coaching opportunities across the full agent population
Trigger QA reviews based on keyword or entity detection in transcripts

When combined with accurate async transcription, LLMs analyze call content for meaning, intent, and quality. This shift from sampling to full-coverage QA is one of the most structurally significant changes async AI brings to contact center operations.
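
The rule-based half of such a scorecard can be sketched as a key-phrase check over agent utterances only. The rubric items, phrases, and speaker labels below are hypothetical; a real deployment would tune them per client and likely pair them with an LLM for the less literal criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str  # hypothetical diarization label
    text: str


# Hypothetical rubric: item name -> phrases, any one of which satisfies the item.
RUBRIC = {
    "greeting": ["thank you for calling"],
    "recording_disclosure": ["this call is recorded", "call may be recorded"],
    "closing": ["anything else i can help"],
}


def score_script_adherence(
    utterances: list[Utterance],
    rubric: dict[str, list[str]],
    agent_label: str = "agent",
) -> tuple[float, dict[str, bool]]:
    """Return (share of rubric items met, per-item result) using agent speech only."""
    agent_text = " ".join(u.text.lower() for u in utterances if u.speaker == agent_label)
    results = {
        item: any(phrase in agent_text for phrase in phrases)
        for item, phrases in rubric.items()
    }
    score = sum(results.values()) / len(results) if results else 0.0
    return score, results


call = [
    Utterance("agent", "Thank you for calling. This call is recorded for quality."),
    Utterance("customer", "Is there anything else I can help you with? That's what the last agent said."),
    Utterance("agent", "Let me check your account."),
]
score, details = score_script_adherence(call, RUBRIC)
print(round(score, 2), details)
```

Note that the closing phrase appears in the customer's speech and is correctly not credited to the agent, which is exactly the distinction that breaks when diarization is unreliable.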

3. Sentiment analysis and call intelligence

Async call intelligence layers sentiment detection, intent extraction, and conversation analytics on top of transcripts, enabling CSAT prediction, lead scoring, objection detection, and agent performance analytics across the full call population.

Where basic QA evaluates whether an agent said the right words, call intelligence evaluates what actually happened in the conversation: how the customer felt, where the agent lost the deal, which objections recurred, and which moments predict churn.

This capability stack is only as good as the transcript underneath it. Sentiment attributed to the wrong speaker is worse than no sentiment at all: it sends supervisors chasing ghosts. As a result, diarization quality, not sentiment model quality, is often the bottleneck in production call intelligence deployments.

Well-implemented async call intelligence delivers:

Per-speaker sentiment attribution to distinguish customer frustration from agent delivery
Objection detection to surface why deals stall and which rebuttals work
CSAT prediction from conversational cues, without requiring post-call surveys
Intent extraction to route calls to the right workflows (churn risk, upsell, compliance review)
Trend analytics across agents, teams, and time periods

4. CRM enrichment and revenue intelligence

Async transcription combined with LLM-based entity extraction automatically populates CRM records with customer intent, decision makers, competitor mentions, next steps, and deal progression signals — eliminating the 90 seconds of manual data entry agents complete after every call.

When post-call transcripts flow directly into an enrichment pipeline, the CRM becomes a live reflection of every customer conversation. The operational impact compounds. Sales managers gain visibility into deal stages without chasing reps for notes. Customer success surfaces churn risk from conversational cues. Marketing learns which objections recur across segments. Every one of these workflows depends on the same foundation: fast, accurate, diarized async transcription that can be reliably parsed by downstream LLMs.

💡 Pro tip: Treat your async STT output as the input to your CRM pipeline, not as a text artifact for humans to read. Formatting, casing, punctuation, and entity tagging matter far more when the next consumer is an LLM than when the next consumer is a person.
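
When the next consumer is a pipeline rather than a person, it helps to validate extracted fields before they are written to the CRM. The sketch below parses a JSON payload such as an LLM extraction step might return; the field names, the allowed intent values, and the `CrmEnrichment` type are all assumptions for illustration.

```python
import json
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary for the intent field.
ALLOWED_INTENTS = {"purchase", "support", "cancellation", "other"}


@dataclass
class CrmEnrichment:
    customer_intent: str
    next_steps: list[str] = field(default_factory=list)
    competitor_mentions: list[str] = field(default_factory=list)


def parse_enrichment(raw_json: str) -> CrmEnrichment:
    """Validate an extraction payload before it reaches the CRM."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")

    intent = data.get("customer_intent")
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected customer_intent: {intent!r}")

    def as_str_list(key: str) -> list[str]:
        value = data.get(key, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
        return value

    return CrmEnrichment(intent, as_str_list("next_steps"), as_str_list("competitor_mentions"))


payload = '{"customer_intent": "purchase", "next_steps": ["send quote"]}'
print(parse_enrichment(payload))
```

Rejecting malformed payloads at this boundary keeps one bad extraction from becoming wrong CRM data at scale.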

5. Overcoming language barriers with multilingual async STT

Async speech-to-text paired with machine translation allows global CCaaS platforms to transcribe, summarize, and analyze calls across 10+ languages through a single unified pipeline.

As CCaaS platforms shift labor to offshore markets and scale into EMEA, LATAM, and APAC, language diversity becomes a significant operational bottleneck. Running separate QA, analytics, and summarization workflows per language is expensive, fragmented, and impossible to roll up into unified reporting. Combined with multilingual async STT, machine translation creates a unified analytics layer across languages, without requiring multilingual QA staff, multilingual scorecards, or multilingual dashboards.

This capability delivers three concrete operational benefits:

Centralized QA: Quality assurance teams can score interactions across languages from a single, unified workflow — using translated transcripts for non-native reviewers and native transcripts for native reviewers.
Unified call intelligence: Sentiment, objection detection, and CSAT prediction can operate across the full call population, not per-language silos.
Scalable international expansion: Platforms can enter new geographic markets without rebuilding their analytics stack from scratch.

6. Voice compliance in regulated industries

Async transcription of every recorded call (with PII redaction, searchable audit trails, and retention controls) is the foundation of voice compliance in financial services, healthcare, and any regulated industry where spoken interactions create legal obligations.

The regulatory landscape for voice is expanding. MiFID II requires financial firms to record trader voice and retain it for at least five years, extendable to seven at a regulator's request. HIPAA requires healthcare contact centers to protect PHI in any form, including transcripts. GDPR requires data minimization and purpose limitation on any personal data, including what's captured in a call recording. Each of these regimes is impossible to comply with at scale without async transcription of the full call population.

A compliant async voice pipeline must handle:

PII redaction at the transcription layer, before sensitive data (payment details, government ID numbers, personal contact information, etc.) is logged downstream
Searchable transcripts to support regulator requests and internal audits
Entity-level accuracy on names, account numbers, and jargon that trigger compliance rules
Retention and deletion controls aligned to jurisdiction and industry
Short-call accuracy — critical for trader voice, where calls can be 3–5 seconds long
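
To illustrate the marker-style redaction idea from the list above, the sketch below replaces card numbers and email addresses in transcript text with placeholder tokens. It is a simplified stand-in, not how any vendor's redaction works: the regular expressions and marker names are assumptions, and it only handles numbers already written as digits.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to avoid redacting arbitrary long numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace Luhn-valid card numbers and email addresses with markers."""
    def replace_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD_NUMBER]" if luhn_valid(digits) else match.group()

    text = CARD_CANDIDATE.sub(replace_card, text)
    return EMAIL.sub("[EMAIL]", text)


print(redact("My card is 4111 1111 1111 1111 and email is jo@example.com."))
```

Running this before transcripts are stored or forwarded is the point: once an unredacted transcript has been written anywhere, the exposure has already happened.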

7. BPO and outsourcing platform infrastructure

Business Process Outsourcing (BPO) providers increasingly embed async speech AI as a white-label layer resold to their enterprise end-clients, making STT a core part of the BPO's competitive offering, not just an internal tool. Modern BPOs compete on margin, and AI-driven automation is one of the few remaining levers to expand it.

BPO infrastructure has distinct requirements from single-tenant CCaaS deployments:

Multi-tenancy: A single STT integration must serve dozens or hundreds of end-clients with isolated data, configurations, and billing
Reseller-friendly pricing: Margins only work if the underlying STT cost scales predictably with call volume
High concurrency: BPO operations often process millions of minutes per week across many clients simultaneously
SLAs on processing time: Enterprise end-clients demand predictable turnaround, not best-effort delivery
Per-client customization: Custom vocabulary, QA rubrics, and output formats per end-client
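
The multi-tenancy and per-client customization requirements above can be sketched as a small registry that keeps configuration and usage separate per end-client. Everything here is hypothetical: `TenantConfig`, `TenantRegistry`, and their fields are illustrative names, not part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    custom_vocabulary: tuple[str, ...] = ()
    output_format: str = "json"
    redact_pii: bool = True


class TenantRegistry:
    """Keeps configuration and metered usage isolated per end-client."""

    def __init__(self) -> None:
        self._configs: dict[str, TenantConfig] = {}
        self._audio_seconds: dict[str, float] = {}

    def register(self, config: TenantConfig) -> None:
        if config.tenant_id in self._configs:
            raise ValueError(f"tenant already registered: {config.tenant_id}")
        self._configs[config.tenant_id] = config
        self._audio_seconds[config.tenant_id] = 0.0

    def config_for(self, tenant_id: str) -> TenantConfig:
        return self._configs[tenant_id]  # raises KeyError for unknown tenants

    def record_usage(self, tenant_id: str, audio_seconds: float) -> None:
        if tenant_id not in self._configs:
            raise KeyError(tenant_id)
        self._audio_seconds[tenant_id] += audio_seconds

    def billed_minutes(self, tenant_id: str) -> float:
        return self._audio_seconds[tenant_id] / 60.0


registry = TenantRegistry()
registry.register(TenantConfig("acme", custom_vocabulary=("Acme Premium Plan",)))
registry.register(TenantConfig("globex", redact_pii=False))
registry.record_usage("acme", 600)
print(registry.billed_minutes("acme"), registry.billed_minutes("globex"))
```

Keying every lookup by tenant is what lets one STT integration serve many end-clients while keeping vocabulary, redaction settings, and billing from leaking across them.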

From async to real-time: the natural CCaaS upsell path

The async use cases above are the foundation, and they are also the typical entry point for CCaaS platforms. The adoption motion usually follows a predictable sequence:

Async post-call transcription — platform modernizes its transcription layer to unlock faster summaries and CRM enrichment
Automated QA scoring — same transcripts get fed into scoring engines to move from sampled to full-coverage monitoring
Call intelligence and sentiment — analytics layers stack on top of the QA foundation
Real-time agent assist — once async is in production, platforms extend to live transcription for in-call guidance
Voice bots and IVR — the most latency-sensitive tier, typically adopted last

This sequence matters architecturally: platforms that select an STT vendor on async merits alone and later try to bolt on real-time often discover the underlying model wasn't built for sub-300ms streaming. The cleanest upgrade path is a vendor whose async and real-time APIs share the same underlying model, so accuracy, diarization, and entity recognition behave identically across both modes.

💡 Key insight: The async-to-real-time upgrade path is the dominant CCaaS adoption motion, but it only works cleanly if async and real-time share the same underlying model. Bolting on real-time from a different vendor reintroduces the accuracy and diarization problems the async migration was meant to solve.

Technical architecture and enterprise scale

Scalable, secure, and easily integrated AI infrastructure is the prerequisite for deploying async voice AI in production CCaaS environments.

Processing at high volumes

Unpredictable API costs and infrastructure complexity are major concerns for high-volume CPaaS and BPO operations. Processing millions of minutes of audio per week requires robust cloud architecture with predictable performance and predictable pricing at scale.

A concrete example of what scalable async transcription infrastructure enables in practice: when the voice platform Aircall needed to process over 1 million calls weekly, upgrading their transcription engine allowed them to cut transcription delays from 30 minutes down to just 1.5 minutes — a 95% reduction in processing time that directly accelerated their AI-powered insights pipeline.

Data privacy and enterprise governance

For healthcare, finance, and enterprise contact centers, async STT infrastructure must meet non-negotiable procurement baselines: SOC 2 Type 2 (controls audited over time, not at a single point), ISO 27001 (the international standard for information security management), and built-in PII redaction that strips credit card numbers, SSNs, and account credentials before transcripts reach downstream LLMs, dashboards, or storage. For European CCaaS customers, EU-based data residency is increasingly a hard requirement rather than a nice-to-have. Security cannot be an afterthought — any downstream system that receives unredacted transcripts inherits significant regulatory risk.

Integration with legacy telephony and CRMs

Most CCaaS modernization programs don't start from greenfield. They start from a stack that includes 15-year-old CRM systems, legacy recording platforms, and telephony pipelines built before AI was a consideration. Async speech AI has to integrate into this reality as opposed to requiring a rebuild around it.

Preparing your call center for 2026 and beyond

By 2026, conversational AI deployments are projected to reduce agent labor costs by $80 billion globally, making the transition to AI-augmented contact center operations a strategic imperative, not an optional upgrade.

The next generation of CCaaS will be neither strictly human nor fully autonomous. It will be a strategic hybrid. Platforms that invest now in async speech AI infrastructure will be positioned to scale into this hybrid model without rebuilding their stack later. The recommended phased approach is as follows:

Start with async post-call transcription and summarization: This is the foundational capability. Get every call reliably transcribed and summarized, same-shift, before attempting anything above it. This unlocks ACW elimination, CRM enrichment, and the transcript foundation every downstream AI workflow depends on.
Extend to automated QA and call intelligence: With transcripts flowing reliably, layer in QA scoring, sentiment analysis, and objection detection across 100% of calls. This is where the ROI compounds fastest — you are no longer sampling 3% of calls, you are scoring all of them.
Expand into real-time agent assist and voice bots: Once async is solved, real-time use cases build on the same underlying model and the same diarization and entity recognition guarantees. Platforms that skip the async foundation and try to go straight to real-time often end up re-solving the same problems twice.

The foundation of all three phases is the same: fast, accurate, telephony-native async transcription that handles 8kHz audio, mono-channel diarization, accents, code-switching, and multilingual operations without degradation.

Best practices

The following practices reflect hard-won lessons from working with leading CCaaS engineering teams deploying async speech AI at scale.

Evaluate STT on your telephony data, not public benchmarks: Public ASR benchmarks are typically measured on clean, high-fidelity audio datasets. Your 8kHz VoIP audio with background noise will perform differently. Always run STT vendor evaluations on a representative sample of your actual call recordings before making infrastructure decisions.

Measure processing time, not just accuracy: A transcript that takes 20 minutes to return for a one-hour call is functionally useless for same-shift summaries, CRM enrichment, or same-day QA scoring. Async processing speed is a first-class vendor evaluation criterion — not a secondary concern.

Prioritize entity-level precision over global word error rate: Word Error Rate (WER) treats all words equally. In contact centers, a mistranscribed account number or customer name causes complete downstream failure — even if overall WER is low. Define custom accuracy metrics for the entities that matter most in your specific workflows.

Treat diarization as a first-class requirement, not a nice-to-have: Every downstream analytics layer — sentiment, QA scoring, call intelligence, objection detection — depends on reliable speaker separation. For mono-channel recording infrastructure, evaluate diarization quality on your actual calls, not on the vendor's marketing samples.

Build PII redaction into the transcription pipeline: PII redaction that happens after transcripts reach your analytics platform is too late. Sensitive data has already been logged, transmitted, and potentially stored. Implement PII redaction at the transcription layer — before transcripts reach LLMs, dashboards, or data warehouses.

Design QA rubrics around transcript-native signals: Automated QA works best when the scoring rubric is built around signals the transcript actually surfaces reliably — key phrases, entities, sentiment, talk-time ratios — rather than subjective signals that still require human review. Start with rubric items that score reliably at scale, then layer in human review for the ambiguous cases.

Plan for mono-channel diarization from the start: If your infrastructure records calls in mono, do not treat diarization as a problem you will solve later. Single-channel speaker separation requires specific model capabilities. Assess your STT vendor's mono-channel diarization performance during the evaluation phase — not after deployment.

Implement a phased rollout: Attempting to deploy transcription, QA, call intelligence, and real-time agent assist simultaneously creates compounding failure modes. Ship async post-call transcription first, measure outcomes, then extend to QA and analytics, and finally into real-time. Each phase provides real-world data to calibrate the next.

Frequently Asked Questions (FAQ)

What processing time should async transcription deliver?

Production-grade async transcription should return results in a small fraction of the audio duration, measured in minutes per hour of audio. Legacy providers can take 10-20 minutes per hour of audio, which breaks same-shift summaries, same-day QA scoring, and timely CRM enrichment. Modern async STT should keep processing time well below the operational window in which downstream workflows need the transcript, typically within a handful of minutes for a standard support call.

What is speaker diarization and why is it hard in contact centers?

Speaker diarization is the process of automatically identifying and separating different speakers in a recording, and it's hard in contact centers because many legacy systems mix the agent and customer onto a single mono audio channel. Mono-channel diarization requires more sophisticated modeling than dual-channel setups, and every call intelligence workflow above it (sentiment, QA, objection detection) inherits diarization errors directly.

What compliance certifications should a CCaaS STT vendor have?

A CCaaS STT vendor should be SOC 2 Type 2 and ISO 27001 certified, with PII redaction built into the transcription API. SOC 2 Type 2 verifies security controls audited over time, ISO 27001 is the baseline information security standard for enterprise procurement, and PII redaction prevents sensitive customer data from reaching downstream systems unmasked. For European deployments, EU data residency is increasingly a hard procurement requirement.

How does async transcription enable CRM enrichment?

Async transcription produces structured, diarized, entity-tagged transcripts that serve as the input to an LLM-based or rule-based enrichment pipeline. The pipeline extracts the fields that matter to the CRM (customer intent, decision makers, competitor mentions, objections, next steps, deal stage signals) and writes them to the corresponding CRM record. The critical dependency is transcript quality: if entities are mistranscribed or attributed to the wrong speaker, the enrichment pipeline writes wrong data to the CRM at scale.

What's the difference between async and real-time speech-to-text, and which should a CCaaS platform adopt first?

Async STT processes recorded audio and returns transcripts after the call ends; real-time STT streams transcripts during the live call at sub-300ms latency. Most CCaaS platforms should adopt async first. It is the foundation for post-call transcription, summarization, QA, call intelligence, and CRM enrichment, which collectively represent the largest share of CCaaS AI ROI. Real-time use cases like agent assist and voice bots are the natural upsell, but platforms that skip the async foundation often re-solve the same accuracy and diarization problems twice.

Get started with Gladia

Gladia provides the underlying AI audio infrastructure for the world's most innovative voice platforms. It offers fast-processing, multilingual async speech-to-text with industry-leading accuracy, purpose-built for VoIP and SIP pipelines. It is built on a unified model stack for the scale and precision that contact center operations demand.
