Modernizing contact center architecture with AI agents and transcription

Published on May 29, 2026

by Ani Ghazaryan

TL;DR: If you're rebuilding your contact center around AI agents, one architectural decision determines everything else: what sits beneath the AI layer. Transcription is that foundation. Every intent classification, routing decision, agent action, and QA score inherits the ceiling set by what the STT layer captures. This article walks through the architectural shift from legacy stack to AI-native design, covering where transcription breaks and why, and how platforms like Aircall and Selectra restructured their data layer.

Rebuilding a contact center around AI agents is an architectural problem, not a feature roadmap problem. AI agents that classify intent, guide conversations in real time, and populate CRM records without human intervention only perform as well as the structured, accurate data flowing into them. Most vendor demos skip straight to the capabilities. The question they don't answer is what those capabilities are built on.

That answer determines scope, because the signal agents consume has to be low-latency as well as structured and accurate. Get the transcription layer wrong and every downstream capability, from agent assist to compliance monitoring to voice agents, caps out at the quality of that broken input.

Update: new model released

Since publishing this article, Gladia has released Solaria-3 — our newest speech model, built specifically for real-world business audio: noisy, fast-paced, and conversational. On production recordings, Solaria-3 ranks #1 across English and core European languages (EN, FR, DE, ES, IT), beating AssemblyAI, ElevenLabs, Deepgram, Mistral, and Speechmatics. It’s also 26% more accurate than Solaria-1 on real English customer calls. That said, the two models are built to complement each other, not compete. Solaria-1 remains the better choice if you need broad language coverage (100+ languages), code-switching support, real-time streaming, or if your audio is clean, formal, or institutional, such as parliamentary recordings. Solaria-3 is the upgrade if your priority is accuracy on European business audio, call center recordings, or anything noisy and conversational. Not sure which to use?

Compare Solaria-1 and Solaria-3 →

See the open-source STT benchmark →

The legacy CCaaS stack wasn't built for AI agents

The anatomy of a traditional contact center stack is straightforward: telephony at the edge, automatic call distribution routing calls by skillset, IVR handling self-service tiers, recording capturing audio to storage, and post-call analytics running on batch exports. That stack answers one operational question efficiently: which agent handles which call?

Legacy stacks treat voice as a recording artifact. The system stores it, QA teams occasionally sample it, and analytics run on batch exports days later if at all. The data it contains, speaker identity, intent, entities named, compliance phrases spoken, stays locked inside the audio file until something transcribes it.

AI agents need the opposite. They need structured, timestamped, speaker-attributed voice data flowing in near real time so large language models (LLMs) can classify intent, retrieve context, generate suggestions, and take actions. A legacy stack that delivers batch transcripts long after a call has ended cannot feed that loop.

Incremental improvements won't close the gap. A faster IVR can't compensate for transcription lagging behind real-time conversation. A better analytics dashboard can't surface coaching signals from transcripts with high word error rate (WER) on accented calls. You have to change the data layer itself, and that change starts with the STT infrastructure sitting beneath everything else.

Where speech-to-text actually sits in a modern CCaaS architecture

In an AI-native contact center stack, STT functions as infrastructure, not a feature of any single application: it sits between the voice channel and every downstream AI capability the platform exposes. Gladia is an audio intelligence API built for that layer.

The data flow looks like this:

Voice in: Raw audio stream from telephony (SIP, WebRTC, Twilio, etc.)
STT layer: Converts the audio stream to structured transcripts with word-level timestamps, speaker labels, and partial transcripts
AI agent layer: LLM orchestration, retrieval-augmented generation (RAG), intent classification, action-taking, real-time agent assist
Downstream systems: CRM population, ticketing, QA scoring, compliance flagging, analytics pipelines

Every AI capability in that stack consumes output from the STT layer, which means every capability inherits the ceiling set by transcription quality and latency. A wrong name at the STT layer can produce a wrong CRM entry. A missed compliance keyword can produce a missed compliance event. Significant transcription delay produces a voice agent that interrupts at the wrong moment or fails to respond within the conversational window.

The audio-to-LLM pipeline only works as well as the structured output feeding it. Accuracy at the transcription layer isn't optional; it sets the hard ceiling on everything downstream.
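To make "structured output" concrete, here is a minimal sketch of what the STT layer might hand to the AI agent layer. The class and field names (`Word`, `Utterance`, `to_llm_context`) are illustrative assumptions, not Gladia's actual response schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float       # seconds from the start of the call
    end: float
    confidence: float  # 0.0 to 1.0


@dataclass
class Utterance:
    speaker: Optional[int]  # None when no speaker label is available
    language: str
    is_final: bool          # False for partial transcripts
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def min_confidence(self) -> float:
        # Lowest word confidence; 0.0 for an empty utterance.
        return min((w.confidence for w in self.words), default=0.0)


def to_llm_context(utterances: List[Utterance]) -> str:
    """Render final utterances as timestamped, speaker-attributed lines."""
    lines = []
    for u in utterances:
        if not u.is_final or not u.words:
            continue  # skip partials and empty segments
        who = f"Speaker {u.speaker}" if u.speaker is not None else "Unknown"
        lines.append(f"[{u.words[0].start:.1f}s] {who}: {u.text}")
    return "\n".join(lines)
```

Keeping timestamps, speaker labels, and confidence on every utterance lets downstream systems decide what to trust instead of parsing plain text after the fact.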

What AI agents actually need from the transcription layer

Not all transcription requirements are equal across CCaaS use cases. Voice agents have different constraints than async QA pipelines. Here's what production workloads require at each level:

Requirement	What Gladia delivers	Why it matters architecturally
Accuracy under real conditions	WER on noisy, accented, multi-speaker audio, not lab benchmarks. Solaria-1 delivers 29% lower WER on average on conversational speech.	Every downstream AI capability inherits this ceiling. A wrong entity corrupts every CRM entry, coaching score, and compliance flag flowing from that call.
Latency within conversational budget	~300ms final transcript latency for real-time voice agents. Fast processing for async audio.	Transcription is the dominant variable in total pipeline latency (STT + LLM + text-to-speech + network). Voice agents require low latency to feel natural.
Global language coverage	100+ supported languages including languages other APIs don't cover, with code-switching detection for mid-conversation language shifts.	BPO operations serving global markets need consistent accuracy across languages without maintaining multiple vendor integrations.
Structured output for LLM reasoning	Speaker diarization, word-level timestamps, partials, named entity recognition (NER), confidence scores. Diarization available in async workflows.	Plain text transcripts require post-processing pipelines for downstream AI features. Structured output at the STT layer reduces that work.
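The latency row in the table above treats total pipeline latency as a sum of stages. A rough sketch of that arithmetic follows. Only the ~300ms STT figure comes from the table; the LLM, text-to-speech, and network figures and the 1,000ms budget are hypothetical placeholders you would replace with your own measurements.

```python
def latency_report(stages_ms, budget_ms):
    """Sum per-stage latencies and compare the total to a response budget."""
    total = sum(stages_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        # Fraction of the total contributed by each stage.
        "share": {name: ms / total for name, ms in stages_ms.items()},
    }


# Hypothetical stage timings, except stt_final (taken from the table above).
stages = {
    "stt_final": 300,
    "llm_first_token": 250,
    "tts_first_audio": 150,
    "network": 100,
}

report = latency_report(stages, budget_ms=1000)
print(report["total_ms"], report["within_budget"])
```

Running the same report against measured P95 values rather than averages gives a more realistic picture of how often a voice agent will miss its conversational window.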

What breaks when transcription fails

Transcription failures don't surface loudly. They cascade downstream and appear in different systems under different labels.

Routing failure: A misheard intent leads to wrong ACD routing, which sends the caller to the wrong team, inflates handle time, and generates a repeat call. The STT layer caused the error; the routing system takes the blame.

AI agent failure: When transcription is off, LLMs hallucinate to fill the gap. A voice agent receiving a garbled partial transcript either interrupts at the wrong moment or generates an irrelevant response that breaks the conversational window entirely.

Compliance failure: "I do NOT consent" misheard as "I consent" isn't a transcription edge case. It's a legal and regulatory risk created silently at the STT layer, propagating through every downstream compliance system, and surfacing only when auditors review the record. Missing compliance keywords on accented speech is a known failure mode for STT systems not designed for multilingual robustness.

QA and analytics breakdown: Every coaching score, sentiment flag, and quality assurance output is bounded by transcript accuracy. When the STT layer consistently fails on accented or multilingual calls, your QA team sees confusing performance metrics rather than the root cause.
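One way to contain the compliance failure described above is to refuse to auto-classify consent when the words around it are low-confidence. The sketch below assumes the STT layer returns per-word confidence scores; the negation list, the three-word look-back window, and the 0.85 threshold are illustrative choices, not recommended values.

```python
import re

NEGATIONS = {"not", "don't", "dont", "never", "no", "cannot", "can't", "cant", "won't", "wont"}


def classify_consent(words, min_confidence=0.85):
    """Classify a consent statement from (token, confidence) pairs.

    Returns "consent", "refusal", "review" (send to a human), or "none".
    """
    # Normalize tokens: lowercase, keep only letters and straight apostrophes.
    tokens = [re.sub(r"[^a-z']", "", tok.lower()) for tok, _ in words]
    for i, tok in enumerate(tokens):
        if not tok.startswith("consent"):
            continue
        # Look at the consent word and up to three words before it.
        window = range(max(0, i - 3), i + 1)
        if any(words[j][1] < min_confidence for j in window):
            return "review"  # too uncertain to decide automatically
        if any(tokens[j] in NEGATIONS for j in window):
            return "refusal"
        return "consent"
    return "none"
```

A check like this cannot recover a word the STT layer dropped entirely, which is why transcript accuracy remains the underlying control. It only routes uncertain cases to human review instead of letting them pass silently.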

Proof: how Aircall and Selectra restructured around a better transcription layer

Aircall: from in-house STT to AI-native architecture

Aircall initially built and maintained an in-house transcription engine, which meant engineering capacity went toward maintaining infrastructure rather than building AI features on top of it. Aircall migrated to Gladia's API and cut transcription time by 95%, from 30 minutes to 1.5 minutes per call, and now processes over 1M calls per week through Gladia.

The architectural lesson isn't just the speed improvement. It's what faster transcription enabled. With a reliable, fast STT layer in place, Aircall built searchability across call libraries, automated summaries, sentiment detection, agent coaching, and CRM enrichment as product capabilities rather than infrastructure projects. One API integration replaced the in-house engine and became the shared data layer for every downstream AI feature.

Selectra: from sampled reviews to full-coverage QA

Selectra is a utility comparison platform where call quality directly affects both customer outcomes and compliance standing. After integrating Gladia, Selectra automated QA monitoring across their calls. Their QA team shifted from manually reviewing audio recordings to validating AI findings and acting on the patterns those findings surfaced.

Both cases teach the same architectural lesson: the upgrade wasn't "we added a transcription feature." It was "we restructured what the platform could do downstream by fixing the data layer first."

Architectural choices when modernizing

Build vs. integrate: the real TCO calculation

Building and maintaining a custom STT engine carries infrastructure costs that compound over time: GPU provisioning, model version management, stability monitoring, compliance certification overhead, and the engineering capacity to keep pace with a rapidly advancing model landscape. Teams that move off self-hosted open-source models often do so after struggling with WER on real contact center audio under production conditions. The Scaling Conversations with 15x ROI session covers this trade-off in depth.

At production scale, managed API pricing of $0.20–$0.75/hour with all audio intelligence features included can be competitive with self-hosting once GPU costs, engineering time, and compliance overhead enter the model. Check current per-hour pricing for async and real-time rates. Teams moving to managed STT APIs often redirect engineering hours back to building product.
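The build-versus-integrate comparison can be sketched as simple monthly arithmetic. Every input in the example below (GPU count and rate, engineering allocation, compliance cost, call volume) is a hypothetical placeholder. Only the $0.20–$0.75/hour managed price range comes from this article.

```python
def managed_api_monthly_cost(audio_hours, price_per_hour):
    """Usage-based cost: hours of audio transcribed times the per-hour rate."""
    return audio_hours * price_per_hour


def self_hosted_monthly_cost(gpu_count, gpu_hourly_rate, engineer_fte,
                             engineer_monthly_cost, compliance_monthly_cost,
                             hours_per_month=730):
    """Largely fixed cost: always-on GPUs plus people and compliance overhead."""
    gpu = gpu_count * gpu_hourly_rate * hours_per_month
    people = engineer_fte * engineer_monthly_cost
    return gpu + people + compliance_monthly_cost


def break_even_audio_hours(self_hosted_cost, price_per_hour):
    """Monthly audio volume at which the managed API costs the same as self-hosting."""
    return self_hosted_cost / price_per_hour


# Hypothetical inputs for illustration only.
self_hosted = self_hosted_monthly_cost(
    gpu_count=2, gpu_hourly_rate=2.0,
    engineer_fte=0.5, engineer_monthly_cost=12000,
    compliance_monthly_cost=1000,
)
managed = managed_api_monthly_cost(audio_hours=10000, price_per_hour=0.5)
print(self_hosted, managed, break_even_audio_hours(self_hosted, 0.5))
```

The break-even volume is the number to watch: below it the managed API is cheaper under these assumptions, and above it the comparison depends on how self-hosted capacity scales with load.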

Evaluation criteria for selecting an STT vendor

When selecting a transcription API for a production CCaaS stack, these are the criteria that matter in priority order.

STT vendor evaluation checklist:

Production WER: Test on your actual call audio (accented speech, noisy telephony, domain vocabulary), not vendor benchmark datasets
P95 latency: Measure under peak concurrent load for your expected call volume, not average latency under ideal conditions
Language coverage: Verify languages in your customer base and code-switching capability for bilingual markets
Streaming API quality: Check partial transcript latency, stability, and final transcript accuracy under connection variance
Structured output: Confirm word-level timestamps, speaker diarization, NER, confidence scores, and language detection in the base response
Deployment options: Evaluate cloud, on-premises, or air-gapped options for regulated data residency requirements
Compliance certifications: Check for relevant certifications such as SOC 2 Type II, HIPAA, GDPR, ISO 27001 based on your requirements
Data training policy: Confirm whether customer audio is used to retrain models and what the default policy is on paid plans
Integration speed: Measure time from API key to staging environment on your actual infrastructure
Uptime history: Review the public status page and documented SLA your product's reliability depends on

The Gladia compliance hub and benchmark methodology are starting points for working through the compliance and accuracy criteria against your own requirements and audio samples.
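For the production WER criterion, a minimal sketch of the standard calculation is word-level edit distance divided by the number of reference words. The normalization here (lowercasing and whitespace splitting) is a simplifying assumption. A real evaluation would also normalize punctuation, numbers, and spelling variants before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("i do not consent", "i do consent"))
```

Scoring a human-verified sample of your own calls this way, split by language, accent, and line quality, shows where a vendor's accuracy holds up and where it does not.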

Sequence matters: transcription layer first, AI capabilities second

Dependency determines the modernization sequence. AI agent capabilities depend on the transcription layer. Building intent classification, agent assist, or voice agent workflows on a weak STT foundation can mean rework when the foundation is upgraded. The rework cost includes engineering time and the training data, prompt engineering, and fine-tuning done on top of less accurate transcripts.

The sequence:

1. Establish the transcription layer. Evaluate and integrate an STT API against your production audio. Validate WER, latency, and language coverage on real calls before committing to downstream builds.
2. Instrument the data layer. Add diarization, NER, and confidence scoring to your transcript output. Build the structured data schema your downstream AI systems will consume.
3. Build AI capabilities. With a clean, reliable data layer in place, build agent assist, automated QA, summarization, and compliance monitoring as product features rather than infrastructure workarounds.

Architecture is the lever

Modernizing a contact center isn't about adding AI features to a legacy stack. You're rebuilding the data layer those AI features depend on. The CCaaS platforms that emerge from this cycle with durable competitive position will be the ones that treat transcription as foundational infrastructure from the start, not as a bolt-on after the agent and analytics layer is already committed.

Every AI capability you build inherits the ceiling set by what the STT layer captures. Build on a strong foundation and that ceiling is high. Build on a weak one and every downstream investment hits a constraint you didn't address at the right point in the sequence.

If you're evaluating the transcription layer for your CCaaS platform, explore the Gladia API for CCaaS, test Solaria-1 against your own multilingual call audio, and start with 10 free hours to run a proof of concept before the sales conversation.

For teams further along in evaluation, the AI solutions for contact centers guide covers multilingual infrastructure questions, and the Twilio integration documentation covers the telephony layer connection.

FAQs

What is the correct sequence for modernizing a CCaaS platform with AI agents?

Fix the transcription layer first, then build AI agent capabilities on top. Building intent classification, routing logic, or voice agents on a weak STT foundation means rebuilding those features when the data layer is upgraded, compounding engineering cost.

How do I measure transcription accuracy for contact center audio specifically?

Test production WER on your actual call recordings, not vendor benchmark datasets. Benchmark conditions (clean audio, native speakers, controlled vocabulary) typically produce significantly better WER than real contact center audio with accented speech, background noise, and overlapping speakers.

What latency does the STT layer need to support real-time voice agents?

Voice agents need fast final transcripts; Gladia delivers ~300ms final transcript latency for real-time use. Total pipeline latency accumulates across STT, LLM inference, TTS, and network overhead, so every stage has to stay fast for the conversation to feel natural.

Does speaker diarization work in real-time transcription?

Gladia's speaker diarization is available in async workflows. For real-time use cases, handle speaker attribution in post-processing to maintain accuracy.

What is the cost difference between building in-house STT and integrating a managed API?

Self-hosted STT infrastructure can be costly once GPU provisioning, engineering time, compliance certification, and maintenance overhead are included. Managed API pricing of $0.20–$0.75/hour includes all audio intelligence features at the base rate.

How does code-switching affect transcription accuracy for global contact centers?

Many STT APIs struggle when speakers shift languages mid-conversation, returning inconsistent output or defaulting to a single detected language for the entire transcript. Code-switching detection across 100+ languages is important for contact centers serving bilingual speaker populations.
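When evaluating code-switching support, it helps to check whether a vendor's output registers language changes at all. The sketch below assumes each transcript segment carries a start time and a detected language code. That `(start_seconds, language_code)` shape is a hypothetical simplification, not a specific API response format.

```python
def language_switches(segments):
    """Return (time, from_lang, to_lang) for each change of detected language.

    segments: list of (start_seconds, language_code) in chronological order.
    """
    switches = []
    for (_, prev_lang), (start, lang) in zip(segments, segments[1:]):
        if lang != prev_lang:
            switches.append((start, prev_lang, lang))
    return switches


# Illustrative segments from a bilingual call.
call = [(0.0, "en"), (4.2, "en"), (9.8, "es"), (15.1, "en")]
print(language_switches(call))
```

An empty result on a call you know to be bilingual suggests the system is defaulting to a single language for the whole transcript.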

What compliance certifications should a CCaaS transcription vendor hold?

Check for relevant compliance certifications based on your requirements. For SaaS vendors handling regulated data, look for certifications such as SOC 2 Type II and ISO 27001, along with GDPR and HIPAA compliance where applicable. Review the vendor's default data training policy on paid plans to understand how your audio data is handled.

Key terms glossary

Word Error Rate (WER): The percentage of words a speech-to-text system transcribes incorrectly. Production WER on real contact center audio typically runs significantly higher than benchmark WER on clean datasets.

Speaker diarization: The process of partitioning an audio stream into segments by speaker identity and labeling each segment (Speaker 0, Speaker 1, etc.). Useful for multi-speaker contact center conversations where distinguishing agent from customer turns can improve downstream AI accuracy.

Latency budget: The total time available for a system to respond before user experience degrades, measured end-to-end across all pipeline components. For voice agents, low latency across STT, LLM inference, text-to-speech, and network is important for natural conversation.

Code-switching: Alternating between two or more languages within a single conversation or utterance. Common in bilingual contact center environments and a challenge for STT systems that lack mid-conversation language detection.

P95 latency: The 95th percentile latency value in a distribution of response times. Often used for capacity planning because it captures tail performance rather than just average behavior.
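As a worked illustration of the P95 definition, the sketch below uses the nearest-rank method, one of several common percentile conventions. The latency samples are made up for the example.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Ceiling of pct * n / 100 using integer arithmetic to avoid float rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Made-up response times in milliseconds, with one slow outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 126, 900]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms, percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

The single slow request barely moves the median but dominates P95, which is the behavior a caller actually experiences under load.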
