Real-time transcription for contact centers: what latency and accuracy thresholds matter

Published on June 19, 2026

by Ani Ghazaryan

TL;DR: Real-time STT for contact centers requires sub-300ms latency to match the natural 200-300ms window of human conversation, but raw speed without accuracy breaks the product. The latency budget is cumulative: audio capture, STT inference, NLU, and TTS each consume a slice. Partial transcript stability matters as much as final output speed: IVR routing and agent assist act on intermediate text before the transcript locks, so instability at that layer compounds into wrong-queue transfers and missed coaching prompts.

Most product teams building live voice products obsess over raw latency, run a benchmark on clean English audio, and ship. Three months later, the same teams find agent assist surfacing irrelevant knowledge base articles, IVR misrouting non-English-speaking callers, and CSAT scores moving in the wrong direction. The transcript was arriving fast, but it was also wrong.

This is the core tension in real-time contact center transcription: fast output that's inaccurate doesn't help an NLU model route a call or surface a coaching prompt. It actively harms the product. The metric product leaders need to optimize for isn't raw speed in isolation. It's the time to a stable, actionable transcript under the real conditions of contact center audio. This article breaks down the exact thresholds, the latency budget mechanics, and what to measure when evaluating STT providers for live contact center use cases. For post-call analytics, compliance workflows, and high-accuracy speaker attribution, the async pipeline is the stronger choice. This article focuses specifically on the real-time layer and the conditions where real-time is the core use case.

Why contact center transcription has different performance requirements

Real-time transcription isn't a faster version of batch processing but a fundamentally different operating mode where the model streams partial outputs incrementally as speech arrives, rather than waiting for a complete audio file to process. For contact center applications, this distinction matters because downstream systems act on those partial outputs before the final transcript locks.

Real-time agent assist and IVR routing

Agent assist tools surface relevant knowledge base articles, compliance prompts, and suggested responses while the caller is still speaking. The entire value proposition collapses if the STT layer buffers for 800ms before emitting a partial transcript. Any transcription error forces the system to ask the caller to repeat themselves or generate a response based on misunderstood input, adding meaningful latency that compounds across a call volume of millions per month.

IVR routing creates a tighter constraint: the routing decision happens before the caller finishes speaking. If the partial transcript for "I want to cancel my subscription" flips between plausible interpretations before settling on the correct text, the routing model either waits for stability (adding latency) or acts on bad data (routing to the wrong queue). Partial transcript stability, which measures how often intermediate outputs change before finalization, is a critical accuracy dimension that most vendor benchmarks ignore entirely.
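Partial transcript stability can be turned into a number during an evaluation. The sketch below is one possible definition, not a vendor metric: it counts the words of each partial that the next partial rewrote, divided by all words emitted in partials. The function names and the definition itself are assumptions made for illustration.

```python
from typing import List


def revised_words(prev: str, curr: str) -> int:
    """Count words of `prev` that `curr` did not keep as a shared prefix."""
    prev_words = prev.split()
    curr_words = curr.split()
    kept = 0
    for p, c in zip(prev_words, curr_words):
        if p != c:
            break
        kept += 1
    return len(prev_words) - kept


def partial_instability(partials: List[str]) -> float:
    """Fraction of emitted partial words that a later partial rewrote.

    0.0 means every partial only appended words to the previous one.
    """
    # The last partial has no successor, so it cannot be revised.
    emitted = sum(len(p.split()) for p in partials[:-1])
    if emitted == 0:
        return 0.0
    revised = sum(revised_words(a, b) for a, b in zip(partials, partials[1:]))
    return revised / emitted


partials = [
    "I want",
    "I want to counsel",
    "I want to cancel my",
    "I want to cancel my subscription",
]
print(round(partial_instability(partials), 3))  # 1 revised word out of 11 emitted
```

A routing model that fires on the second partial here would act on "counsel", which is why a low score on this kind of measure matters more for IVR than for a human agent scanning the text.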

For teams migrating from existing providers, we provide a migration guide from Deepgram and a migration guide from AssemblyAI to reduce switching friction.

Detecting intent in live calls

Real-time intent detection requires clean, stable text to fire correctly before the caller disengages. The code-switching guide for contact centers documents how frequently this breaks in global contact centers where callers switch languages mid-sentence. Intent detection trained on English-only transcripts fails to classify multilingual input accurately, regardless of how fast the STT layer runs.

The 300ms threshold and why it governs the full pipeline

The 300ms threshold isn't an arbitrary engineering target. In natural conversation, response gaps are generally observed within a 200-300ms window, with turn-taking pauses occasionally extending to 700ms. When your voice product exceeds that window, callers stop experiencing a conversation and start experiencing a delay, and the system's technical sophistication becomes irrelevant.

Table 1: Real-time STT evaluation metrics for contact centers

Metric	Definition	Why it matters for CCaaS
TTFS (Time to First Segment)	Delay from speech start to first partial transcript	Determines perceived responsiveness for agent assist and IVR
P50 latency	Latency at which 50% of requests are processed faster	Reflects typical caller experience, not best-case performance
WER	(Substitutions + deletions + insertions) / reference word count	Measures raw transcription accuracy on your audio, not studio recordings
Partial transcript stability	Frequency and magnitude of changes before final output locks	IVR routing and intent detection act on intermediate text before the final output locks. High instability increases the risk that a routing decision fires on text that has not yet settled
Code-switching accuracy	WER on audio containing mid-conversation language changes	Critical for global contact centers with multilingual callers

Minimizing perceived latency and boosting agent productivity

TTFS measures the delay between a spoken word and its first appearance as text. P50 latency reflects what a typical user experiences: half of all requests process faster than the P50 value and half slower. These two metrics together define perceived responsiveness. TTFS drives the experience of the caller who needs to know the system is listening, while P50 determines whether agent assist or IVR logic can act within the natural pause window.
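As a rough sketch of how these two metrics could be computed from a test run, assume each utterance is logged as a pair of timestamps: when speech started and when the first partial arrived. The log format and function names below are hypothetical.

```python
from statistics import median
from typing import List, Tuple


def ttfs_ms(speech_start_ms: float, first_partial_ms: float) -> float:
    """Time to First Segment for one utterance, in milliseconds."""
    return first_partial_ms - speech_start_ms


def p50_ms(latencies_ms: List[float]) -> float:
    """P50 is the median: half of the samples are faster, half slower."""
    return median(latencies_ms)


# Hypothetical (speech_start_ms, first_partial_ms) pairs from a test call.
events: List[Tuple[float, float]] = [
    (0, 210),
    (1000, 1260),
    (2000, 2310),
    (3000, 3240),
    (4000, 4650),
]

samples = [ttfs_ms(start, first) for start, first in events]
print(samples)          # [210, 260, 310, 240, 650]
print(p50_ms(samples))  # 260
```

Note that the 650ms outlier does not move the P50 at all, which is why a median alone can hide the slow requests that callers notice.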

When STT latency stays low, agents receive live transcripts during the call for automated CRM population and coaching prompts, eliminating after-call note work.

Preventing call drop-offs

When systems respond slowly, callers talk over the bot, repeat themselves, or abandon the call entirely. Delayed responses cause customers to hang up and agents to interrupt the AI, and the compounding cost is significant: a single extra second per interaction multiplied across millions of monthly calls produces measurable CSAT degradation and increased Average Handle Time (AHT).

Breaking down the real-time latency budget

No single component owns the latency problem. A well-optimized voice assistant pipeline stacks roughly five contributions: voice activity detection and audio capture, network upload, STT inference, LLM processing, and TTS first-chunk delivery plus transport overhead. STT is one slice, not the whole budget, and it must stay small enough to leave room for the layers above and below it.

Audio capture and chunking typically introduce an initial delay of 50–100ms before the STT model has seen any input. Smaller audio chunks reduce this initial delay but provide less phonetic context per inference step, increasing the probability of errors on ambiguous words or accented speech. Chunk size is a variable that documentation defaults can only approximate: the correct value depends on your caller demographic, accent distribution, and telephony stack, and must be validated against real samples from your production queues.
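To make the chunk size trade-off concrete, the arithmetic below converts a chunk duration into a payload size. It is a minimal sketch that assumes uncompressed mono audio; the function name and the default of 16-bit samples are assumptions, and your telephony stack may deliver 8-bit companded audio instead.

```python
def chunk_bytes(
    sample_rate_hz: int,
    chunk_ms: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit linear PCM
    channels: int = 1,          # assumption: mono
) -> int:
    """Size in bytes of one audio chunk of the given duration."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


# 8 kHz narrowband telephony at the two ends of the 50-100ms range.
print(chunk_bytes(8000, 50))   # 800
print(chunk_bytes(8000, 100))  # 1600

# Same 100ms chunk if the stream carries 8-bit samples.
print(chunk_bytes(8000, 100, bytes_per_sample=1))  # 800
```

Halving the chunk duration halves both the initial buffering delay and the amount of audio the model sees per inference step, which is the trade-off to validate on your own recordings.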

Real-time STT processing and NLU budgets

STT processing can spike well beyond sub-300ms targets when the model has to re-process previous tokens to correct an earlier partial transcript. Solaria-1 achieves approximately 270ms overall responsiveness. Confirm final-transcript delivery timing against your specific audio conditions and WebSocket configuration before setting production SLOs.

The LLM or NLU layer that processes STT output adds its own time-to-first-token cost. A realistic voice assistant budget might allocate roughly 400ms to LLM processing alone, an industry estimate that will vary by model and infrastructure, which means the STT layer's total contribution should stay comfortably under 300ms to leave the rest of the budget intact. Any STT provider that regularly burns 400–500ms on transcription is consuming budget that belongs to inference.

Agent response latency

The final TTS or system response typically adds in the region of 150ms for first-chunk delivery plus network transport overhead. These figures are industry estimates rather than fixed constants. What matters is that the 300ms conversational threshold governs the entire pipeline, not just one component. As a rough illustration: if STT takes 300ms, NLU takes 400ms, and TTS takes 150ms, the caller experiences an 850ms gap before hearing a response. Building to a sub-300ms STT budget leaves the NLU and TTS layers room to operate without degrading the caller experience.
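The rough illustration above is simple addition, and writing it down makes it easy to swap in measurements from your own stack. The stage names and the helper functions are hypothetical; the figures are the estimates quoted in this section.

```python
from typing import Dict


def total_gap_ms(budget: Dict[str, float]) -> float:
    """Caller-perceived gap when the stages run one after another."""
    return sum(budget.values())


def stage_share(budget: Dict[str, float], stage: str) -> float:
    """Fraction of the total gap consumed by one stage."""
    return budget[stage] / total_gap_ms(budget)


budget = {"stt": 300, "nlu": 400, "tts": 150}
print(total_gap_ms(budget))  # 850

# The same pipeline with an STT layer that takes 500ms instead.
slow_stt = {**budget, "stt": 500}
print(total_gap_ms(slow_stt))  # 1050
```

This assumes the stages are strictly sequential. Pipelines that overlap stages, for example by starting NLU on partial transcripts, will see a smaller gap than the plain sum.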

Accuracy requirements in contact center audio environments

Speed matters once the transcript is accurate enough to act on. A 200ms transcript that reads "cancel my border" instead of "cancel my order" is not actionable. The contact center audio environment introduces several accuracy challenges that clean benchmark datasets miss entirely.

Word error rate on call center audio

Contact center telephony operates at 8 kHz narrowband, removing frequency content above 4 kHz. That bandwidth carries formant information critical for phoneme discrimination. When narrowband audio arrives with added reverberation, speaker overlap, and noise, WER degrades significantly even on models that test well on clean audio. Voice compression used in VoIP systems removes additional acoustic detail to reduce bandwidth costs, compounding the phoneme discrimination problem. Benchmarks run on studio-quality audio don't predict production WER in a contact center environment. Only testing on your actual call recordings does.
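When testing on your own call recordings, WER can be computed directly from the definition used in this article: substitutions plus deletions plus insertions, divided by the reference word count. The sketch below uses a standard word-level edit distance. Lowercasing and whitespace splitting are simplifying assumptions; a production evaluation would usually apply fuller text normalization first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


print(round(wer("cancel my order", "cancel my border"), 3))  # 0.333
```

One wrong word in a three-word utterance is already a 33% WER, which shows why short, intent-bearing phrases are so sensitive to single substitutions.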

Code-switching accuracy metrics

Mid-call language changes break most STT models because the underlying architecture assumes a fixed input language per session. When a caller in a Southeast Asian contact center switches from English to Tagalog mid-sentence, a model without native code-switching support either fails silently or returns garbled text. The NLU layer then attempts intent detection on input that contains no reliable signal. The technical mechanics of code-switching in ASR explain why this isn't a minor edge case: it's the default behavior in any global contact center with bilingual caller bases. Supporting code-switching across 100+ languages means the STT layer stays with the caller through the entire conversation.

Real-time STT in loud environments

Open-plan contact center floors, background chatter, keyboard clicks, and poorly isolated headsets all contribute noise that degrades real-time WER. Models trained and tested only on clean audio have no learned robustness for these conditions. The pipecat STT benchmark framework provides methodology for measuring partial transcript stability and latency on realistic audio conditions. Run any candidate STT provider through it before committing to production.

Diarization accuracy for multi-speaker calls

High-accuracy speaker attribution (diarization) depends on full-context analysis and is best achieved through post-processing workflows rather than real-time streams, particularly for compliance and analytics use cases. Real-time systems can provide interim speaker labels or rely on channel separation when calls are recorded on separate audio tracks, but once a speaker label is assigned in a real-time stream, that assignment is permanent with no ability to revise it later.

Why faster models aren't always better in production

The tension between latency and accuracy in streaming models is real and structural. It stems directly from audio chunk size: smaller chunks reduce time-to-first-partial but provide the model with less phonetic context per inference step, which increases the probability of errors on ambiguous or accented words. Streaming transcription operates with significantly less audio context than batch processing, which causes measurable accuracy degradation on alphanumeric strings, proper nouns, and domain-specific vocabulary.

Lightweight models optimized to run fast in demo conditions regularly underperform in production contact center environments. A transcription error forces a clarification exchange that adds meaningful latency to the conversation, which costs far more than the extra latency a heavier, more accurate model would have introduced.

The practical framework: if the use case is agent assist or real-time coaching, partial transcript stability at 200-300ms matters more than perfect final accuracy, because agents can scan an imperfect partial while the system corrects itself. If the use case is IVR routing where a misclassification results in a wrong queue transfer, final accuracy at or below 300ms is the target and partial stability is the enabling condition.

Achieving sub-300ms latency with Solaria-1

We built Solaria-1 to handle the multilingual, accented, noisy audio that contact centers generate, not the clean datasets that make benchmark numbers look good.

Solaria-1 real-time capabilities

Solaria-1 is Gladia's production model built for multilingual robustness and code-switching support, and it performs well in real-time workloads. It delivers approximately 270ms overall responsiveness via WebSocket, with partials emitted in under 103ms. This positions it competitively for conversational AI and live agent assist, though not as the absolute lowest-latency option on the market.

Measure final-transcript delivery timing on your own audio and telephony stack before committing production SLOs. Network conditions, chunk size, and caller demographics all shift observed latency in ways that benchmarks on clean audio cannot predict. Follow the WebSocket streaming walkthrough or the React TypeScript integration tutorial to see a real implementation.

Language coverage for global contact centers

Solaria-1 covers 100+ languages, including 42 that no other API-level STT competitor supports, among them Tagalog, Bengali, Punjabi, Tamil, Urdu, Persian, and Marathi. For contact centers running BPO operations in Southeast Asia, South Asia, or Latin America, the language coverage gap between providers is a direct operational risk.

"Excellent multilingual real-time transcription with smooth language switching... Superior accuracy on accented speech compared to competitors... Clean API, easy to integrate and deploy to production." - Yassine R. on G2

How to evaluate real-time transcription for your use case

Running a rigorous vendor evaluation on real-time STT takes less time than recovering from a bad production deployment. Here's what to measure and how.

Validate against live contact center audio

Your test set must reflect the audio your production system processes. Anonymize real calls from your production queues for PII but keep them acoustically authentic. Include samples across these categories:

8 kHz narrowband recordings from your telephony system
Calls with background noise, hold music bleed, and open-plan chatter
Alphanumeric input sequences (policy numbers, account IDs, confirmation codes)
Multi-speaker segments with varying overlap ratios
Accented speech from your actual caller demographic, segmented by region

STT models handling telephony audio frequently mis-transcribe numerical sequences — policy numbers, account IDs, and dollar amounts are precisely where a wrong digit creates a downstream error that stays invisible until a CRM record or fraud alert surfaces it. Do not use studio recordings, read-speech datasets, or the vendor's own benchmark audio.
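Because a single wrong digit matters more than the overall WER suggests, it can help to score numerical sequences separately. The following is a minimal sketch under two assumptions: both the reference and the STT output render numbers as numerals, and digit groups separated only by spaces or hyphens belong to one sequence. The function names are hypothetical.

```python
import re
from typing import List, Tuple


def digit_runs(text: str) -> List[str]:
    """Extract digit sequences, joining groups split by spaces or hyphens."""
    compact = re.sub(r"(?<=\d)[\s-]+(?=\d)", "", text)
    return re.findall(r"\d+", compact)


def digit_sequence_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Share of reference digit sequences reproduced exactly by the STT output."""
    total = 0
    correct = 0
    for reference, hypothesis in pairs:
        hyp_runs = digit_runs(hypothesis)
        for run in digit_runs(reference):
            total += 1
            if run in hyp_runs:
                correct += 1
    if total == 0:
        raise ValueError("no digit sequences found in the references")
    return correct / total


pairs = [
    ("policy number 4471 9920", "policy number 4471 9920"),
    ("account 58-1203", "account 58-1283"),  # one wrong digit
]
print(digit_sequence_accuracy(pairs))  # 0.5
```

The second pair differs by a single character, yet the whole account number counts as wrong, which matches how a CRM lookup would treat it.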

Verify sub-300ms latency targets and manage cost at scale

Measure TTFS and P50 latency in your own network environment, not the vendor's. Network upload time adds 20-100ms depending on your telephony stack and geographic distance to the inference endpoint, affecting your production numbers regardless of what the provider benchmarks. Set SLOs with P95 targets, not just P50, because tail latency in the worst 5% of requests is where callers actually experience the failure. A P95 TTFS at or below 300ms is a reasonable starting point for conversational use cases.

On the cost side, add-on pricing structures can break a model that looked conservative at the proof-of-concept stage. Diarization, named entity recognition, sentiment analysis, and translation are each priced as separate line items by some providers. Gladia's pricing includes all of these in the base rate on Starter and Growth plans. The per-hour model removes metering complexity from your unit economics: real-time transcription starts at $0.75/hr on Starter and as low as $0.25/hr on Growth, scaling directly with audio duration. On Growth and Enterprise plans, customer data is never used for model training and no opt-out action is required.
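The difference between the two pricing shapes is easy to model before a pilot. In the sketch below, the bundled rates are the Starter and Growth figures quoted above. The monthly volume and every metered rate are placeholders invented for illustration and do not describe any real provider.

```python
from typing import Dict


def bundled_cost_usd(audio_hours: float, rate_per_hour: float) -> float:
    """Cost when add-on features are included in one per-hour rate."""
    return audio_hours * rate_per_hour


def metered_cost_usd(
    audio_hours: float, base_rate: float, addon_rates: Dict[str, float]
) -> float:
    """Cost when each enabled feature adds its own per-hour charge."""
    return audio_hours * (base_rate + sum(addon_rates.values()))


hours = 10_000  # hypothetical monthly audio volume

starter = bundled_cost_usd(hours, 0.75)  # 7500.0
growth = bundled_cost_usd(hours, 0.25)   # 2500.0

# Placeholder rates only: a proof of concept with no add-ons versus the
# same hypothetical metered plan with three features switched on.
poc = metered_cost_usd(hours, 0.10, {})
production = metered_cost_usd(
    hours, 0.10, {"diarization": 0.05, "sentiment": 0.05, "entities": 0.05}
)
print(starter, growth)
print(round(production / poc, 2))  # 2.5
```

With these placeholder numbers the metered bill grows 2.5x between proof of concept and production without any change in audio volume, which is the pattern to check for when reading a rate card.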

Conduct a real-world pilot

Set SLOs before committing to a vendor instead of relying on anecdotes. Target P95 TTFS at or below 300ms and P95 final transcript delivery at or below 800ms for 3-second utterances, measured in production-equivalent conditions. Run the pilot against a subset of live traffic rather than a static test set, because live caller behavior introduces variability that recorded samples can't replicate. Multiple product teams report sub-24-hour integration timelines from API key to production traffic, meaning the evaluation cycle doesn't have to consume an engineering sprint.
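A pilot gate built on these two P95 targets could look like the sketch below. It uses the nearest-rank percentile method, which is one of several common definitions, and the function names and sample data are hypothetical.

```python
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling without float rounding
    return ordered[rank - 1]


def pilot_passes(
    ttfs_ms: List[float],
    final_ms: List[float],
    ttfs_slo_ms: float = 300,   # P95 TTFS target from the text
    final_slo_ms: float = 800,  # P95 final-delivery target from the text
) -> bool:
    """True when both P95 measurements meet their SLO."""
    return (
        percentile(ttfs_ms, 95) <= ttfs_slo_ms
        and percentile(final_ms, 95) <= final_slo_ms
    )


# Hypothetical pilot samples: 20 utterances, one slow outlier in each list.
ttfs = [200.0] * 18 + [280.0, 900.0]
final = [600.0] * 19 + [1200.0]

print(percentile(ttfs, 95))       # 280.0
print(pilot_passes(ttfs, final))  # True
```

With only 20 samples, a single outlier sits above the 95th percentile and is ignored, so a real pilot needs enough traffic for the tail to be meaningful.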

Key takeaways on latency in real-time STT models

The latency budget for real-time contact center applications is cumulative and unforgiving. Audio capture typically consumes 50–100ms before the STT model sees any input. STT inference must stay under 300ms. NLU time-to-first-token may add ~400ms depending on model and infrastructure. TTS first-chunk delivery may add ~150ms. The non-STT figures are industry estimates, so measure your own stack under production conditions. An STT provider that burns 400ms on transcription alone isn't just slow: it's consuming budget that belongs to the layers your product surfaces to users.

Unstable partial transcripts cause IVR systems to route calls based on intermediate text that changes before finalization, generating wrong-queue transfers that accumulate into measurable CSAT damage at scale.

For compliance-sensitive contact centers requiring verified speaker attribution, the correct architecture uses interim real-time labels for live agent guidance and performs high-accuracy diarization in post-processing on the full recording. Gladia's diarization, powered by pyannoteAI's Precision-2 model, is available in async workflows.

Offshore BPO operations serving English-speaking markets generate accented English audio that most models built for American English handle poorly. The WER gap between a clean-audio benchmark and real-world call center audio can reach 10-20 percentage points.

Start with 10 free hours and test Solaria-1's real-time transcription on your own contact center audio via WebSocket. Most teams are live in under a day, and Gladia engineers are available on Slack if the integration surfaces edge cases in your specific audio environment.

FAQs

What is a good latency benchmark for real-time STT?

Sub-300ms is the target for natural conversational flow in voice agents and contact center applications, matching the 200-300ms natural pause window in human conversation. Anything consistently above 500ms introduces noticeable lag that causes callers to talk over the system or abandon the call.

What is Time to First Segment (TTFS)?

TTFS measures the delay between a spoken word and its first appearance as text in the output stream. It is a critical metric for contact center real-time applications because NLU and routing logic begin acting on partial transcripts before the final segment locks.

How does code-switching break real-time IVR routing?

When a caller switches languages mid-sentence and the STT model lacks native code-switching support, the partial transcript degrades into garbled or mixed-language output, causing the IVR routing model to fire the wrong intent classifier and transfer the caller to the wrong queue. Gladia's code-switching documentation covers how to configure detection across the full 100+ language set.

What is the latency contribution of each pipeline component?

A well-optimized 2026 voice assistant pipeline might distribute latency roughly as follows: audio capture and voice activity detection (50–100ms), STT inference (~270ms overall responsiveness with Solaria-1, with final-transcript delivery to be measured on your own stack), NLU time-to-first-token (~400ms), and TTS first-chunk delivery (~150ms). The non-STT figures are industry estimates and will vary by model, infrastructure, and geography, so measure your own stack under production conditions. STT must stay under 300ms to leave NLU and TTS budget intact.

Key terms glossary

P50 latency: The median latency, meaning 50% of transcription requests complete faster than this value. It reflects the typical caller experience rather than best-case or worst-case performance.

Partial transcript stability: The consistency of transcribed text before the final output is locked. Low stability, meaning frequent changes to intermediate text, causes IVR routing and intent detection to act on incorrect inputs before the transcript finalizes.

Code-switching: The practice of alternating between two or more languages mid-conversation, common in global contact centers. Models without native code-switching support fail silently or return garbled output when callers switch languages mid-call.

TTFS (Time to First Segment): The delay between a spoken word and its first appearance as text in the real-time output stream. A key metric for evaluating how quickly downstream NLU models can begin processing intent.

WER (Word Error Rate): Calculated as (substitutions + deletions + insertions) divided by the total reference word count. Always report with a specific language, audio condition, and dataset to be meaningful.
