

How AI contact centers determine caller intent

Published on May 29, 2026
by Ani Ghazaryan

TL;DR: Caller intent routing fails at the transcription layer long before it fails at the NLU layer. If ASR misreads "cancel" as "candle" due to background noise or a non-native accent, no downstream classifier recovers the routing decision. This article covers the full intent pipeline: ASR, NLU, classification, and routing execution, the latency budgets that constrain real-time systems (~700ms total), and the audio conditions that break most production deployments.

Many CCaaS product teams obsess over their LLM routing logic while the transcription layer quietly feeds the model corrupted text. A system that transcribes "I want to cancel" as "I want a candle" because of background noise or a non-native accent will route that call to the wrong queue regardless of how sophisticated the downstream NLU is. The intent pipeline is only as reliable as its first layer.

Determining caller intent requires a precise pipeline: automatic speech recognition (ASR) captures the audio, natural language understanding (NLU) extracts the meaning, and machine learning classifiers or LLMs route the call. This article breaks down how these components interact, the latency budgets required for real-time routing, and how to build audio infrastructure that handles the messy reality of human conversation.

Automating call routing with intent recognition

Traditional IVR systems route callers through fixed menus requiring number presses or keyword matches. AI intent detection replaces that rigid structure with natural language routing. A caller says "I need to dispute a charge from last Thursday," and the system maps that to billing_dispute and routes accordingly without menu navigation.

The shift matters because callers don't describe their problems the way IVR designers expect. They use incomplete sentences, regional expressions, and multiple languages within the same call. An intent system built on modern ASR and NLU handles this variability at scale.

Intent's impact on CX & efficiency

Misrouted calls cost real money. Each one adds transfer overhead (a transfer, an agent handoff, a caller who restates their problem), and at high call volumes that compounds into a material line item on the operational cost model. Accurate intent detection means fewer transfers, lower average handle time, and less agent idle time waiting for misrouted callers to reconnect.

Beyond cost, misidentified intent breaks the customer journey. A caller who needs help with a billing dispute but gets routed to technical support and transferred twice is the predictable result of transcription errors compounding downstream. This kind of failure tends to surface through support tickets weeks after deployment rather than in internal QA.

Intent detection: real-time or batch

Intent detection splits into two workflows with different latency and accuracy trade-offs. Most CCaaS teams start with batch processing for post-call analytics before layering in real-time capabilities.

Batch processing via REST runs after the call ends. Batch workflows analyze the full recording before producing output, which can support more comprehensive speaker attribution and structured analysis. Latency tolerance is minutes rather than milliseconds, and the structured output feeds QA dashboards, CRM updates, customer journey mapping, and training datasets for future intent models.

Real-time routing processes audio as it streams, classifying intent and directing the call within the conversation. For live call routing, a total pipeline latency target under 700ms keeps the interaction feeling conversational, with the STT layer often consuming a significant portion of that budget. A WebSocket connection starts with an HTTP handshake that upgrades the protocol, then keeps a bidirectional channel open so each subsequent message carries only a small frame header (2 to 14 bytes under RFC 6455) rather than a full set of HTTP headers. Real-time is the right model for live call routing, agent coaching, and active compliance flagging.

Core components of AI intent systems

The AI intent pipeline passes audio through four functional layers: speech-to-text transcription, entity and meaning extraction, intent classification, and routing execution. Each layer adds latency and each introduces a possible accuracy degradation point.

Transcribing caller audio for intent

ASR converts caller audio into text for downstream NLU processing, and this layer sets the ceiling for everything else. A transcription layer below 10% WER on your audio distribution is a reasonable production target for reliable intent detection. Below that threshold, NLU and LLM models tend to produce more consistent results. Above it, semantic meaning degrades and routing failures multiply.

In contact center environments, WER matters across multiple conditions: background noise in BPO offices, mobile callers on variable-quality connections, non-native speakers with regional accents, and bilingual conversations that switch language mid-sentence. WER on your specific audio distribution, not WER on a benchmark dataset of clean English recordings, is the metric that determines whether your intent pipeline works in production.
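Measuring WER on your own audio distribution only requires reference transcripts and the ASR output for the same calls. The sketch below is a minimal word-level edit-distance implementation, not a full evaluation harness. The function names and the normalization step (lowercasing, stripping common punctuation) are assumptions, and normalization choices change the resulting number, so keep them identical across the providers you compare.

```typescript
// Minimal WER sketch: (substitutions + deletions + insertions) / reference words.
// The normalization below is an assumption; align it with your evaluation set.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[.,!?;:"()]/g, " ")
    .split(/\s+/)
    .filter((word) => word.length > 0);
}

function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference transcript must contain at least one word");
  }

  // Levenshtein distance over words, keeping only the previous row.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitutionCost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + substitutionCost, // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Two of four reference words are wrong, so WER is 0.5.
console.log(wordErrorRate("I want to cancel", "I want a candle"));
```

A single short utterance can show a very high WER from one or two errors, so aggregate over a representative sample of calls (total errors divided by total reference words) rather than averaging per-utterance scores.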

NLU for accurate intent processing

Once the transcript exists, the NLU layer extracts semantic meaning from it through three distinct sub-tasks:

  • Entity recognition: Identifying named entities like account numbers, dates, product names, and dollar amounts
  • Intent labeling: Mapping the utterance to a predefined intent category (e.g., payment_inquiry, cancellation_request, technical_support)
  • Confidence scoring: Assigning a probability to the classification for downstream handling of ambiguous cases

A concrete example: "My last payment didn't go through and I want to know why" carries the intent payment_failure_inquiry and an implicit recency signal ("last payment"). A degraded transcript ("My last payment didn't go for and I want to know why") loses the phrase that signals the failure, so entity resolution drops out and the classifier may assign the wrong intent entirely. Gladia includes named entity recognition as part of its audio intelligence features, so entity extraction runs without a separate API call.

Routing logic and execution

With intent and entities extracted, the system maps NLU output to predefined business logic. A billing_dispute intent with card_type: credit routes to one queue, while billing_dispute with card_type: debit routes to another. If the classifier returns a confidence score below a defined threshold, the system routes to a fallback handler or prompts for clarification rather than committing to a low-confidence classification.

The routing API then directs the call to the correct agent queue, self-service flow, or automated handler. The classified intent and extracted entities simultaneously write to the CRM record, tag the call for QA scoring, and populate the agent's screen before pickup.
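The routing rules described in this section can be sketched as a small pure function. Everything named here is hypothetical: the `NluResult` shape, the queue names, the `general` default queue, and the 0.75 threshold are illustrative assumptions rather than part of any specific routing API.

```typescript
// Hypothetical NLU output shape; adapt field names to your own classifier.
interface NluResult {
  intent: string;
  confidence: number; // probability in the range 0..1
  entities: Record<string, string>;
}

type RoutingDecision =
  | { kind: "queue"; queue: string }
  | { kind: "fallback"; reason: string };

// Assumed threshold; tune it against your own misroute and fallback rates.
const CONFIDENCE_THRESHOLD = 0.75;

function routeCall(
  result: NluResult,
  threshold: number = CONFIDENCE_THRESHOLD,
): RoutingDecision {
  // Low-confidence classifications never commit to a queue.
  if (result.confidence < threshold) {
    return { kind: "fallback", reason: "low_confidence" };
  }

  if (result.intent === "billing_dispute") {
    const cardType = result.entities["card_type"];
    if (cardType === "credit") {
      return { kind: "queue", queue: "billing_disputes_credit" };
    }
    if (cardType === "debit") {
      return { kind: "queue", queue: "billing_disputes_debit" };
    }
    // The intent is clear but the entity needed to pick a queue is missing.
    return { kind: "fallback", reason: "missing_card_type" };
  }

  // Assumed default for intents without a dedicated rule.
  return { kind: "queue", queue: "general" };
}
```

Returning a tagged decision instead of calling the routing API directly keeps the rules testable and lets the same decision object be written to the CRM record and QA tags.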

Key methods for caller intent detection

Intent classification has evolved through three distinct generations, each with different setup requirements, accuracy profiles, and latency characteristics.

Pattern-based intent classification

Rule-based systems use regex patterns and keyword matching to identify intent. If the transcript contains "cancel," the system triggers the cancellation intent. These systems are fast, simple to configure, and completely predictable.

The limitations are significant: pattern-based systems fail on synonyms ("I want to stop my service" doesn't match "cancel"), implicit intent, and any phrasing the rule author didn't anticipate. They also break when ASR transcription contains errors, because the exact keyword match no longer fires.
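A minimal sketch of a pattern-based classifier makes both failure modes concrete. The rule table and intent names below are hypothetical examples, not a recommended rule set.

```typescript
// Hypothetical rule table: the first pattern that matches decides the intent.
const RULES: Array<{ intent: string; pattern: RegExp }> = [
  { intent: "cancellation_request", pattern: /\bcancel(l?ing|l?ed|lation)?\b/i },
  { intent: "billing_dispute", pattern: /\bdispute\b/i },
];

function matchIntent(transcript: string): string | null {
  for (const rule of RULES) {
    if (rule.pattern.test(transcript)) {
      return rule.intent;
    }
  }
  return null; // no rule fired
}

console.log(matchIntent("I want to cancel")); // "cancellation_request"
console.log(matchIntent("I want to stop my service")); // null: synonym not covered
console.log(matchIntent("I want a candle")); // null: ASR error removed the keyword
```

The last two calls return null for different reasons: one is a gap in the rules, the other is a transcription error. From the router's point of view the two are indistinguishable, which makes rule gaps hard to diagnose when the transcript itself is unreliable.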

How ML models classify caller intent

Traditional machine learning classifiers, including support vector machines and early neural networks, learn intent categories from labeled training data. Given sufficient examples of billing_inquiry utterances, the model generalizes to new phrasing it hasn't seen before.

The trade-off is data dependency. These models require large labeled datasets and perform poorly on new intent categories that weren't in the training set. For contact centers with mature, stable intent taxonomies, ML classifiers remain cost-effective and predictable.

Fast intent extraction with zero-shot LLMs

Modern transformer models extract intent from natural-language descriptions without intent-specific training data. A zero-shot LLM classifies utterances against a taxonomy described in the prompt, enabling teams to add or change intents without retraining.

The trade-off is latency and cost. LLM inference adds meaningful milliseconds to the pipeline, which pushes against the 700ms total budget for real-time routing. Teams typically use a tiered approach: fast ML classifiers handle high-volume, well-defined intents in real time, while LLMs handle ambiguous or novel intents where additional latency is acceptable. For post-call batch analysis, LLM inference latency is not a constraint at all.
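One way to express the tiered approach is a wrapper that tries the fast classifier first and only calls the LLM when confidence is low, with a timeout so the LLM cannot exhaust the real-time budget. This is a sketch under stated assumptions: the `Classifier` signature, the 0.8 threshold, and the 300ms timeout are hypothetical, and both classifiers are injected rather than tied to any particular model or vendor.

```typescript
interface Classification {
  intent: string;
  confidence: number;
}

interface TieredResult extends Classification {
  tier: "fast" | "llm";
}

// Hypothetical signature: any ML model or LLM call can be wrapped to fit it.
type Classifier = (utterance: string) => Promise<Classification>;

// Resolves to null if the wrapped promise takes longer than `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T | null> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve(null), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

async function classifyTiered(
  utterance: string,
  fastClassifier: Classifier,
  llmClassifier: Classifier,
  opts = { fastThreshold: 0.8, llmTimeoutMs: 300 },
): Promise<TieredResult> {
  const fast = await fastClassifier(utterance);
  if (fast.confidence >= opts.fastThreshold) {
    return { ...fast, tier: "fast" };
  }
  // Ambiguous case: ask the LLM, but keep the fast answer if it times out.
  const llm = await withTimeout(llmClassifier(utterance), opts.llmTimeoutMs);
  return llm ? { ...llm, tier: "llm" } : { ...fast, tier: "fast" };
}
```

When the LLM times out, the low-confidence fast result is returned unchanged, so the caller can still apply its normal confidence threshold and fall back to clarification or escalation. An error thrown by the LLM call propagates to the caller in this sketch.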

Gladia's Audio-to-LLM pipeline structures transcripts and extracted entities into LLM-ready output, so teams route to any model without building the formatting layer themselves.

Intent technique suitability by use case

| Method | Setup time | Latency | Best use case |
| --- | --- | --- | --- |
| Pattern-based (regex/keyword) | Hours | Very low | Simple, narrow, high-volume intents |
| ML classifiers (SVM, neural) | Moderate (labeled data required) | Low to moderate | Stable, well-defined intent taxonomies |
| Zero-shot LLMs (transformers) | Fast (prompt engineering) | Varies by model | Complex, ambiguous, or evolving intents |

Designing AI call routing pipelines

Every real-time voice application operates within a latency budget: the total time from audio capture to system response that keeps the interaction feeling like a conversation rather than a processing delay.

Async post-call analysis

Most CCaaS platforms build their analytics infrastructure on batch processing workflows that run after the call ends. Full-recording analysis enables comprehensive QA scoring, accurate speaker attribution for coaching workflows, CRM field population with extracted entities, and the generation of structured training datasets for future intent models. Batch workflows process complete context before producing output, which delivers superior accuracy for speaker diarization, multilingual conversation handling, and entity extraction compared to real-time systems operating under strict latency constraints. For post-call analysis, latency tolerance is measured in minutes rather than milliseconds, which removes the primary constraint that forces real-time systems into accuracy trade-offs.

Setting AI latency targets

For real-time intent routing, a total pipeline budget around 700ms leaves a meaningful buffer before conversational flow breaks. Within that budget, the STT layer often represents the largest fixed cost, with the remaining time allocated to NLU processing, intent classification, and network round trips. If the STT layer consistently exceeds its allocation, the intent pipeline will miss its target regardless of how well-optimized the NLU layer is.

Identifying latency hotspots

Breaking down the 700ms budget reveals where time is typically lost:

  1. Network transit (inbound audio): Varies by geographic distance and connection quality
  2. STT inference: Represents the largest fixed cost in the pipeline, typically several hundred milliseconds for production-grade models on streaming audio
  3. NLU/intent classification: Timing varies by approach, with traditional ML classifiers generally faster than zero-shot LLMs
  4. Routing API execution: Adds latency for webhook calls and external routing logic

The STT inference step is the largest fixed cost in the pipeline and the hardest to compress without sacrificing accuracy. This is why the choice of STT provider has a disproportionate impact on whether the total pipeline stays within budget.
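A simple way to keep the budget honest is to record per-stage latency and check the sum against the target. In the sketch below, the stage names mirror the four steps listed above and the 270ms STT figure comes from the response time quoted elsewhere in this article. The other millisecond values are placeholders, not measurements, and should be replaced with percentiles from your own traces.

```typescript
interface StageLatency {
  stage: string;
  ms: number;
}

interface BudgetReport {
  totalMs: number;
  headroomMs: number;
  withinBudget: boolean;
  largestStage: string;
}

function checkLatencyBudget(stages: StageLatency[], budgetMs: number = 700): BudgetReport {
  if (stages.length === 0) {
    throw new Error("at least one stage is required");
  }
  const totalMs = stages.reduce((sum, s) => sum + s.ms, 0);
  const largest = stages.reduce((a, b) => (b.ms > a.ms ? b : a));
  return {
    totalMs,
    headroomMs: budgetMs - totalMs,
    withinBudget: totalMs <= budgetMs,
    largestStage: largest.stage,
  };
}

// Placeholder values except stt_inference; substitute measured p95 figures.
const report = checkLatencyBudget([
  { stage: "network_inbound", ms: 60 },
  { stage: "stt_inference", ms: 270 },
  { stage: "intent_classification", ms: 150 },
  { stage: "routing_api", ms: 80 },
]);

console.log(report);
// { totalMs: 560, headroomMs: 140, withinBudget: true, largestStage: "stt_inference" }
```

Budgeting against p95 or p99 latency rather than the mean is usually the safer choice, because routing has to meet the target on slow calls as well as typical ones.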

Latency: stream vs. batch data

WebSocket streaming maintains a persistent connection, processing audio chunks as they arrive. After the initial handshake to establish the connection, ongoing messages carry minimal framing overhead, enabling low-latency bidirectional communication. The stateful connection carries audio up and partial transcripts down in parallel, which is what makes real-time routing technically feasible at scale.

REST batch sends a complete audio file once the recording ends. This eliminates persistent connection overhead and reduces per-unit computational cost through parallelization, making it the right model for post-call analysis and QA workflows where latency tolerance is minutes rather than milliseconds.

How Gladia enables caller intent detection

Gladia's API converts raw audio into structured transcripts with word-level timestamps, speaker labels, and extracted entities that feed intent classification models. The primary workflow for most CCaaS platforms centers on post-call analytics: QA scoring, CRM population, customer journey mapping, and structured outputs for LLM pipelines that generate coaching insights and training datasets.

Solaria-1: transcription accuracy for post-call analytics and live routing

Solaria-1 is Gladia's production model, designed to handle noisy environments, accented speakers, and multilingual conversations, the messy reality of contact center recordings where clean studio conditions are the exception. Gladia's async benchmark evaluates Solaria-1 against multiple providers across diverse datasets, showing competitive WER performance on conversational speech and strong diarization accuracy. For intent detection specifically, lower WER translates directly to fewer entity extraction failures and fewer misclassified intents reaching downstream systems. For real-time routing use cases, Solaria-1 delivers <103ms partial-transcription latency and ~270ms average response time. The model supports 100+ languages with native code-switching.

Numerical accuracy matters separately from overall WER for contact centers handling financial data. One fintech customer reported 98.5% numerical accuracy on production audio through Gladia, where a misheard account number or dollar amount corrupts CRM entries and breaks downstream automation regardless of how well the intent was classified.

Speaker diarization for multi-party calls

Gladia's speaker diarization, powered by pyannoteAI's Precision-2 model, is available in asynchronous workflows. Accurate speaker clustering in batch mode benefits from analyzing the full recording before assigning labels.

Post-call diarization enables customer journey mapping and intent resolution analysis by attributing each utterance to caller or agent. This data feeds QA scoring, coaching workflows, and training datasets for future intent models. The speaker diarization webinar covers the technical architecture for teams evaluating it for production workflows.

Gladia intent detection: code workflow

The following TypeScript example opens a WebSocket connection to Gladia's live transcription API and passes each final transcript message to an intent classification function.

// Step 1: Initialize the session with a POST request
const initResponse = await fetch('https://api.gladia.io/v2/live', {
  method: 'POST',
  headers: {
    'x-gladia-key': 'YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    language_config: { languages: ['en'], code_switching: true },
  }),
});

const { url } = await initResponse.json();

// Step 2: Connect to the WebSocket using the returned session URL
const ws = new WebSocket(url);

ws.addEventListener('open', () => {
  ws.send(JSON.stringify({
    type: 'START',
    language_behaviour: 'automatic',
  }));
  streamAudioChunks().forEach(chunk => {
    ws.send(chunk);
  });
});

ws.addEventListener('message', (event) => {
  const result = JSON.parse(event.data);

  if (result.type === 'final') {
    classifyIntent(result);
  }
});

function classifyIntent(result: any) {
  // Field names follow the sample final payload shown after this example:
  // result.transcript.text contains the transcribed text
  // result.transcript.language contains the detected language
  // result.transcript.words contains word-level timestamps
  // Note: This is a minimal example. Production code should include error handling
  // for connection failures, message parsing errors, and connection closure events.
  console.log(result.transcript.text, result.transcript.language);
}

This example follows the two-step init flow documented in Gladia's live STT quickstart.

A final transcript payload from Gladia's live API returns structured JSON ready for downstream NLU processing:

{
  "type": "final",
  "transcript": {
    "text": "I want to cancel my subscription starting next month",
    "language": "en",
    "confidence": 0.97,
    "words": [
      { "word": "I", "start": 0.00, "end": 0.10 },
      { "word": "want", "start": 0.10, "end": 0.30 },
      { "word": "cancel", "start": 0.50, "end": 0.80 }
    ]
  }
}


This structured output routes directly to an LLM or ML classifier without additional formatting. Full API reference is in Gladia's documentation, and the Gladia SDK walkthrough covers connection setup for teams starting integration.

Async and real-time intent detection

For post-call analytics workflows, Gladia's batch processing via REST delivers structured outputs for QA scoring, CRM updates, and training dataset generation, with full-context accuracy on speaker attribution and entity extraction. For teams implementing real-time routing, Gladia's WebSocket integration uses a persistent connection that minimizes per-request overhead after the initial handshake. With ~270ms average response time and <103ms on partials, the STT layer stays within real-time latency requirements and leaves budget for NLU and routing API calls.

Preventing AI intent system stalls

Even a well-architected pipeline fails in production if it doesn't handle the specific audio conditions of real contact center environments.

Mitigating noisy call audio

Contact center audio is not clean. BPO environments have background chatter, callers phone from mobile devices in public spaces, and hold music occasionally bleeds into active call audio. Models trained exclusively on clean recordings degrade in these conditions in ways that only surface through production error rates, not pre-deployment tests.

Solaria-1 is designed to handle diverse, real-world audio conditions including environmental noise, variable recording quality, and accented speech. Aircall, which processes over 1M calls per week through Gladia, cut transcription time by 95%, from 30 minutes to 1.5 minutes per call, on production contact center audio.

Intent detection in code-switching

Code-switching, the practice of changing languages mid-sentence, is common in multilingual contact center environments. A caller on a Southeast Asian BPO support line might open in English and complete their sentence in Tagalog. Most ASR systems handle this poorly, either returning garbled text for the language-switched segment or requiring a session restart that breaks the real-time pipeline.

Gladia's code-switching support detects mid-conversation language changes automatically across all 100+ supported languages in both real-time and async modes. For best accuracy and latency, providing a small set of expected languages is recommended. When code-switching breaks the ASR layer, downstream intent classification may fail or route to fallback handlers.

Fallback strategies for unclear intent

No intent pipeline achieves 100% confident classification on every call. Three standard fallback patterns handle ambiguous intent:

  1. Clarification prompt: The system plays a targeted prompt asking the caller to restate their need ("It sounds like you may have a billing question. Is that right?")
  2. Human escalation: Calls below threshold route to a general queue or senior agent with the partial transcript and confidence score attached as context
  3. Multi-intent logging: Systems that parse multi-intent utterances route to the primary intent while logging the secondary for follow-up

Rising fallback rates without a change in call volume typically indicate a degradation in ASR accuracy rather than a change in caller behavior, pointing back to the STT layer as the investigation starting point.
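That heuristic can be turned into a simple monitoring check. The sketch below compares a current window against a baseline window. The `WindowStats` shape and both default tolerances are assumptions to tune against your own traffic, and a positive result is a prompt to inspect the STT layer, not proof of ASR degradation.

```typescript
interface WindowStats {
  calls: number; // total calls in the window
  fallbacks: number; // calls that hit any fallback path
}

function fallbackRate(stats: WindowStats): number {
  return stats.calls === 0 ? 0 : stats.fallbacks / stats.calls;
}

// True when the fallback rate rose by more than `maxRateIncrease` (absolute)
// while call volume stayed within `volumeTolerance` (relative) of the baseline.
function suggestsAsrDegradation(
  baseline: WindowStats,
  current: WindowStats,
  maxRateIncrease: number = 0.05,
  volumeTolerance: number = 0.2,
): boolean {
  if (baseline.calls === 0) {
    return false; // nothing to compare against
  }
  const volumeChange = Math.abs(current.calls - baseline.calls) / baseline.calls;
  const rateIncrease = fallbackRate(current) - fallbackRate(baseline);
  return volumeChange <= volumeTolerance && rateIncrease > maxRateIncrease;
}

// Volume is roughly flat (+5%) while the fallback rate roughly doubles.
console.log(
  suggestsAsrDegradation({ calls: 1000, fallbacks: 80 }, { calls: 1050, fallbacks: 160 }),
); // true
```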

Solving intent detection challenges

WER thresholds and STT latency

A WER below 10% on your audio distribution is a reasonable production target for reliable intent classification. Gladia's benchmark methodology shows competitive WER performance on conversational speech. The gap between a system operating near this threshold and one significantly above it is the difference between an intent pipeline that routes reliably and one that generates a constant stream of fallbacks and escalations.

Teams self-hosting open-source ASR models often report WER above 10% on noisy or accented audio, with infrastructure overhead adding DevOps cost on top of the accuracy penalty. Keeping STT latency under 300ms leaves roughly 400ms of the 700ms pipeline budget for NLU and routing. Solaria-1's ~270ms latency on streaming audio is benchmarked across diverse audio conditions, not only clean English, as Gladia's blind STT model comparison demonstrates across multiple audio types.

Can intent models retrain on our call data?

This is one of the most important questions in contact center vendor evaluation, and the answer is often buried in contract terms. Most ASR vendors reserve the right to use submitted audio for model improvement by default, with the opt-out tucked into enterprise addenda. The questions to ask any vendor: what is the default at each pricing tier, and is protection automatic or opt-in? On Gladia's Growth and Enterprise plans, customer audio is never used for model training, and no opt-out is required. On the Starter plan, data can be used for training by default. Full compliance documentation is at the compliance hub.

Benchmarking intent accuracy at scale

The cost model for STT-based intent detection is linear with audio volume and predictable when pricing is per hour. Gladia includes audio intelligence features such as diarization, named entity recognition, sentiment analysis (text-based, derived from the transcript), summarization, translation, and custom vocabulary in its transcription offerings.

The table below shows projected monthly costs at three volume levels using Gladia's public pricing. Async rates apply to post-call batch processing, and real-time rates apply to live call routing via WebSocket.

| Monthly volume | Starter async ($0.61/hr) | Growth async ($0.20/hr) | Growth real-time ($0.25/hr) |
| --- | --- | --- | --- |
| 100 hours | $61 | $20 | $25 |
| 1,000 hours | $610 | $200 | $250 |
| 10,000 hours | $6,100 | $2,000 | $2,500 |

All prices shown in USD.
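Because the model is linear, the projections above reduce to hours multiplied by the hourly rate. The helper below reproduces the table using the rates quoted in it. The function and plan key names are hypothetical, and the calculation deliberately ignores free allowances and volume discounts.

```typescript
// Hourly rates in USD, taken from the table above.
const RATES_USD_PER_HOUR = {
  starterAsync: 0.61,
  growthAsync: 0.2,
  growthRealtime: 0.25,
} as const;

type Plan = keyof typeof RATES_USD_PER_HOUR;

// Linear projection only: no free hours or volume-based pricing applied.
function monthlyCostUsd(hours: number, plan: Plan): number {
  if (hours < 0) {
    throw new Error("hours must be non-negative");
  }
  // Round to whole cents to avoid floating-point residue.
  return Math.round(hours * RATES_USD_PER_HOUR[plan] * 100) / 100;
}

console.log(monthlyCostUsd(1000, "starterAsync")); // 610
console.log(monthlyCostUsd(1000, "growthAsync")); // 200
console.log(monthlyCostUsd(1000, "growthRealtime")); // 250
```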

At higher monthly volumes, the Growth plan offers volume-based pricing with the same all-inclusive features. For a CCaaS platform at enterprise scale, the cost difference is material, with no trade-offs on diarization, NER, or other audio intelligence features. Full details are on the pricing page.

Start with 10 free hours included in the Starter plan each month. Test Solaria-1 on your own noisy, multilingual contact center audio and measure the impact on downstream intent accuracy directly.

FAQs

What WER threshold should an intent pipeline target for reliable routing?

A WER at or below 10% on your audio distribution is a reasonable production target. Above that threshold, semantic meaning tends to degrade and routing failures increase. Gladia's async benchmark shows competitive WER performance on conversational speech across diverse datasets.

Does Gladia's speaker diarization work in real-time streaming mode?

Gladia's speaker diarization, powered by pyannoteAI's Precision-2 model, is available in asynchronous workflows, where batch mode benefits from analyzing the full recording for more comprehensive speaker clustering. For contact centers requiring speaker attribution, post-call diarization enables accurate customer journey mapping, intent resolution analysis, QA scoring, coaching workflows, and training datasets for future intent models.

What is the total latency budget for real-time caller intent routing?

A target of under 700ms from first spoken word to routing decision keeps the interaction feeling conversational, with the STT layer representing a significant portion of that budget and the remainder allocated to NLU/LLM classification and routing API execution. Solaria-1 delivers <103ms on partial transcriptions, leaving substantial budget available for downstream processing.

Does Gladia use contact center audio to retrain its models?

On Growth and Enterprise plans, customer audio is never used for model training, and no opt-out is required, making those the relevant tiers for regulated contact center audio. On the Starter plan, data can be used for training by default. Full details, including SOC 2 Type II and GDPR compliance documentation, are at the compliance hub.

Key terms

Word error rate (WER): The percentage of words in a transcript that differ from the correct transcription, calculated as (substitutions plus deletions plus insertions) divided by total reference words. A WER of 10% means roughly one error per 10-word sentence, which is a reasonable production target for reliable intent classification.

Code-switching: The practice of alternating between two or more languages within a single conversation or sentence. In contact center audio, code-switching breaks most ASR systems that require a fixed language parameter, causing silent transcript failures for the switched segments and corrupting downstream intent classification.

Latency budget: The total time allocated for a complete pipeline operation, distributed across its component steps. For real-time intent routing, systems typically target latency budgets around 700ms or lower, with ASR, NLU classification, and routing API execution as the primary components consuming that budget.

Diarization error rate (DER): A metric measuring the accuracy of speaker attribution in a multi-party transcript, calculated as the fraction of audio incorrectly assigned or left unattributed. Gladia's diarization, powered by pyannoteAI's Precision-2 model, delivers competitive DER performance in async workflows.
