Gladia x Rime | Building better CX agents with STT and TTS
Published on Dec 23, 2025
Jean-Louis Quéguiner
Despite rapid advances in voice agent stacks, including speech-to-text (STT), text-to-speech (TTS), and large language models (LLMs), the real-world promise of fully autonomous voice assistants remains largely unmet.
In a recent webinar, our CEO, Jean-Louis, spoke with Lily Clifford, a speech technology researcher and founder of Rime, a company specialized in TTS, to unpack what’s working, what isn’t, and what truly matters when building voice agents that deliver an on-call experience that matches user expectations.
You can watch our conversation below or read on for a practical summary of key learnings from the field.
Why human-like TTS can hurt voice agent performance
Since the earliest days of TTS, the end goal for many teams has been the same: make the voice sound human, warm, and natural. But one of the most surprising insights from the discussion was that this goal can sometimes backfire.
As Lily put it during the webinar:
“Overwhelmingly, every point of data that we have shows that if you hear professional voice actor TTS, you’re more likely to hang up.”
In demos, highly expressive voices often sound impressive. But in real customer interactions, especially over the phone, they can feel unnatural or even suspicious.
Another important point Lily made was about first impressions:
“If you hang up after the first thing the agent says, you never get to experience how good the system actually is.”
In other words, the voice sets expectations instantly. If that expectation feels wrong, the conversation ends before the intelligence of the system ever has a chance to show.
STT and TTS accuracy vs quality: why precision matters more
Another recurring theme from the webinar was the difference between accuracy and quality — two concepts that are often conflated.
When discussing STT and TTS performance, teams usually focus on:
Quality, meaning how natural or fluent the output sounds
Accuracy, meaning whether the output is correct
But in production voice agents, there’s a more important distinction: precision.
As Jean-Louis explained during the webinar:
“You can be 99% accurate, but if the only thing you get wrong is the first name, that’s enough to kill the experience.”
Precision on critical entities matters far more than global averages — and this is something many evaluation methods fail to capture.
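The gap between global accuracy and entity-level precision is easy to see with numbers. The following sketch uses an invented transcript and customer name purely for illustration: one wrong word out of twenty-one still yields about 95% word accuracy, but if that word is the customer's name, precision on critical entities is zero.

```python
# Illustration only: the transcript, the name, and the entity set are invented.

def word_accuracy(reference, hypothesis):
    """Fraction of reference words reproduced at the same position."""
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / len(reference)

reference = ("hi maria your appointment is confirmed for tuesday at three "
             "please bring your insurance card and a photo id thank you").split()
# One error out of 21 words -- but it is the customer's name.
hypothesis = [w if w != "maria" else "mario" for w in reference]

critical_entities = {"maria"}

overall = word_accuracy(reference, hypothesis)
entity_hits = sum(1 for r, h in zip(reference, hypothesis)
                  if r in critical_entities and r == h)
entity_precision = entity_hits / len(critical_entities)

print(f"overall word accuracy:     {overall:.0%}")            # 95%
print(f"critical-entity precision: {entity_precision:.0%}")   # 0%
```

An evaluation that only tracks the first number would call this transcript a success; a customer named Maria would not.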
Public ASR benchmarks are useful for research and high-level comparison, but they often fall short when applied to real-world voice agents.
Enterprise voice systems don’t operate on clean, labeled datasets. They operate in environments with:
Emerging or proprietary brand names
Customer names and addresses
Email addresses and alphanumeric identifiers
Code-switching between languages
Noisy audio and telephone artifacts
One example discussed in the webinar highlighted how benchmarks can even penalize correct behavior. In bilingual calls, a model may accurately transcribe speech in multiple languages, while the benchmark’s ground truth labels everything as “foreign language.” In this case, the model is right — but the evaluation framework is wrong.
This leads to a dangerous outcome: models that are optimized for benchmarks rather than for customer impact. When benchmarks don’t measure what matters in production, they incentivize the wrong trade-offs.
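To make the failure mode concrete, here is a toy word-error-rate calculation against a ground truth that collapses a Spanish span into a placeholder tag. The transcripts and the "<foreign>" convention are invented for illustration; the point is that the model transcribing the Spanish correctly scores worse than the one that just emits the flawed label.

```python
# Toy illustration of a benchmark penalizing correct code-switching.
# Transcripts and the "<foreign>" placeholder convention are invented.

def wer(reference, hypothesis):
    """Word error rate via Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Benchmark ground truth collapses the Spanish span into a tag.
ground_truth = "i would like to <foreign> please"
# Model A transcribes the Spanish correctly -- and gets penalized.
model_a = "i would like to cambiar mi cita please"
# Model B just emits the tag, matching the flawed label exactly.
model_b = "i would like to <foreign> please"

print(wer(ground_truth, model_a))  # 0.5 -- punished for being right
print(wer(ground_truth, model_b))  # 0.0 -- rewarded for matching the label
```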
Evaluation should reflect business outcomes, not just synthetic test scores.
Voice agent design differences between inbound and outbound calls
Not all voice interactions are the same, and one important distinction is whether a call is inbound or outbound.
In inbound calls, the system has little to no prior context. The caller could be reaching out for many different reasons, which means broader vocabularies, more ambiguity, and greater tolerance for exploratory interaction.
Outbound calls are very different. The system usually knows who it is calling and why. In these scenarios, users are far less tolerant of latency or misunderstanding. Precision and speed become critical, especially when confirming names, appointments, or transactions.
Designing effective voice agents requires acknowledging these differences and configuring STT, TTS, and orchestration logic accordingly.
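One way to make that configuration explicit is a per-direction call profile. The sketch below is hypothetical: every field name, threshold, and vocabulary hint is invented, and real orchestration layers will differ. It only illustrates the shape of the trade-off: inbound calls get a broader vocabulary and a looser latency budget, while outbound calls get known-context hints, a tighter budget, and explicit confirmation of critical entities.

```python
# Hypothetical configuration sketch; all field names and values are invented.
from dataclasses import dataclass, field

@dataclass
class CallProfile:
    direction: str
    max_response_latency_ms: int          # response-time budget
    vocabulary_hints: list = field(default_factory=list)
    confirm_critical_entities: bool = False

INBOUND = CallProfile(
    direction="inbound",
    max_response_latency_ms=800,   # more tolerance for exploratory interaction
    vocabulary_hints=[],           # intent unknown, keep vocabulary broad
)

OUTBOUND = CallProfile(
    direction="outbound",
    max_response_latency_ms=400,   # precision and speed are critical
    vocabulary_hints=["appointment", "reschedule", "confirmation number"],
    confirm_critical_entities=True,  # read back names and dates before acting
)
```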
Latency in voice agents is about perception, not just speed
Latency is often treated as a purely technical metric, measured in milliseconds. But in conversational systems, latency is as much about perception as it is about speed.
If a system responds too slowly, users feel like they weren’t heard. If it responds too quickly, it can feel unnatural or interruptive. What matters is conversational rhythm — the timing that aligns with human expectations of turn-taking.
Optimizing latency isn’t about minimizing it at all costs. It’s about making responses feel timely and intentional within the flow of conversation.
Why A/B testing is essential for STT and TTS systems
The most successful voice teams don’t rely on intuition or static configurations. They test.
As Lily shared:
“Some of our customers are running forty different voices in production and seeing which one reduces abandonment.”
Based on her experience, most voice teams today run experiments across:
Different voices and speaking styles
Turn-taking strategies
Latency thresholds
Precision rules for critical entities
Small changes in these parameters can have an outsized impact on conversion rates, call completion, or customer satisfaction. At scale, even a one percent improvement can translate into significant business value.
For voice agents, experimentation isn’t optional — it’s foundational.
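A minimal version of that kind of experiment can be sketched in a few lines. The variant names and log format below are invented; the pattern is deterministic bucketing (hashing a stable call ID so a call always hears the same voice) plus a per-variant abandonment tally.

```python
# Minimal A/B bucketing sketch; variant names and log shape are invented.
import hashlib

VOICE_VARIANTS = ["warm_low_pitch", "neutral_fast", "calm_slow"]

def assign_variant(call_id: str) -> str:
    """Hash the call ID so the same call always gets the same voice."""
    digest = hashlib.sha256(call_id.encode()).hexdigest()
    return VOICE_VARIANTS[int(digest, 16) % len(VOICE_VARIANTS)]

def abandonment_rates(logs):
    """Aggregate abandonment per variant from (call_id, abandoned) pairs."""
    totals, drops = {}, {}
    for call_id, abandoned in logs:
        v = assign_variant(call_id)
        totals[v] = totals.get(v, 0) + 1
        drops[v] = drops.get(v, 0) + int(abandoned)
    return {v: drops[v] / totals[v] for v in totals}
```

In production this would feed a proper significance test rather than raw rates, but the core mechanic — stable assignment, then comparing an outcome metric across variants — is the same.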
Voice agents are user experiences, not just STT and TTS models
It’s tempting to think of voice agents as pipelines: STT feeds into an LLM, which feeds into TTS. But the reality is far more complex.
Voice agents are user experiences. Many of the hardest problems don’t live inside model weights — they live in orchestration, integration, evaluation, and design.
Building better voice agents requires systems thinking, continuous measurement, and a deep understanding of how humans perceive and interact with voice interfaces.
How successful teams build production-ready voice agents
From the patterns discussed in the webinar, high-performing teams tend to:
Prioritize precision over perceived realism
Design differently for inbound and outbound interactions
Test and iterate continuously in production
Measure success through business metrics like conversion and churn
Treat evaluation as a product problem, not just a modeling task
These teams focus less on theoretical perfection and more on real-world performance.
Conclusion: building better voice agents with STT and TTS
If there’s one takeaway from this discussion, it’s this: in voice interfaces, “good enough” is not good enough.
Accuracy, precision, latency, and conversational flow all have disproportionate impact at scale. Building better voice agents with STT and TTS means grounding technical decisions in real user behavior and real business outcomes.
The technology is advancing quickly — but the teams that win will be the ones that design for how voice is actually experienced.