Gladia x Rime | Building better CX agents with STT and TTS
Published on Dec 23, 2025
Jean-Louis Quéguiner
Despite rapid advances in voice agent stacks, including speech-to-text (STT), text-to-speech (TTS), and large language models (LLMs), the real-world promise of fully autonomous voice assistants remains largely unmet.
In a recent webinar, our CEO, Jean-Louis, spoke with Lily Clifford, a speech technology researcher and founder of Rime, a company specialized in TTS, to unpack what’s working, what isn’t, and what truly matters when building voice agents that deliver an on-call experience that matches user expectations.
You can watch our conversation below or read on for a practical summary of key learnings from the field.
Why human-like TTS can hurt voice agent performance
Since the earliest days of TTS, the end goal for many teams has been the same: make the voice sound human, warm, and natural. But one of the most surprising insights from the discussion was that this goal can sometimes backfire.
As Lily put it during the webinar:
“Overwhelmingly, every point of data that we have shows that if you hear professional voice actor TTS, you’re more likely to hang up.”
In demos, highly expressive voices often sound impressive. But in real customer interactions, especially over the phone, they can feel unnatural or even suspicious.
Another important point Lily made was about first impressions:
“If you hang up after the first thing the agent says, you never get to experience how good the system actually is.”
In other words, the voice sets expectations instantly. If that expectation feels wrong, the conversation ends before the intelligence of the system ever has a chance to show.
STT and TTS accuracy vs quality: why precision matters more
Another recurring theme from the webinar was the difference between accuracy and quality — two concepts that are often conflated.
When discussing STT and TTS performance, teams usually focus on:
Quality, meaning how natural or fluent the output sounds
Accuracy, meaning whether the output is correct
But in production voice agents, there’s a more important distinction: precision.
As Jean-Louis explained during the webinar:
“You can be 99% accurate, but if the only thing you get wrong is the first name, that’s enough to kill the experience.”
Precision on critical entities matters far more than global averages — and this is something many evaluation methods fail to capture.
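The gap between global accuracy and entity-level precision is easy to see with numbers. The following sketch uses an invented transcript and customer name purely for illustration: one wrong word out of twenty-one still yields about 95% word accuracy, but if that word is the customer's name, precision on critical entities is zero.

```python
# Illustration only: the transcript, the name, and the entity set are invented.

def word_accuracy(reference, hypothesis):
    """Fraction of reference words reproduced at the same position."""
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / len(reference)

reference = ("hi maria your appointment is confirmed for tuesday at three "
             "please bring your insurance card and a photo id thank you").split()
# One error out of 21 words -- but it is the customer's name.
hypothesis = [w if w != "maria" else "mario" for w in reference]

critical_entities = {"maria"}

overall = word_accuracy(reference, hypothesis)
entity_hits = sum(1 for r, h in zip(reference, hypothesis)
                  if r in critical_entities and r == h)
entity_precision = entity_hits / len(critical_entities)

print(f"overall word accuracy:     {overall:.0%}")            # 95%
print(f"critical-entity precision: {entity_precision:.0%}")   # 0%
```

An evaluation that only tracks the first number would call this transcript a success; a customer named Maria would not.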
Public ASR benchmarks are useful for research and high-level comparison, but they often fall short when applied to real-world voice agents.
Enterprise voice systems don’t operate on clean, labeled datasets. They operate in environments with:
Emerging or proprietary brand names
Customer names and addresses
Email addresses and alphanumeric identifiers
Code-switching between languages
Noisy audio and telephone artifacts
One example discussed in the webinar highlighted how benchmarks can even penalize correct behavior. In bilingual calls, a model may accurately transcribe speech in multiple languages, while the benchmark’s ground truth labels everything as “foreign language.” In this case, the model is right — but the evaluation framework is wrong.
This leads to a dangerous outcome: models that are optimized for benchmarks rather than for customer impact. When benchmarks don’t measure what matters in production, they incentivize the wrong trade-offs.
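To make the failure mode concrete, here is a toy word-error-rate calculation against a ground truth that collapses a Spanish span into a placeholder tag. The transcripts and the "<foreign>" convention are invented for illustration; the point is that the model transcribing the Spanish correctly scores worse than the one that just emits the flawed label.

```python
# Toy illustration of a benchmark penalizing correct code-switching.
# Transcripts and the "<foreign>" placeholder convention are invented.

def wer(reference, hypothesis):
    """Word error rate via Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Benchmark ground truth collapses the Spanish span into a tag.
ground_truth = "i would like to <foreign> please"
# Model A transcribes the Spanish correctly -- and gets penalized.
model_a = "i would like to cambiar mi cita please"
# Model B just emits the tag, matching the flawed label exactly.
model_b = "i would like to <foreign> please"

print(wer(ground_truth, model_a))  # 0.5 -- punished for being right
print(wer(ground_truth, model_b))  # 0.0 -- rewarded for matching the label
```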
Evaluation should reflect business outcomes, not just synthetic test scores.
Voice agent design differences between inbound and outbound calls
Not all voice interactions are the same, and one important distinction is whether a call is inbound or outbound.
In inbound calls, the system has little to no prior context. The caller could be reaching out for many different reasons, which means broader vocabularies, more ambiguity, and greater tolerance for exploratory interaction.
Outbound calls are very different. The system usually knows who it is calling and why. In these scenarios, users are far less tolerant of latency or misunderstanding. Precision and speed become critical, especially when confirming names, appointments, or transactions.
Designing effective voice agents requires acknowledging these differences and configuring STT, TTS, and orchestration logic accordingly.
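One way to make that configuration explicit is a per-direction call profile. The sketch below is hypothetical: every field name, threshold, and vocabulary hint is invented, and real orchestration layers will differ. It only illustrates the shape of the trade-off: inbound calls get a broader vocabulary and a looser latency budget, while outbound calls get known-context hints, a tighter budget, and explicit confirmation of critical entities.

```python
# Hypothetical configuration sketch; all field names and values are invented.
from dataclasses import dataclass, field

@dataclass
class CallProfile:
    direction: str
    max_response_latency_ms: int          # response-time budget
    vocabulary_hints: list = field(default_factory=list)
    confirm_critical_entities: bool = False

INBOUND = CallProfile(
    direction="inbound",
    max_response_latency_ms=800,   # more tolerance for exploratory interaction
    vocabulary_hints=[],           # intent unknown, keep vocabulary broad
)

OUTBOUND = CallProfile(
    direction="outbound",
    max_response_latency_ms=400,   # precision and speed are critical
    vocabulary_hints=["appointment", "reschedule", "confirmation number"],
    confirm_critical_entities=True,  # read back names and dates before acting
)
```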
Latency in voice agents is about perception, not just speed
Latency is often treated as a purely technical metric, measured in milliseconds. But in conversational systems, latency is as much about perception as it is about speed.
If a system responds too slowly, users feel like they weren’t heard. If it responds too quickly, it can feel unnatural or interruptive. What matters is conversational rhythm — the timing that aligns with human expectations of turn-taking.
Optimizing latency isn’t about minimizing it at all costs. It’s about making responses feel timely and intentional within the flow of conversation.
Why A/B testing is essential for STT and TTS systems
The most successful voice teams don’t rely on intuition or static configurations. They test.
As Lily shared:
“Some of our customers are running forty different voices in production and seeing which one reduces abandonment.”
Based on her experience, most voice teams today run experiments across:
Different voices and speaking styles
Turn-taking strategies
Latency thresholds
Precision rules for critical entities
Small changes in these parameters can have an outsized impact on conversion rates, call completion, or customer satisfaction. At scale, even a one percent improvement can translate into significant business value.
For voice agents, experimentation isn’t optional — it’s foundational.
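A minimal version of that kind of experiment can be sketched in a few lines. The variant names and log format below are invented; the pattern is deterministic bucketing (hashing a stable call ID so a call always hears the same voice) plus a per-variant abandonment tally.

```python
# Minimal A/B bucketing sketch; variant names and log shape are invented.
import hashlib

VOICE_VARIANTS = ["warm_low_pitch", "neutral_fast", "calm_slow"]

def assign_variant(call_id: str) -> str:
    """Hash the call ID so the same call always gets the same voice."""
    digest = hashlib.sha256(call_id.encode()).hexdigest()
    return VOICE_VARIANTS[int(digest, 16) % len(VOICE_VARIANTS)]

def abandonment_rates(logs):
    """Aggregate abandonment per variant from (call_id, abandoned) pairs."""
    totals, drops = {}, {}
    for call_id, abandoned in logs:
        v = assign_variant(call_id)
        totals[v] = totals.get(v, 0) + 1
        drops[v] = drops.get(v, 0) + int(abandoned)
    return {v: drops[v] / totals[v] for v in totals}
```

In production this would feed a proper significance test rather than raw rates, but the core mechanic — stable assignment, then comparing an outcome metric across variants — is the same.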
Voice agents are user experiences, not just STT and TTS models
It’s tempting to think of voice agents as pipelines: STT feeds into an LLM, which feeds into TTS. But the reality is far more complex.
Voice agents are user experiences. Many of the hardest problems don’t live inside model weights — they live in orchestration, integration, evaluation, and design.
Building better voice agents requires systems thinking, continuous measurement, and a deep understanding of how humans perceive and interact with voice interfaces.
How successful teams build production-ready voice agents
From the patterns discussed in the webinar, high-performing teams tend to:
Prioritize precision over perceived realism
Design differently for inbound and outbound interactions
Test and iterate continuously in production
Measure success through business metrics like conversion and churn
Treat evaluation as a product problem, not just a modeling task
These teams focus less on theoretical perfection and more on real-world performance.
Conclusion: building better voice agents with STT and TTS
If there’s one takeaway from this discussion, it’s this: in voice interfaces, “good enough” is not good enough.
Accuracy, precision, latency, and conversational flow all have disproportionate impact at scale. Building better voice agents with STT and TTS means grounding technical decisions in real user behavior and real business outcomes.
The technology is advancing quickly — but the teams that win will be the ones that design for how voice is actually experienced.