Building better voice agents: Lessons from Thoughtly × Gladia's webinar
Published on Oct 22, 2025
Voice AI has evolved fast — from early experiments that barely handled a “hello” to today’s real-time conversational agents running across industries. Alex Casella (CTO at Thoughtly) sat down with Gladia’s CEO Jean-Louis Quéguiner to unpack the technical and operational realities of building production-grade voice agents.
The discussion went deep into latency, orchestration, evals, infrastructure design, and the business maturity of the agentic AI market. Here’s what every builder and buyer of voice AI can learn from it.
About Thoughtly
Thoughtly is a voice AI platform that helps businesses deploy human-like AI sales agents. Using a no-code, drag-and-drop UI, Thoughtly unlocks production-grade, pre-skilled, conversational voice agents for enterprise use cases — from concept to launch in under 20 minutes.
Alex leads product development and engineering at Thoughtly across core application, automations, integrations, platform infrastructure, and ML Ops.
The rise of voice AI: from experimentation to adoption
When Thoughtly was founded in 2023, the voice AI ecosystem was still young. Back then, Alex recalls, “people were mostly experimenting — they came to us to see what AI could do, not to deploy it in production.”
That has changed. As LLM, speech-to-text (STT), and text-to-speech (TTS) quality improved — thanks to advances from OpenAI, Google, and specialized providers — experimentation gave way to trust. Enterprises now expect agents that can handle entire customer interactions end-to-end.
Thoughtly’s own growth reflects this maturity. The company has moved from serving SMBs to powering enterprise-grade, omni-channel AI agents across phone, SMS, and email — handling millions of calls with 3× higher conversions and up to 15× ROI compared to manual outreach.
Latency: the invisible UX metric that defines realism
“Latency is how fast an agent responds after the caller finishes speaking,” Alex explains. Humans notice even a one-second delay. That’s why latency is one of the hardest technical challenges in real-time voice AI.
Every conversation turn crosses multiple hops (STT → LLM → TTS → VoIP) and each layer adds milliseconds. Cutting delay means optimizing all of them, often simultaneously.
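To see where the milliseconds go, it helps to instrument every hop. Here's a minimal Python sketch of per-stage timing for a single turn; the stage durations below are simulated stand-ins, not real vendor numbers:

```python
import time
from contextlib import contextmanager

turn_timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of one pipeline hop, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        turn_timings[stage] = (time.perf_counter() - start) * 1000

# Simulated hops; in production each block would wrap the real vendor call.
with timed("stt"):
    time.sleep(0.12)  # streaming transcription finalizes
with timed("llm"):
    time.sleep(0.35)  # model generates a short reply
with timed("tts"):
    time.sleep(0.15)  # time to first synthesized audio byte

print(turn_timings, f"total={sum(turn_timings.values()):.0f} ms")
```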
Gladia’s Jean-Louis adds: “For the end user, latency isn’t about raw numbers; it’s the difference between a natural conversation and a robotic one.”
To stay fast, teams experiment with speculative generation, vendor multiplexing (sending requests to several models and using the fastest acceptable result), and caching.
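As an illustration of the multiplexing idea, here's a minimal asyncio sketch that races several hypothetical vendors and keeps the first response, cancelling the rest. A production version would also validate that the fastest result is actually acceptable before using it:

```python
import asyncio

async def query_vendor(name: str, delay_s: float, text: str) -> str:
    """Stand-in for a real vendor call; the delay simulates network + inference."""
    await asyncio.sleep(delay_s)
    return f"{name}: response to {text!r}"

async def multiplex(text: str, timeout_s: float = 1.0) -> str:
    # Fan the same request out to several vendors and keep whichever
    # finishes first; cancel the stragglers to avoid paying for them.
    tasks = [
        asyncio.create_task(query_vendor("vendor_a", 0.40, text)),
        asyncio.create_task(query_vendor("vendor_b", 0.15, text)),
        asyncio.create_task(query_vendor("vendor_c", 0.25, text)),
    ]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return next(iter(done)).result()

print(asyncio.run(multiplex("hello caller")))  # vendor_b wins the race
```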
Thoughtly has also used vector-based similarity lookups to skip LLM calls entirely when a conversation turn is deterministic — a clever, mathematical way to shave off latency without sacrificing realism.
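Alex didn't share implementation details, but a similarity-based shortcut could look something like the sketch below, where embed() stands in for a real embedding model and the cosine threshold is a tunable assumption:

```python
import numpy as np

# Cache of (embedding, canned_reply) pairs for deterministic turns,
# e.g. "what are your opening hours?" -> a fixed, pre-approved answer.
cache: list[tuple[np.ndarray, str]] = []

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def lookup(utterance: str, threshold: float = 0.92) -> str | None:
    """Return a cached reply if a near-duplicate utterance exists."""
    q = embed(utterance)
    for vec, reply in cache:
        if float(np.dot(q, vec)) >= threshold:  # cosine sim of unit vectors
            return reply  # cache hit: skip the LLM for this turn
    return None  # cache miss: fall through to the LLM

cache.append((embed("what are your opening hours"), "We're open 9am to 6pm."))
print(lookup("what are your opening hours"))  # hit, no LLM round-trip
print(lookup("tell me about pricing"))        # None -> call the LLM
```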
Evals: the unsung foundation of product velocity
As the stack evolves, staying current isn’t enough — you need evaluation frameworks that measure whether changes actually improve outcomes. “There’s no standard eval in voice AI,” Alex notes. “Every company has to build its own.”
Thoughtly began by replaying real conversations from a golden dataset to benchmark STT, LLM, and TTS combinations, using cosine similarity for regression detection. Over time, internal evals became central to how the team validates new vendors and models.
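The exact pipeline is Thoughtly's own, but the core regression check is simple to sketch: embed the golden reference output and the replayed candidate output for each turn, and flag any turn whose cosine similarity falls below a floor. The floor value and toy data here are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def regression_check(golden: list[np.ndarray],
                     candidate: list[np.ndarray],
                     floor: float = 0.90) -> list[int]:
    """Flag conversation turns whose candidate output drifted from the
    golden reference (both represented as text embeddings)."""
    return [i for i, (g, c) in enumerate(zip(golden, candidate))
            if cosine(g, c) < floor]

# Toy embeddings standing in for embedded transcripts of each turn.
rng = np.random.default_rng(0)
golden = [rng.standard_normal(8) for _ in range(3)]
candidate = [golden[0], golden[1] * 2.0, -golden[2]]  # turn 2 drifted badly
print(regression_check(golden, candidate))  # -> [2]: review before shipping
```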
Jean-Louis shared how startups like Coval are now building eval tooling for the wider ecosystem — a sign that the voice AI industry is maturing from artisanal to industrial.
Scaling infrastructure without over-engineering
Voice AI workloads are spiky. Outbound call surges can be predicted; inbound support peaks often cannot. That’s why, says Alex, scaling is about anticipation and proportionality:
“Don’t build too far ahead of where you are today — but don’t be caught unprepared either.”
Thoughtly uses regionalized infrastructure, smart caching (especially around TTS), and WebSocket communication to manage scale. Jean-Louis added that over-provisioning may look safe, but it can quickly become expensive: “GPUs take 10–20 minutes to boot. Over-provisioning hides that delay — but at a 20–40 percent infra cost increase.”
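TTS is a natural caching target because agents repeat themselves: greetings, disclaimers, and confirmations recur verbatim across calls. A minimal in-memory version could look like the sketch below (a real deployment would more likely use Redis with a TTL); synthesize here is a placeholder for any real TTS client:

```python
import hashlib
from typing import Callable

_tts_cache: dict[str, bytes] = {}

def _key(voice_id: str, text: str) -> str:
    return hashlib.sha256(f"{voice_id}\x00{text}".encode()).hexdigest()

def synthesize_cached(voice_id: str, text: str,
                      synthesize: Callable[[str, str], bytes]) -> bytes:
    """Return cached audio for a (voice, text) pair; synthesize on a miss."""
    key = _key(voice_id, text)
    if key not in _tts_cache:
        _tts_cache[key] = synthesize(voice_id, text)  # only pay for TTS once
    return _tts_cache[key]

# Usage with a stand-in synthesizer:
fake_tts = lambda voice, text: f"<audio:{voice}:{text}>".encode()
first = synthesize_cached("nova", "Thanks for calling!", fake_tts)   # miss
second = synthesize_cached("nova", "Thanks for calling!", fake_tts)  # hit
assert first == second
```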
The real answer, both agree, lies in forecasting, proactive monitoring, and smart orchestration across vendors.
Vendor selection and orchestration strategy
Building a voice agent means orchestrating multiple moving parts: STT, LLM, TTS, VoIP, plus CRM and API integrations. Every millisecond and every API call counts.
“There are a lot of vendors in the market, and they’re not all equal,” says Alex. “Having great evals and benchmarking saves you from wasting months chasing the wrong tool.”
Thoughtly was impressed by Gladia’s speech-to-text capabilities, particularly its industry-leading multilingual coverage (100+ languages), which is essential for global deployments.
For TTS, they highlighted Cartesia and several open-source models listed on TTS Arena as fast-moving alternatives to ElevenLabs.
Compliance, security, and trust at scale
As enterprise adoption grows, so do expectations around HIPAA, SOC 2, and data segregation. Compliance itself is evolving into a modular service layer — with tools like Delve, Vanta, or Drata helping startups certify faster.
Jean-Louis cautioned that infrastructure sprawl can make compliance painful later: “Keep your system lean. The more vendors you stack, the harder your audit becomes.”
The human factor: culture, hiring, and iteration
Technical excellence isn’t just code. Both speakers stressed culture. Jean-Louis compared product iteration to going to the gym:
“If nothing’s breaking, you’re not going fast enough. If everything’s breaking, you’re going too fast.”
Finding that balance requires engineers who can build quickly and enforce discipline. Alex added: “We invest in strong monitoring — like wearing an Apple Watch for your infrastructure — so we can push hard without burning out the system.”
Build vs. buy: why self-hosting no longer gives you an edge
Alex Casella and Jean-Louis Quéguiner unpacked one of the biggest strategic questions for any voice AI builder: Should you self-host your models, or rely on specialized vendors?
A year ago, Alex explained, running models on-prem was often necessary:
“This time last year, a speech-to-text model typically took 400–700 ms to respond. Hosting it yourself made sense if you needed to stay under one second.”
But today, that equation has flipped:
“Now benchmarks are 50–150 ms. At that point, the benefit of self-hosting disappears — especially when you factor in infra cost, server hops, and cold-start delays.”
Jean-Louis reinforced the hidden costs: GPUs can take 10–20 minutes to spin up.
Over-provisioning to avoid cold starts inflates infra bills by 20 to 40% — and distracts teams from what actually matters: building great customer experiences.
The takeaway?
Focus engineering time on orchestration and product logic, not model plumbing. Modern STT and TTS APIs now deliver production-grade performance without the operational drag of managing GPUs, scaling clusters, or re-benchmarking every update.
As Alex put it, “We’ve reached the point where the market moves faster than any one startup can keep up by hosting its own models.”
Building for the long game: product-market fit and verticalization
Asked what advice he’d give to new startups in voice AI, Alex was clear:
“Focus on delivering real value, not just demos. Pick a vertical and go deep.”
Jean-Louis agreed: the voice AI opportunity is enormous — from customer support to healthcare and logistics — but it’s too broad for one vendor to own. The winners will be those who find a product-market fit in specialization, integrations, and execution speed.
Key takeaways:
Latency defines realism. Optimize every hop (STT → LLM → TTS → VoIP) and benchmark constantly.
Evals are your compass. Build custom evaluation frameworks early to track regressions and pick the right models.
Scale smart, not big. Use caching, regional infra, and forecasting; avoid over-engineering or over-provisioning.
Choose vendors deliberately. Benchmark everything — quality, speed, and multilingual coverage matter more than brand names.
Compliance isn’t optional. SOC 2/HIPAA readiness is a must for enterprise trust.
Culture drives velocity. Hire for both scrappiness and rigor; monitor performance like you would your health metrics.
Invest in orchestration and product logic, not GPU management. Modern STT and TTS APIs deliver production-grade performance out of the box, freeing voice agent providers to focus on building the best customer UX.
Find your niche. Voice AI is vast; defensibility comes from solving one problem extremely well.
Power your voice agent with Gladia
With low-latency, high-accuracy speech-to-text, Gladia lets you ground every customer interaction in clean, contextual, and production-grade input.
Ready to build voice agents your customers can trust? Book a demo with us.