How real-time transcription creates a competitive advantage in fintech

Published on 9 July, 2025

Fintech is evolving fast. Gone are the days of clunky logins and endless passwords. Today, users expect seamless account access, minimal friction and one-click payments.

Clients also require high-quality, responsive customer experience. Money is a deeply personal, sensitive issue for consumers. Even the slightest issue must be resolved quickly, efficiently, and respectfully.

Consumers today can connect, sync their data and make payments to anyone in a matter of seconds, and they expect support and issue resolution to be just as quick. Long call waiting times and slow responses are therefore out of the question for modern financial service providers.

Read on to see how one fintech company uses Gladia’s AI transcription API to deliver an exceptional customer experience, fast.

Challenge

Because finances are so important to consumers, fintech companies often field a high volume of support and service calls. In this case, we’re talking millions of hours per year!

Speech-to-text tools are therefore critical. This fintech company needs to transcribe and translate all of these calls automatically, both in real time and after the fact, so that it can react instantly to issues as they arise, triage the highest-priority cases, and use call data for training and product improvements.

Regardless of volume and language, speech transcription has to deliver on all fronts to do its job: latency has to be low and accuracy high. Users shouldn’t have to trade one for the other.

Another, very specific issue for this company is speech cadence. Bots tend to speak at varying speeds, with numbers being pronounced in differing rhythms. For example, “one hundred thousand” may be transcribed as “one hundred” and “thousand” separately, which leads to transcription errors.

Objectives

The goal was a bespoke transcription service that was both fast and highly accurate, especially with numbers. Given the company’s product and customers, a wrong number could send a payment to the wrong account or leave an amount off by a wide margin. Accurate numbers were paramount.

But they also couldn’t afford to keep customers waiting. Users were calling into financial institutions for quick resolution, and in many cases debt collections relied on this fast service. Real-time latency quickly became the second main objective as they evaluated STT providers.  

Solution

The company compared Gladia to Google’s Speech-to-Text API and OpenAI’s Whisper. Straight out of the box, they were impressed by Gladia’s combination of speed and accuracy, and decided to move ahead. 

Gladia’s CTO, Jonathan Soto, set out to optimize the model using a ‘dual-track’ approach.

“It was a very interesting game to play, and the findings we had during the testing phase have since been implemented in our pre-processing for all Gladia users to enjoy,” says Jonathan.

There are three ways to improve the accuracy of a speech-to-text model: 

  1. Pre-processing: Pre-processing ensures the model receives the highest quality audio possible. Developers can work with things like noise reduction and speech enhancement to optimize the quality of audio inputs.
  2. Model optimization: Optimizing the actual model involves adjustments in model training and deployment. This includes architecture tuning with transformers, and making sure the model’s training data is diverse and of high quality.
  3. Post-processing: Post-processing corrects and refines the model’s output after transcription. This reduces errors and improves readability with techniques like regular expressions and custom vocabulary.

Jonathan and his team worked on two of these streams, pre-processing and post-processing, to achieve the highest possible accuracy on numbers while also solving for the stark contrast in speech cadence between different bots.

In post-processing, Jonathan’s team used regular expressions (regex) to match specific text patterns and increase the overall level of accuracy.
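As a rough illustration of this kind of regex post-processing, the Python sketch below rejoins a spoken number that a pause has split across a sentence boundary. It is not Gladia’s actual rule set: the `NUMBER_WORDS` vocabulary, the `merge_split_numbers` function, and the assumption that the split shows up as stray punctuation before a scale word are all hypothetical.

```python
import re

# Hypothetical vocabulary of English number words (assumption for this sketch).
NUMBER_WORDS = (
    "zero|one|two|three|four|five|six|seven|eight|nine|ten|"
    "eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|"
    "twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|"
    "hundred|thousand|million|billion"
)
SCALE_WORDS = "hundred|thousand|million|billion"

# A number word, stray punctuation left by a pause, then a scale word.
SPLIT_NUMBER = re.compile(
    rf"\b({NUMBER_WORDS})[.,;]\s+({SCALE_WORDS})\b",
    re.IGNORECASE,
)


def merge_split_numbers(text: str) -> str:
    """Rejoin number words that were split by stray punctuation."""
    previous = None
    # Repeat until nothing changes, so chains such as
    # "one. Hundred. Thousand" are merged completely.
    while previous != text:
        previous = text
        text = SPLIT_NUMBER.sub(
            lambda m: f"{m.group(1)} {m.group(2).lower()}", text
        )
    return text


print(merge_split_numbers("The balance is one hundred. Thousand dollars."))
# The balance is one hundred thousand dollars.
```

A rule like this is a heuristic and can over-merge in ordinary sentences (for example “I only need one. Thousand-dollar fees are rare.”), so in practice it would be restricted to contexts where an amount is expected.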

Pre-processing is where things get trickier. A speech-to-text model is split into three parts:

  1. Voice activity detection (VAD) identifies the parts of an audio signal that contain human speech.
  2. Transcription converts the detected speech into text. 
  3. Alignment maps the text to the corresponding timestamps to synchronize the transcription with the audio.

Jonathan’s pre-processing work focused on VAD. More specifically, he focused on silences. When speech is interrupted with silences, transcription models can miss chunks and entire utterances that come towards the end of a passage. This results in severe error rates on numbers. 

To solve this, Jonathan used autoregressive prediction to predict whether someone is speaking or not in an audio stream. 

Autoregressive prediction bases the speech or non-speech status of the audio on previously processed frames (a frame typically lasts between 10 and 50 milliseconds). This creates a chain-like structure where a certain token only gets predicted and transcribed if another specific token has appeared previously.

In short, it creates highly accurate transcriptions.
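To make the idea of frame-by-frame dependence concrete, here is a toy Python sketch of an autoregressive speech/non-speech decision. It is a hand-written stand-in, not Gladia’s model: a production VAD would use a learned network, and the `autoregressive_vad` function, the per-frame energy input, and every threshold below are assumptions chosen for illustration. The point it shows is that each frame’s score depends on the previous frame’s score, so a short silence inside a number does not end the speech segment.

```python
from typing import List


def autoregressive_vad(
    energies: List[float],
    energy_threshold: float = 0.5,   # assumed: energy above this counts as speech evidence
    history_weight: float = 0.8,     # assumed: how much of the previous score carries over
    decision_threshold: float = 0.5, # assumed: score needed to label a frame as speech
) -> List[bool]:
    """Label each frame as speech (True) or non-speech (False).

    Each frame's score is the larger of the current evidence and a decayed
    copy of the previous frame's score, so decisions depend on history.
    """
    decisions: List[bool] = []
    previous_score = 0.0
    for energy in energies:
        evidence = 1.0 if energy >= energy_threshold else 0.0
        # Autoregressive step: the previous output feeds into the current one.
        score = max(evidence, history_weight * previous_score)
        decisions.append(score >= decision_threshold)
        previous_score = score
    return decisions


# One value per frame; the two quiet frames in the middle are a short pause.
frames = [0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
print(autoregressive_vad(frames))
# [False, True, True, True, True, True, True, True, True, False, False]
```

A memoryless detector would mark the two quiet frames in the middle as non-speech and cut the utterance in two. Here they stay inside the speech segment, and the segment only closes after several consecutive quiet frames.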

Impact

For this particular fintech company, the results were remarkable. By partnering with Gladia, they were able to:

  • Maintain 98.5% numerical accuracy across all transcribed calls. In other words, almost every account number and financial figure is captured correctly.
  • Achieve a latency of under 300 milliseconds for all their real-time transcription.
  • Run up to 800 concurrent sessions.

Put together, these figures amount to very responsive and highly effective customer service and support. And in consumer finance, that high-quality service is a true competitive advantage. 
