And while our API is certainly user-friendly, there are a few steps you can take during setup to get the best results in production.
This guide covers the most important things to understand as you start using Gladia. From choosing between real-time and asynchronous APIs, to tuning for accuracy, handling multilingual conversations, and enabling advanced features like speaker diarization or live stream control, it’s all here in this article.
We’ve also included tips for measuring transcription accuracy and validating performance, so you can build with confidence and ensure the API is delivering on your goals.
Real-time vs asynchronous APIs
We’ll talk throughout this guide about Gladia’s API. But technically, there are two APIs to keep in mind:
- Real-time: Transcribes audio as it's happening, with very low latency. This is ideal for live customer support or contact center calls where real-time transcription can assist agents on the fly.
- Asynchronous: Transcribes pre-recorded audio or video files. This is best for very large files, meta-analysis, quality assurance, compliance checks, and coaching.
You’ll be able to shift seamlessly between both APIs as you use Gladia. In this guide, we’ll link to the relevant documentation and point out which API each feature applies to.
Now, let’s look at how our STT APIs help achieve several key aims for voice agent tools and CCaaS providers.
1. How to enhance STT transcript accuracy
Gladia’s API includes several smart ways to improve speech-to-text transcript accuracy. Automated transcripts have become markedly more accurate over the years, but some common issues remain that our API helps overcome.
Particularly in noisy call center environments, or in dealing with non-native speakers, these easy-to-manage additions can make all the difference.
1.1 - Reduce background noise with speech threshold
Customer interactions often include unwanted background noise on the caller’s end. This could be traffic sounds, office noise, and other soundscapes or disturbances.
To reduce these, you can set the speech_threshold parameter. It filters out nonessential audio so the transcription can focus on the speakers.
It’s also helpful for poor-quality audio (below 8 kHz) and will typically deliver far more accurate transcriptions. We recommend setting it to 0.85 for very low-quality audio.
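As a quick sketch, the threshold could sit in your session configuration like this. The `speech_threshold` name and the 0.85 value come from this guide; the surrounding payload shape and nesting are assumptions, so verify them against the linked documentation:

```python
# Illustrative real-time session configuration. Only `speech_threshold`
# and the 0.85 recommendation come from this guide; the nesting under
# "pre_processing" and the other field names are assumptions.
realtime_config = {
    "encoding": "wav/pcm",   # assumed audio-format field
    "sample_rate": 8000,     # low-quality telephony audio
    "pre_processing": {
        "speech_threshold": 0.85,  # recommended for very low-quality audio
    },
}
```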
Available specifically for the real-time API. Read the documentation.
1.2 - Create custom vocabulary for key terms and phrases
Automated transcription often misunderstands important industry terms, acronyms, and other jargon. Brand names with unique spelling are a particular issue. And when they occur over and over, your transcripts end up littered with errors.
For words or phrases that recur often in your audio file, use the custom_vocabulary feature in the transcription configuration settings. This lets you build a glossary of terms you want the transcription tool to recognize every time.
The more technical or unusual the terms your customers use regularly, the more valuable this feature becomes.
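For example, a support platform might register its recurring brand names and acronyms like this. The `custom_vocabulary` field name comes from this guide; whether it takes a flat list or a richer object is an assumption to verify in the custom vocabulary documentation:

```python
# Illustrative glossary of terms the model should recognize every time.
# The `custom_vocabulary` name comes from this guide; the exact payload
# shape is an assumed example.
transcription_config = {
    "custom_vocabulary": [
        "Gladia",       # brand name with non-obvious spelling
        "CCaaS",        # industry acronym
        "diarization",  # technical term that recurs in your audio
    ],
}
```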
Learn more (including the key issues it solves) in this guide to custom vocabulary.
Available for both real-time and asynchronous APIs
1.3 - Remove initial audio disturbances
This isn’t a specific feature, but rather a recommendation. In some cases, background noises such as telephone tones or ambient sounds at the beginning of an audio file can interfere with Gladia’s audio normalization process and reduce transcription accuracy.
Even if the rest of the audio is clear and high quality, the initial disturbance can disrupt the entire transcription.
If these noises are particularly loud, trim them from the file before processing or, where possible, start the recording only after the noise has subsided. This applies mainly to asynchronous users, whose pre-recorded audio or video can be edited.
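If you control the files before upload, trimming is straightforward. Here is a minimal sketch using only Python’s standard library, for uncompressed WAV files (compressed formats would need a tool such as ffmpeg):

```python
import wave

def trim_leading_audio(src_path: str, dst_path: str, seconds: float) -> None:
    """Drop the first `seconds` of a WAV file (e.g. a telephone tone)."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        n_skip = min(int(seconds * src.getframerate()), src.getnframes())
        src.setpos(n_skip)                     # jump past the disturbance
        remaining = src.readframes(src.getnframes() - n_skip)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)                  # frame count is fixed up on close
        dst.writeframes(remaining)
```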
2. How to optimize for diverse languages
Accurately transcribing a wide range of languages, dialects, and mixed speech remains a major challenge for STT tools. Many models still perform best with dominant global languages like English or Spanish, and struggle with regional accents, code-switching, or underrepresented languages. This can lead to incomplete or incorrect transcripts, especially in multilingual conversations or markets.
For global products and platforms, that’s more than a technical glitch—it’s a barrier to access, usability, and trust. To serve diverse users effectively, STT systems need to handle the way people actually speak: with linguistic variety, cultural nuance, and fluid transitions between languages.
That’s why we have several features specifically designed to improve transcription performance across a broader spectrum of real-world speech.
2.1 - Handle language changes and code-switching with precision
In real-world conversations, people don’t always stick to one formal language. A customer may start in English, drop in a local phrase, or respond to an agent in their preferred language mid-call.
This “code-switching” is natural, especially in global markets, but it can confuse traditional STT systems.
Gladia’s API is designed to handle these shifts gracefully. You can either:
- Let the model automatically detect the spoken language(s), or
- Help guide it by specifying one or more expected languages using the languages parameter.
If you're expecting multiple languages in a single conversation, setting the code_switching parameter to true allows the model to dynamically adjust to each language in real time.
Here’s how different configurations work:
Case 1: code_switching = true with languages specified
The system will switch between only the specified languages (e.g., English, Spanish, French), improving accuracy and speed by narrowing the possibilities.
Case 2: code_switching = false with languages specified
Use this for monolingual audio when the language isn't known ahead of time. The system will detect the language at the start (based on your list) and stick with it throughout.
Case 3: code_switching = true with no languages specified
This is the most flexible configuration. The system will detect and switch between any of Gladia’s supported languages. That’s ideal when language variety is high, but keep in mind accuracy may be slightly lower without a defined list.
Case 4: code_switching = false with no languages specified
The system picks the first detected language and assumes it's consistent. Good for quick transcription when you expect clean, single-language audio.
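Case 1, for instance, might look like this in a transcription configuration. The parameter names come from this guide; the ISO language codes and the payload shape are assumptions:

```python
# Case 1: code_switching enabled with a narrowed language list, so the
# model only switches among the languages you actually expect.
language_config = {
    "languages": ["en", "es", "fr"],  # expected languages (ISO codes assumed)
    "code_switching": True,           # allow mid-conversation switches
}
```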
This flexibility lets product teams serve multilingual users more effectively without needing to segment or pre-process the audio. It’s highly valuable in transcribing international support calls, user-generated content, or mixed-language meetings.
Available for both real-time and asynchronous APIs
2.2 - Translate your transcription automatically
Once your audio is transcribed, Gladia’s Translation feature can generate real-time or asynchronous translations into any of your preferred languages.
This enables multilingual content, interfaces, and analytics without separate tools or pipelines. Whether you're building global customer support platforms or localized transcription services, it helps teams reach users across language barriers effortlessly.
Available for both real-time and asynchronous APIs
2.3 - Make translations smarter with tone and context
Translation isn’t just about switching words. It’s about delivering the right meaning in the right tone. To help, Gladia has two advanced parameters:
- Context: Guide the model with cues about audience, tone, or domain. For example: “Use a polite tone in Brazilian Portuguese” or “Use technical language for IT professionals.”
- Informal: Choose between formal or informal tone in languages that require it (like French, German, or Spanish). This improves user trust and clarity—especially in customer-facing content.
Here’s some sample code to test in your product:
```json
"translation_config": {
  "target_languages": ["pt"],
  "context_adaptation": true,
  "context": "Use European Portuguese (Portugal), not Brazilian Portuguese",
  "informal": true
}
```
These controls help product teams deliver translations that feel natural and on-brand, improving user experience and engagement across languages.
3. Other advanced capabilities
Accurate transcription is the most pressing need for most users, including where multiple languages are involved. But production-ready STT systems often require additional features to support real-world complexity.
Here are a few useful capabilities in Gladia’s APIs that go beyond the mere basics.
3.1 - Speaker diarization for multi-speaker clarity
Speaker diarization separates and labels different voices in a recording—vital for meetings, interviews, or customer service calls with more than one person. This lets you attribute quotes, analyze speaker behavior, or structure the transcript more clearly.
Gladia supports two approaches:
- Channel-based diarization: If your audio has separate channels for each speaker (common in stereo call recordings or video conferencing tools), Gladia can transcribe each independently. This enables basic speaker separation without additional processing.
- Automatic speaker diarization: For mono or mixed-channel recordings, Gladia also supports fully automated diarization using a model that identifies and labels speakers dynamically. Just activate this via a feature flag in your API request.
Available specifically for the asynchronous API. Read the documentation.
3.2 - Live message stream control
Using Gladia’s Real-Time API, you’re not just sending audio. You’re receiving a stream of live metadata that can power your application logic in real time.
You can configure which messages to receive via WebSocket, webhook, or callback, including:
- speech_start / speech_end – Detect the start or end of a speaker’s utterance for triggering voice agent actions or UI changes.
- transcript_result – Capture and display transcripts incrementally as the conversation unfolds.
- Post-session summaries or analytics via callback after the WebSocket closes.
This is especially useful for voice agents, live captioning, or any system where fast, incremental feedback matters.
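A minimal sketch of consuming these messages might look like the handler below. The message type names come from the list above; the payload shape (a `type` field, plus a `text` field on transcript messages) is an assumption:

```python
# Route live messages to application logic. Only the message type names
# come from this guide; the dict shape is an assumed example.
def handle_message(message: dict, state: dict) -> None:
    msg_type = message.get("type")
    if msg_type == "speech_start":
        state["speaking"] = True                  # e.g. pause the voice agent
    elif msg_type == "speech_end":
        state["speaking"] = False                 # e.g. let the agent reply
    elif msg_type == "transcript_result":
        # append each incremental transcript as the conversation unfolds
        state.setdefault("lines", []).append(message.get("text", ""))
```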
Available specifically for real-time API users.
3.3 - Complete feature reference
Looking for something more specific like timestamps, language detection, or profanity filtering? Gladia’s APIs offer a wide range of features tailored for different needs and industries.
Explore the full set here:
4. Best practices to evaluate transcription accuracy
Accurately measuring the performance of a speech-to-text (STT) system is more complex than it looks. While metrics like Word Error Rate (WER) are standard, the quality of your evaluation depends heavily on the process you follow and the data you compare against.
If you're planning to benchmark accuracy yourself, here are some best practices to ensure meaningful and fair results:
- Start with reliable Ground Truth: Your reference transcription must be fully aligned with the original audio, capturing not just the words, but also punctuation, speaker turns, and any meaningful nonverbal sounds. Even small errors in the reference will skew your results.
- Normalize filler words: Unless filler words (“um,” “uh,” “like,” “you know”) are critical to your use case (e.g. analyzing hesitations in speech coaching), exclude them from accuracy calculations. They’re common in speech, but not always relevant for measuring comprehension.
- Account for equivalent variations: Word variants that don’t change meaning—like “okay” vs. “ok” or “because” vs. “’cause”—can appear as errors in a raw WER calculation. Use normalization rules or custom mappings to avoid penalizing the system unfairly.
- Use consistent evaluation criteria: Whether you're comparing multiple STT systems or running long-term benchmarks, ensure consistency in how you tokenize, align, and score transcripts. Inconsistent criteria make comparisons unreliable.
- Consider use-case relevance, not just raw WER: A slightly lower WER might still produce a better customer experience if the system captures key business terms more reliably (e.g. brand names, product features, legal phrases). Use domain-specific test cases whenever possible.
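To make the normalization points concrete, here is a small self-contained sketch of a WER calculation that strips single-word fillers and maps equivalent variants before scoring. The filler and equivalence lists are illustrative; multi-word fillers like “you know” and ambiguous words like “like” need context-aware handling:

```python
import re

FILLERS = {"um", "uh"}                             # single-word fillers to drop
EQUIVALENTS = {"ok": "okay", "'cause": "because"}  # variants treated as matches

def normalize(text: str) -> list[str]:
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [EQUIVALENTS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in FILLERS]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = list(range(len(hyp) + 1))                  # DP row of edit distances
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,       # deletion
                                   d[j - 1] + 1,   # insertion
                                   prev + (r != h))  # substitution or match
    return d[len(hyp)] / max(len(ref), 1)
```

With this normalization, “ok um the order shipped” scores a perfect match against the reference “okay the order shipped”, instead of being penalized twice for differences a reader would never notice.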
Gladia can also support you in designing or validating your accuracy testing process. WER is just one lens—but to get meaningful results, it requires a methodical, granular approach and trusted reference data.
Ensure accurate, fast, and reliable speech-to-text
Whether you’re running a proof of concept or deploying to production at scale, the Gladia API is fast, accurate, and flexible. You just have to push a few of the right buttons to get the most out of it.
Whether you're optimizing for real-time feedback, multilingual coverage, or detailed conversation insights, the features are here to support you.
As you continue testing or integrating Gladia into your product, don’t hesitate to go beyond the basics. Fine-tuning parameters, leveraging advanced features, and evaluating accuracy with care can make a real difference in quality, performance, and customer experience.
Need help along the way? Our team is here to support your success.