TL;DR:
- Gladia's /v2/pre-recorded async endpoint consolidates what most teams build across multiple vendors into a single POST request, handling transcription, diarization, NER, and summarization in one call.
- Use webhooks for production delivery; polling with exponential backoff serves as the fault-tolerant fallback.
- All audio intelligence features are included in the base hourly rate on Starter and Growth plans, no per-feature add-on charges.
- Solaria-1 supports 100+ languages with mid-conversation code-switching; diarization is powered by pyannoteAI Precision-2 and is available in async workflows only.
Most engineering teams obsess over LLM prompt design for their meeting note-takers while the transcription layer quietly corrupts every downstream output. A hallucinated speaker attribution silently populates a wrong name in your CRM. A missed entity produces a misleading coaching score. By the time anyone catches it, the damage is already two systems deep.
This guide walks through exactly how to integrate our async transcription API into a meeting note-taker: authentication, request structure, webhook configuration, fault-tolerant polling, error handling, and the production deployment decisions that determine whether the system holds up at scale. For a broader architectural overview before diving into code, the complete AI note-taker guide is worth reading first.
Configuring your Gladia API access
Get authentication and environment configuration correct before writing integration code. Debugging transcription accuracy on top of a misconfigured client wastes hours.
Retrieve your Gladia API key
Sign up at app.gladia.io, navigate to Home, and select "Generate new API key." The dashboard lets you generate and rotate keys without a support ticket. Every request passes this key in the x-gladia-key header, not as a query parameter or in the request body.
Store the key in an environment variable from day one. Hardcoding it into application code is the fastest path to a credential rotation incident you didn't plan for.
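A minimal sketch of loading the key at startup (the variable name GLADIA_API_KEY is a convention used in the examples below, not something the API requires):

```python
import os

# GLADIA_API_KEY is just a naming convention for these examples; use whatever
# your secret manager or deployment platform exposes.
GLADIA_API_KEY = os.environ.get("GLADIA_API_KEY")

if not GLADIA_API_KEY:
    # Fail fast at startup rather than on the first API call.
    raise RuntimeError("GLADIA_API_KEY is not set")
```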
Handling API throttling errors
The concurrency and rate limits documentation covers your tier's concurrent job ceiling. The HTTP 429 handler below gives you a retry wrapper you can drop into any client:
```python
import requests
import time

def transcribe_with_retry(audio_url, api_key, max_retries=3):
    headers = {'x-gladia-key': api_key, 'Content-Type': 'application/json'}
    payload = {'audio_url': audio_url, 'diarization': True}

    for attempt in range(max_retries):
        try:
            response = requests.post(
                'https://api.gladia.io/v2/pre-recorded',
                json=payload,
                headers=headers
            )
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                wait_time = 2 ** attempt
                print(f"Rate limited. Waiting {wait_time}s before retry...")
                time.sleep(wait_time)
                continue
            else:
                response.raise_for_status()
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

    raise Exception("Max retries exceeded")
```
Source: Gladia API reference - pre-recorded endpoint
Initialize your Gladia dev environment
Confirm these three prerequisites before submitting your first job:
- API key stored in environment variables, not in application code.
- Webhook endpoint registered and reachable from the public internet (or ngrok for local development).
- Test audio file that reflects real production conditions: multiple speakers, background noise, accented speech, and mid-conversation language switching if your users require it.
For teams building on orchestration pipelines, we provide a native integration with Pipecat that removes boilerplate from submission and retrieval loops.
Submitting audio for async transcription
Our /v2/pre-recorded endpoint is the primary async workflow. You POST a request with your audio source and feature configuration, receive a job ID immediately, and retrieve the result later. We document every parameter in the full endpoint reference.
API audio submission: URL or file?
Our /v2/pre-recorded endpoint accepts audio URLs, not raw file bytes. If you have a local file, upload it first using POST /v2/upload with multipart/form-data, retrieve the hosted URL from that response, then pass that URL to /v2/pre-recorded.
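A sketch of that two-step flow for a local file (the multipart field name and response field below are assumptions to verify against the upload endpoint reference):

```python
import requests

def upload_local_file(path, api_key):
    # Step 1: push the local file to /v2/upload and read back the hosted URL.
    with open(path, "rb") as f:
        upload = requests.post(
            "https://api.gladia.io/v2/upload",
            headers={"x-gladia-key": api_key},
            files={"audio": f},  # multipart field name: assumption, check the docs
        )
    upload.raise_for_status()
    return upload.json()["audio_url"]  # response field name: assumption, check the docs

# Step 2: pass the returned URL to /v2/pre-recorded as usual.
```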
For production meeting note-takers, the URL approach is the right default. Meeting recordings typically land in S3, GCS, or Azure Blob Storage. Generate a pre-signed URL from your object store and pass it directly. This pattern keeps payload sizes out of the submission request entirely and eliminates one class of timeout failures on large files. For a broader comparison of URL vs. file upload trade-offs in meeting assistant architectures, the async transcription and LLM guide covers the pipeline design in depth.
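A sketch of the pre-signed URL pattern with S3 and boto3 (bucket name, object key, and expiry are placeholders):

```python
import boto3
import requests

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Generate a time-limited URL for the recording sitting in your bucket.
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "your-meeting-recordings", "Key": "2024/meeting_recording.wav"},
    ExpiresIn=3600,
)

# Submit the pre-signed URL directly; no file bytes ever touch this request.
response = requests.post(
    "https://api.gladia.io/v2/pre-recorded",
    headers={"x-gladia-key": GLADIA_API_KEY},
    json={"audio_url": presigned_url, "diarization": True},
)
job = response.json()
```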
Structuring the async API request body
Here is a complete request body that enables the full audio intelligence suite for a meeting note-taker:
```json
{
  "audio_url": "https://your-bucket.example.com/meeting_recording.wav",
  "diarization": true,
  "diarization_config": {
    "number_of_speakers": 2,
    "min_speakers": 1,
    "max_speakers": 2
  },
  "translation": true,
  "translation_config": {
    "target_languages": ["en", "es", "fr"]
  },
  "summarization": true,
  "summarization_config": {
    "type": "general"
  },
  "named_entity_recognition": true,
  "detect_language": true,
  "enable_code_switching": true,
  "webhook_url": "https://your-backend.example.com/webhooks/gladia"
}
```
Source: Gladia API reference - pre-recorded endpoint
Required parameters: audio_url only. All audio intelligence features (diarization, translation, summarization, named_entity_recognition) are optional and default to false if omitted.
We include everything in this payload, including diarization, NER, and summarization, in the base hourly rate on Starter and Growth plans. There are no per-feature add-on charges to discover after scoping. Check our pricing page for the full feature matrix across tiers.
API setup: diarization and custom lexicon
Our diarization uses pyannoteAI's Precision-2 model as the default when you set diarization: true. No explicit model parameter is required. The Precision-2 integration handles speaker boundary detection and overlap disambiguation even in mono recordings. The speaker diarization documentation and the Gladia x pyannoteAI webinar both cover how this works at the model level.
For domain-specific vocabulary (product names, internal acronyms, executive names), pass a custom_vocabulary array in the request body. This is the most impactful parameter for reducing entity-level errors in company-specific meeting content, and it maps directly to the 39% fewer entity errors we achieve on key entities compared to leading alternatives.
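A sketch of a trimmed request body with the vocabulary array added (the terms shown are placeholders for your own product names and acronyms):

```python
payload = {
    "audio_url": "https://your-bucket.example.com/meeting_recording.wav",
    "diarization": True,
    # Domain-specific terms to bias recognition toward; values are illustrative.
    "custom_vocabulary": ["Acme CloudSync", "QBR", "Solaria-1"],
}
```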
Note: we provide diarization in async workflows only. It is not available in the real-time WebSocket API.
Retrieving transcription results by job ID
A successful /v2/pre-recorded POST returns HTTP 200 immediately with a job ID and result URL:
```json
{
  "id": "45463597-20b7-4af7-b3b3-f5fb778203ab",
  "result_url": "https://api.gladia.io/v2/transcription/45463597-20b7-4af7-b3b3-f5fb778203ab",
  "request_id": "G-45463597"
}
```
Source: Gladia API reference - pre-recorded endpoint
Persist both id and result_url to your database the moment you receive this response. If webhook delivery fails later, result_url is your fallback polling target.
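A minimal sketch of that persistence step, using SQLite as a stand-in for your real job store:

```python
import sqlite3

conn = sqlite3.connect("jobs.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS transcription_jobs "
    "(id TEXT PRIMARY KEY, audio_url TEXT, result_url TEXT, status TEXT)"
)

def persist_job(job, audio_url):
    # Store id and result_url the moment the POST returns; result_url is the
    # polling fallback if webhook delivery fails later.
    conn.execute(
        "INSERT OR IGNORE INTO transcription_jobs (id, audio_url, result_url, status) "
        "VALUES (?, ?, ?, ?)",
        (job["id"], audio_url, job["result_url"], "submitted"),
    )
    conn.commit()
```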
Polling vs. webhooks for note-taker data
Both retrieval patterns are valid, but they serve different contexts. Use this decision table before wiring your architecture:
| Method | Best for | Pros | Cons |
|---|---|---|---|
| Polling | CLI tools, short clips, batch dev jobs | Simple to implement, no public endpoint needed | Wastes resources, adds latency at scale |
| Webhooks | Production note-takers, SaaS products | Real-time notification, no idle resource consumption | Requires public endpoint and security validation |
For production note-takers, webhooks are the right default, with a polling fallback as the safety net.
Configuring exponential backoff for polling
Poll the result_url with exponential backoff. Do not poll on a fixed 1-second interval: at scale, aggressive polling pushes you into rate limit territory fast.
Python:
```python
import requests
import time

def poll_transcription_result(result_url, api_key):
    headers = {'x-gladia-key': api_key}
    base_wait = 2
    max_wait = 60

    while True:
        response = requests.get(result_url, headers=headers)
        data = response.json()

        if data.get('status') == 'done':
            return data['result']
        elif data.get('status') in ['queued', 'processing']:
            print(f"Status: {data['status']}, waiting...")
            time.sleep(base_wait)
            base_wait = min(base_wait * 1.5, max_wait)
        else:
            raise Exception(f"Transcription failed: {data}")
```
Source: Gladia API reference - pre-recorded workflow
JavaScript:
```javascript
async function pollTranscriptionResult(resultUrl, apiKey) {
  let baseWait = 2000;
  const maxWait = 60000;

  while (true) {
    const response = await fetch(resultUrl, {
      headers: { 'x-gladia-key': apiKey }
    });
    const data = await response.json();

    if (data.status === 'done') {
      return data.result;
    } else if (['queued', 'processing'].includes(data.status)) {
      console.log(`Status: ${data.status}, waiting...`);
      await new Promise(r => setTimeout(r, baseWait));
      baseWait = Math.min(baseWait * 1.5, maxWait);
    } else {
      throw new Error(`Transcription failed: ${JSON.stringify(data)}`);
    }
  }
}
```
Source: Gladia API reference - pre-recorded workflow
Webhook endpoint setup and payload structure
Pass webhook_url in the initial POST body. When transcription completes, we fire a POST to that URL containing the transcription_id. Use that ID to fetch the full structured result from result_url. Here is what the completed async payload looks like:
```json
{
  "id": "45463597-20b7-4af7-b3b3-f5fb778203ab",
  "status": "done",
  "result": {
    "transcription": {
      "full_transcript": "Full transcript text...",
      "utterances": [
        {
          "speaker": 0,
          "language": "en",
          "transcript": "Hello, this is speaker one.",
          "start_time": 0.5,
          "end_time": 3.2,
          "words": [
            {"word": "Hello", "start": 0.5, "end": 1.0, "confidence": 0.99}
          ]
        },
        {
          "speaker": 1,
          "language": "en",
          "transcript": "And I'm speaker two.",
          "start_time": 3.5,
          "end_time": 5.2,
          "words": []
        }
      ]
    }
  }
}
```
Source: Gladia API reference - pre-recorded endpoint
The utterances array is your LLM input. Each utterance carries speaker, language, transcript, word-level timestamps, and per-word confidence scores. Feed this directly into your summarization or action-item extraction prompt without any intermediate parsing layer. The audio-to-LLM documentation covers how to structure this pipeline end to end.
For webhook security, validate incoming requests by extracting transcription_id from the payload and verifying it exists in your job store before triggering any downstream processing (LLM calls, CRM writes, notification events). This guards against replay attacks and stale webhook calls.
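A sketch of that check as a Flask handler, reusing the SQLite job store from earlier (the payload field name and the downstream helper are assumptions to adapt to your stack):

```python
from flask import Flask, abort, request

app = Flask(__name__)

@app.route("/webhooks/gladia", methods=["POST"])
def gladia_webhook():
    body = request.get_json(silent=True) or {}
    # Field name based on the payload shown above; verify against your webhook logs.
    transcription_id = body.get("id")

    # Only jobs we actually submitted are allowed to trigger downstream work.
    row = conn.execute(
        "SELECT result_url FROM transcription_jobs WHERE id = ?", (transcription_id,)
    ).fetchone()
    if row is None:
        abort(404)

    # Fetch the full result from result_url, then hand off to LLM/CRM processing.
    enqueue_downstream_processing(transcription_id, row[0])  # hypothetical helper
    return "", 204
```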
Designing for fault-tolerant API calls
Gladia API error response formats and retry strategy
Map these status codes in your error handler before going to production:
| Status code | What it means | Retry strategy |
|---|---|---|
| 400 | Malformed request body or invalid parameter | Do not retry; log payload and fix code |
| 401 | API key invalid or expired | Do not retry; rotate key and alert on-call |
| 413 | Request payload exceeds server limit | Do not retry; split file before resubmitting |
| 500 | Gladia internal error | Retry with exponential backoff; alert if 3+ failures |
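One way to encode that table alongside the retry wrapper from earlier is a small predicate, sketched here:

```python
def should_retry(status_code, attempt, max_retries=3):
    # 400, 401, 413: client-side problems -- retrying will not help.
    if status_code in (400, 401, 413):
        return False
    # 429 and 5xx: back off and retry up to the cap, then alert.
    if status_code == 429 or status_code >= 500:
        return attempt < max_retries - 1
    return False
```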
Ensuring API stability with idempotency
We maintain 99.9%+ uptime, which you can verify at status.gladia.io. Build idempotency into your client by storing the job ID before retrying any submission. If a network failure means you are unsure whether your POST succeeded, check your job store before submitting again, because duplicate submissions are billable.
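A sketch of that check in front of the submit path, keyed on the audio URL via the job store sketched above:

```python
def submit_once(audio_url, api_key):
    # If a job already exists for this recording, reuse it instead of resubmitting:
    # duplicate submissions are billable.
    existing = conn.execute(
        "SELECT id, result_url FROM transcription_jobs WHERE audio_url = ?", (audio_url,)
    ).fetchone()
    if existing:
        return {"id": existing[0], "result_url": existing[1]}

    job = transcribe_with_retry(audio_url, api_key)
    persist_job(job, audio_url)
    return job
```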
"The API is straightforward and well documented, Making integration into our internal tools quick and easy. The speech to text quality for meetings, support calls, and voice notes has been consistently impressive." - Faes W. on G2
Uncaught payload issues in downstream LLMs
One failure mode that surfaces late is when your downstream LLM pipeline breaks silently because an expected field is absent from the transcription payload. NER entities, summarization blocks, and translation arrays only appear when they are successfully generated. Build defensive checks into your payload parser: if named_entity_recognition is missing from the result, log it and route to a fallback rather than letting a KeyError propagate. The common mistakes guide documents this pattern and others that engineering teams encounter in the first week of production.
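A sketch of that defensive parsing, assuming the result shape shown earlier (the names of the optional blocks are assumptions based on the feature parameters; verify against your own payloads):

```python
def extract_features(result):
    transcription = result.get("transcription", {})
    features = {
        "transcript": transcription.get("full_transcript", ""),
        "utterances": transcription.get("utterances", []),
        # Optional blocks only appear when they were successfully generated.
        "entities": result.get("named_entity_recognition"),
        "summary": result.get("summarization"),
    }
    if features["entities"] is None:
        # Log and route to a fallback instead of letting a KeyError propagate downstream.
        print("NER block missing from payload; skipping entity-driven CRM updates")
    return features
```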
Operationalizing Gladia for scale and cost
Async latency and comparison with real-time
Async processing runs at approximately 60 seconds per hour of audio. Claap measured this in production across their international multilingual user base. For post-meeting note delivery, this window is entirely acceptable: your users expect summaries after the meeting ends, not during it.
| Feature | Async API | Real-time WebSocket API |
|---|---|---|
| Processing latency | ~60 sec per hour of audio | ~300ms final transcript |
| Diarization (pyannoteAI Precision-2) | Yes | Not available |
| Best for | Meeting notes, post-call analysis | Live voice agents, live captions |
| File input | URL or uploaded file | Streaming audio chunks |
| Summarization, NER | Yes | Yes (as real-time add-ons) |
| Pricing (Starter) | $0.61/hr | $0.75/hr |
Forecasting transcription costs
We bill per hour of audio duration. All audio intelligence features (diarization, NER, translation, summarization) are included in the base rate on Starter and Growth plans, with no feature add-ons to budget for separately.
- Starter: $0.61/hr async, $0.75/hr real-time. Pay-as-you-go, 10 free hours per month. Customer data can be used for model training by default on this tier.
- Growth: as low as $0.20/hr async, as low as $0.25/hr real-time. Upfront commitment reduces per-hour cost. Customer data is never used for model training, and no opt-out is required.
- Enterprise: custom pricing with fine-tuning, debundled options, and on-premises deployment.
At 1,000 hours per month, the Starter plan runs $610. At Growth rates, that same volume drops to $200.
Compliance for AI note-taker data
We hold SOC 2 Type II, ISO 27001, HIPAA, and GDPR certifications. Full details are on the Gladia compliance hub. The data training policy deserves explicit attention before you sign anything:
- Starter plan: Customer audio can be used for model training by default.
- Growth and Enterprise plans: Customer audio is never used for model training. No opt-out action is required.
If your note-taker handles regulated conversations (financial services, healthcare, legal), Growth or above is the correct default. The no-retraining commitment is built into the plan structure, not buried in an enterprise contract addendum.
Resolving common integration hurdles
Supported audio formats and file limits
We accept WAV, M4A, FLAC, and AAC. Files must be under 1000 MB and 135 minutes in duration. For meetings exceeding 135 minutes, split the recording at a natural break and submit two jobs. Apply a time offset to the second payload's timestamps equal to the duration of the first file before merging the utterance arrays.
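A sketch of that merge step, assuming the utterance shape shown earlier:

```python
def merge_split_recordings(first_utterances, second_utterances, first_duration_s):
    # Shift the second file's timestamps by the first file's duration, then concatenate.
    shifted = []
    for utt in second_utterances:
        utt = dict(utt)
        utt["start_time"] += first_duration_s
        utt["end_time"] += first_duration_s
        # Word-level timestamps need the same offset.
        utt["words"] = [
            {**w, "start": w["start"] + first_duration_s, "end": w["end"] + first_duration_s}
            for w in utt.get("words", [])
        ]
        shifted.append(utt)
    return first_utterances + shifted
```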
Production accuracy: what to expect and how to test
Our Solaria-1 model achieves on average 29% lower WER than alternatives on conversational speech, benchmarked across 8 providers, 7 datasets, and 74+ hours of audio. In production, Claap achieved 1-3% WER across a multilingual international user base, while teams running self-hosted open-source ASR models typically report over 10% WER on real-world meeting audio without the expected infrastructure savings.
Do not evaluate on clean benchmark audio. Pull samples from your actual production distribution: cross-talk, background noise, accented speakers, and the language pairs your users actually speak.
Managing Gladia API rate limits
Concurrent job limits are tier-dependent. Check the concurrency documentation for your specific ceiling. For teams processing bursts of post-meeting recordings at end-of-day or after all-hands events, architect a queue on the client side. A Redis-backed Celery queue works well: jobs land in the queue as meetings finish, and workers drain the queue within your concurrency budget without hitting a 429.
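A minimal sketch of that shape with Celery (broker URL and retry settings are placeholders; the task reuses the retry wrapper and job store sketched earlier):

```python
from celery import Celery

celery_app = Celery("notetaker", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, max_retries=5, default_retry_delay=30)
def transcribe_meeting(self, audio_url):
    try:
        job = transcribe_with_retry(audio_url, GLADIA_API_KEY)
        persist_job(job, audio_url)
        return job["id"]
    except Exception as exc:
        # Celery's backed-off retry keeps end-of-day bursts inside the concurrency budget.
        raise self.retry(exc=exc)
```

Start the workers with a concurrency value at or below your tier's concurrent-job ceiling so the queue drains without ever triggering a 429.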
Multilingual audio handling with automatic detection
Solaria-1 detects language automatically when detect_language: true is set. When enable_code_switching: true is also set, the model tracks mid-conversation language transitions across all 100+ supported languages without resetting the session. No manual language specification is required. The model covers 42 languages not supported by any other API-level STT provider, including Tagalog, Bengali, Punjabi, Tamil, and Urdu, which matters for Contact Center as a Service (CCaaS) platforms and meeting tools serving Southeast Asia, South Asia, or multilingual EU markets. The code-switching documentation covers how to configure this correctly before your multilingual test run.
Start with 10 free hours and have your integration in production in less than a day. Get your API key and push your first proof-of-concept to staging this week.
FAQs
Can I use the Gladia async API for real-time meeting transcription?
No. The /v2/pre-recorded endpoint is asynchronous and designed for post-meeting audio. For live captions or voice agent pipelines, use the real-time WebSocket API, which targets approximately 300ms final transcript latency.
Is speaker diarization included in the base hourly rate?
Yes, on Starter and Growth plans. Diarization powered by pyannoteAI Precision-2 is included in the base hourly rate with no add-on charge, and it is available in async workflows only.
What happens to my audio data on the Starter plan?
On the Starter plan, customer audio can be used for model training by default. On Growth and Enterprise plans, your audio is never used for model training and no opt-out action is required.
What are Gladia's file size and duration limits for async transcription?
Files must be under 1000 MB and 135 minutes in duration. Enterprise users can request extended limits beyond these defaults.
Does Gladia require me to specify the language in the API request?
No. When detect_language: true is set, Solaria-1 detects language automatically. Adding enable_code_switching: true enables the model to track mid-conversation language changes across all 100+ supported languages without any manual specification.
What is the all-in cost for 10,000 hours per month with diarization enabled?
On Starter at $0.61/hr, 10,000 hours costs $6,100 with diarization included at no extra charge. On Growth at as low as $0.20/hr, the same volume costs $2,000.
How does Gladia's async WER compare to self-hosted open-source ASR models on meeting audio?
Teams running self-hosted open-source ASR models typically report over 10% WER on real-world meeting audio, while Claap achieved 1-3% WER in production using our async API. Our benchmarks show on average 29% lower WER than alternatives across 7 datasets and 74+ hours of conversational audio.
Key terms glossary
WER (word error rate): The percentage of words in the transcript that differ from the reference ground truth, calculated as (substitutions + deletions + insertions) / total reference words. Lower is better.
DER (diarization error rate): The percentage of audio duration incorrectly attributed to the wrong speaker or left unlabeled. Our async benchmark shows on average 3x lower DER than alternatives.
Diarization: The process of segmenting audio by speaker identity and labeling each utterance with a speaker ID. In our API, diarization runs on pyannoteAI Precision-2 and is available in async workflows only.
Code-switching: Mid-conversation language transitions where a speaker switches from one language to another. Solaria-1 detects and transcribes these transitions automatically across 100+ supported languages.
Exponential backoff: A retry strategy where each successive retry waits longer than the previous one, preventing a cascade of requests from overwhelming a rate-limited API endpoint.
Async transcription: A non-blocking transcription workflow where audio is submitted for processing and the result is retrieved later via polling or webhook, typically completing in approximately 60 seconds per hour of audio.
Idempotency: The property of an operation where submitting the same request multiple times produces the same result without unintended side effects, such as duplicate transcription charges.
SOC 2 Type II: A third-party audit certification verifying that a service provider's security controls have been in place and operating effectively over a defined period, typically six to twelve months.
NER (named entity recognition): Automated identification and classification of named entities (people, organizations, locations, product names) within a transcript, returned as structured metadata alongside the full transcript text.