TL;DR:
- Gladia's /v2/pre-recorded async endpoint consolidates what most teams build across multiple vendors into a single POST request, handling transcription, diarization, NER, and summarization in one call.
- Use webhooks for production delivery; polling with exponential backoff serves as the fault-tolerant fallback.
- All audio intelligence features are included in the base hourly rate on Starter and Growth plans, no per-feature add-on charges.
- Solaria-1 supports 100+ languages with mid-conversation code-switching; diarization is powered by pyannoteAI Precision-2 and is available in async workflows only.
Most engineering teams obsess over LLM prompt design for their meeting note-takers while the transcription layer quietly corrupts every downstream output. A hallucinated speaker attribution silently populates a wrong name in your CRM. A missed entity produces a misleading coaching score. By the time anyone catches it, the damage is already two systems deep.
This guide walks through exactly how to integrate our async transcription API into a meeting note-taker: authentication, request structure, webhook configuration, fault-tolerant polling, error handling, and the production deployment decisions that determine whether the system holds up at scale. For a broader architectural overview before diving into code, the complete AI note-taker guide is worth reading first.
Configuring your Gladia API access
Get authentication and environment configuration correct before writing integration code. Debugging transcription accuracy on top of a misconfigured client wastes hours.
Retrieve your Gladia API key
Sign up at app.gladia.io, navigate to Home, and select "Generate new API key." The dashboard lets you generate and rotate keys without a support ticket. Every request passes this key in the x-gladia-key header, not as a query parameter or in the request body.
Store the key in an environment variable from day one. Hardcoding it into application code is the fastest path to a credential rotation incident you didn't plan for.
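A minimal sketch of loading the key at startup (the variable name GLADIA_API_KEY is a convention used in the examples below, not something the API requires):

```python
import os

# GLADIA_API_KEY is just a naming convention for these examples; use whatever
# your secret manager or deployment platform exposes.
GLADIA_API_KEY = os.environ.get("GLADIA_API_KEY")

if not GLADIA_API_KEY:
    # Fail fast at startup rather than on the first API call.
    raise RuntimeError("GLADIA_API_KEY is not set")
```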
Handling API throttling errors
The concurrency and rate limits documentation covers your tier's concurrent job ceiling. The HTTP 429 handler below gives you a retry wrapper you can drop into any client:
```python
import requests
import time

def transcribe_with_retry(audio_url, api_key, max_retries=3):
    headers = {'x-gladia-key': api_key, 'Content-Type': 'application/json'}
    payload = {'audio_url': audio_url, 'diarization': True}

    for attempt in range(max_retries):
        try:
            response = requests.post(
                'https://api.gladia.io/v2/pre-recorded',
                json=payload,
                headers=headers
            )
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                wait_time = 2 ** attempt
                print(f"Rate limited. Waiting {wait_time}s before retry...")
                time.sleep(wait_time)
                continue
            else:
                response.raise_for_status()
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

    raise Exception("Max retries exceeded")
```
Source: Gladia API reference - pre-recorded endpoint
Initialize your Gladia dev environment
Confirm these three prerequisites before submitting your first job:
- API key stored in environment variables, not in application code.
- Webhook endpoint registered and reachable from the public internet (or ngrok for local development).
- Test audio file that reflects real production conditions: multiple speakers, background noise, accented speech, and mid-conversation language switching if your users require it.
For teams building on orchestration pipelines, we provide a native integration with Pipecat that removes boilerplate from submission and retrieval loops.
Submitting audio for async transcription
Our /v2/pre-recorded endpoint is the primary async workflow. You POST a request with your audio source and feature configuration, receive a job ID immediately, and retrieve the result later. We document every parameter in the full endpoint reference.
API audio submission: URL or file?
Our /v2/pre-recorded endpoint accepts audio URLs, not raw file bytes. If you have a local file, upload it first using POST /v2/upload with multipart/form-data, retrieve the hosted URL from that response, then pass that URL to /v2/pre-recorded.
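A sketch of that two-step flow for a local file (the multipart field name and response field below are assumptions to verify against the upload endpoint reference):

```python
import requests

def upload_local_file(path, api_key):
    # Step 1: push the local file to /v2/upload and read back the hosted URL.
    with open(path, "rb") as f:
        upload = requests.post(
            "https://api.gladia.io/v2/upload",
            headers={"x-gladia-key": api_key},
            files={"audio": f},  # multipart field name: assumption, check the docs
        )
    upload.raise_for_status()
    return upload.json()["audio_url"]  # response field name: assumption, check the docs

# Step 2: pass the returned URL to /v2/pre-recorded as usual.
```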
For production meeting note-takers, the URL approach is the right default. Meeting recordings typically land in S3, GCS, or Azure Blob Storage. Generate a pre-signed URL from your object store and pass it directly. This pattern keeps payload sizes out of the submission request entirely and eliminates one class of timeout failures on large files. For a broader comparison of URL vs. file upload trade-offs in meeting assistant architectures, the async transcription and LLM guide covers the pipeline design in depth.
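A sketch of the pre-signed URL pattern with S3 and boto3 (bucket name, object key, and expiry are placeholders):

```python
import boto3
import requests

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Generate a time-limited URL for the recording sitting in your bucket.
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "your-meeting-recordings", "Key": "2024/meeting_recording.wav"},
    ExpiresIn=3600,
)

# Submit the pre-signed URL directly; no file bytes ever touch this request.
response = requests.post(
    "https://api.gladia.io/v2/pre-recorded",
    headers={"x-gladia-key": GLADIA_API_KEY},
    json={"audio_url": presigned_url, "diarization": True},
)
job = response.json()
```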
Structuring the async API request body
Here is a complete request body that enables the full audio intelligence suite for a meeting note-taker:
```json
{
  "audio_url": "https://your-bucket.example.com/meeting_recording.wav",
  "diarization": true,
  "diarization_config": {
    "number_of_speakers": 2,
    "min_speakers": 1,
    "max_speakers": 2
  },
  "translation": true,
  "translation_config": {
    "target_languages": ["en", "es", "fr"]
  },
  "summarization": true,
  "summarization_config": {
    "type": "general"
  },
  "named_entity_recognition": true,
  "detect_language": true,
  "enable_code_switching": true,
  "webhook_url": "https://your-backend.example.com/webhooks/gladia"
}
```
Source: Gladia API reference - pre-recorded endpoint
Required parameters: audio_url only. All audio intelligence features (diarization, translation, summarization, named_entity_recognition) are optional and default to false if omitted.
We include everything in this payload, including diarization, NER, and summarization, in the base hourly rate on Starter and Growth plans. There are no per-feature add-on charges to discover after scoping. Check our pricing page for the full feature matrix across tiers.
API setup: diarization and custom lexicon
Our diarization uses pyannoteAI's Precision-2 model as the default when you set diarization: true. No explicit model parameter is required. The Precision-2 integration handles speaker boundary detection and overlap disambiguation even in mono recordings. The speaker diarization documentation and the Gladia x pyannoteAI webinar both cover how this works at the model level.
For domain-specific vocabulary (product names, internal acronyms, executive names), pass a custom_vocabulary array in the request body. This is the most impactful parameter for reducing entity-level errors in company-specific meeting content, and it maps directly to the 39% fewer entity errors we achieve on key entities compared to leading alternatives.
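A sketch of a trimmed request body with the vocabulary array added (the terms shown are placeholders for your own product names and acronyms):

```python
payload = {
    "audio_url": "https://your-bucket.example.com/meeting_recording.wav",
    "diarization": True,
    # Domain-specific terms to bias recognition toward; values are illustrative.
    "custom_vocabulary": ["Acme CloudSync", "QBR", "Solaria-1"],
}
```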
Note: we provide diarization in async workflows only. It is not available in the real-time WebSocket API.
Retrieving transcription results by job ID
A successful /v2/pre-recorded POST returns HTTP 200 immediately with a job ID and result URL:
```json
{
  "id": "45463597-20b7-4af7-b3b3-f5fb778203ab",
  "result_url": "https://api.gladia.io/v2/transcription/45463597-20b7-4af7-b3b3-f5fb778203ab",
  "request_id": "G-45463597"
}
```
Source: Gladia API reference - pre-recorded endpoint
Persist both id and result_url to your database the moment you receive this response. If webhook delivery fails later, result_url is your fallback polling target.
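A minimal sketch of that persistence step, using SQLite as a stand-in for your real job store:

```python
import sqlite3

conn = sqlite3.connect("jobs.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS transcription_jobs "
    "(id TEXT PRIMARY KEY, audio_url TEXT, result_url TEXT, status TEXT)"
)

def persist_job(job, audio_url):
    # Store id and result_url the moment the POST returns; result_url is the
    # polling fallback if webhook delivery fails later.
    conn.execute(
        "INSERT OR IGNORE INTO transcription_jobs (id, audio_url, result_url, status) "
        "VALUES (?, ?, ?, ?)",
        (job["id"], audio_url, job["result_url"], "submitted"),
    )
    conn.commit()
```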
Polling vs. webhooks for note-taker data
Both retrieval patterns are valid, but they serve different contexts. Use this decision table before wiring your architecture:
| Method | Best for | Pros | Cons |
|---|---|---|---|
| Polling | CLI tools, short clips, batch dev jobs | Simple to implement, no public endpoint needed | Wastes resources, adds latency at scale |
| Webhooks | Production note-takers, SaaS products | Real-time notification, no idle resource consumption | Requires public endpoint and security validation |
For production note-takers, webhooks are the right default, with a polling fallback as the safety net.
Configuring exponential backoff for polling
Poll the result_url with exponential backoff. Do not poll on a fixed 1-second interval: at scale, aggressive polling pushes you into rate limit territory fast.
Python:
```python
import requests
import time

def poll_transcription_result(result_url, api_key):
    headers = {'x-gladia-key': api_key}
    base_wait = 2
    max_wait = 60

    while True:
        response = requests.get(result_url, headers=headers)
        data = response.json()

        if data.get('status') == 'done':
            return data['result']
        elif data.get('status') in ['queued', 'processing']:
            print(f"Status: {data['status']}, waiting...")
            time.sleep(base_wait)
            base_wait = min(base_wait * 1.5, max_wait)
        else:
            raise Exception(f"Transcription failed: {data}")
```
Source: Gladia API reference - pre-recorded workflow
JavaScript:
```javascript
async function pollTranscriptionResult(resultUrl, apiKey) {
  let baseWait = 2000;
  const maxWait = 60000;

  while (true) {
    const response = await fetch(resultUrl, {
      headers: { 'x-gladia-key': apiKey }
    });
    const data = await response.json();

    if (data.status === 'done') {
      return data.result;
    } else if (['queued', 'processing'].includes(data.status)) {
      console.log(`Status: ${data.status}, waiting...`);
      await new Promise(r => setTimeout(r, baseWait));
      baseWait = Math.min(baseWait * 1.5, maxWait);
    } else {
      throw new Error(`Transcription failed: ${JSON.stringify(data)}`);
    }
  }
}
```
Source: Gladia API reference - pre-recorded workflow
Webhook endpoint setup and payload structure
Pass webhook_url in the initial POST body. When transcription completes, we fire a POST to that URL containing the transcription_id. Use that ID to fetch the full structured result from result_url. Here is what the completed async payload looks like:
```json
{
  "id": "45463597-20b7-4af7-b3b3-f5fb778203ab",
  "status": "done",
  "result": {
    "transcription": {
      "full_transcript": "Full transcript text...",
      "utterances": [
        {
          "speaker": 0,
          "language": "en",
          "transcript": "Hello, this is speaker one.",
          "start_time": 0.5,
          "end_time": 3.2,
          "words": [
            {"word": "Hello", "start": 0.5, "end": 1.0, "confidence": 0.99}
          ]
        },
        {
          "speaker": 1,
          "language": "en",
          "transcript": "And I'm speaker two.",
          "start_time": 3.5,
          "end_time": 5.2,
          "words": []
        }
      ]
    }
  }
}
```
Source: Gladia API reference - pre-recorded endpoint
The utterances array is your LLM input. Each utterance carries speaker, language, transcript, word-level timestamps, and per-word confidence scores. Feed this directly into your summarization or action-item extraction prompt without any intermediate parsing layer. The audio-to-LLM documentation covers how to structure this pipeline end to end.
For webhook security, validate incoming requests by extracting transcription_id from the payload and verifying it exists in your job store before triggering any downstream processing (LLM calls, CRM writes, notification events). This guards against replay attacks and stale webhook calls.
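A sketch of that check as a Flask handler, reusing the SQLite job store from earlier (the payload field name and the downstream helper are assumptions to adapt to your stack):

```python
from flask import Flask, abort, request

app = Flask(__name__)

@app.route("/webhooks/gladia", methods=["POST"])
def gladia_webhook():
    body = request.get_json(silent=True) or {}
    # Field name based on the payload shown above; verify against your webhook logs.
    transcription_id = body.get("id")

    # Only jobs we actually submitted are allowed to trigger downstream work.
    row = conn.execute(
        "SELECT result_url FROM transcription_jobs WHERE id = ?", (transcription_id,)
    ).fetchone()
    if row is None:
        abort(404)

    # Fetch the full result from result_url, then hand off to LLM/CRM processing.
    enqueue_downstream_processing(transcription_id, row[0])  # hypothetical helper
    return "", 204
```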
Designing for fault-tolerant API calls
Gladia API error response formats and retry strategy
Map these status codes in your error handler before going to production:
| Status code | What it means | Retry strategy |
|---|---|---|
| 400 | Malformed request body or invalid parameter | Do not retry; log payload and fix code |
| 401 | API key invalid or expired | Do not retry; rotate key and alert on-call |
| 413 | Request payload exceeds server limit | Do not retry; split file before resubmitting |
| 500 | Gladia internal error | Retry with exponential backoff; alert if 3+ failures |
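One way to encode that table alongside the retry wrapper from earlier is a small predicate, sketched here:

```python
def should_retry(status_code, attempt, max_retries=3):
    # 400, 401, 413: client-side problems -- retrying will not help.
    if status_code in (400, 401, 413):
        return False
    # 429 and 5xx: back off and retry up to the cap, then alert.
    if status_code == 429 or status_code >= 500:
        return attempt < max_retries - 1
    return False
```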
Ensuring API stability with idempotency
We maintain 99.9%+ uptime, which you can verify at status.gladia.io. Build idempotency into your client by storing the job ID before retrying any submission. If a network failure means you are unsure whether your POST succeeded, check your job store before submitting again, because duplicate submissions are billable.
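A sketch of that check in front of the submit path, keyed on the audio URL via the job store sketched above:

```python
def submit_once(audio_url, api_key):
    # If a job already exists for this recording, reuse it instead of resubmitting:
    # duplicate submissions are billable.
    existing = conn.execute(
        "SELECT id, result_url FROM transcription_jobs WHERE audio_url = ?", (audio_url,)
    ).fetchone()
    if existing:
        return {"id": existing[0], "result_url": existing[1]}

    job = transcribe_with_retry(audio_url, api_key)
    persist_job(job, audio_url)
    return job
```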
"The API is straightforward and well documented, Making integration into our internal tools quick and easy. The speech to text quality for meetings, support calls, and voice notes has been consistently impressive." - Faes W. on G2
Uncaught payload issues in downstream LLMs
One failure mode that surfaces late is when your downstream LLM pipeline breaks silently because an expected field is absent from the transcription payload. NER entities, summarization blocks, and translation arrays only appear when they are successfully generated. Build defensive checks into your payload parser: if named_entity_recognition is missing from the result, log it and route to a fallback rather than letting a KeyError propagate. The common mistakes guide documents this pattern and others that engineering teams encounter in the first week of production.
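A sketch of that defensive parsing, assuming the result shape shown earlier (the names of the optional blocks are assumptions based on the feature parameters; verify against your own payloads):

```python
def extract_features(result):
    transcription = result.get("transcription", {})
    features = {
        "transcript": transcription.get("full_transcript", ""),
        "utterances": transcription.get("utterances", []),
        # Optional blocks only appear when they were successfully generated.
        "entities": result.get("named_entity_recognition"),
        "summary": result.get("summarization"),
    }
    if features["entities"] is None:
        # Log and route to a fallback instead of letting a KeyError propagate downstream.
        print("NER block missing from payload; skipping entity-driven CRM updates")
    return features
```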
Operationalizing Gladia for scale and cost
Async latency and comparison with real-time
Async processing runs at approximately 60 seconds per hour of audio. Claap measured this in production across their international multilingual user base. For post-meeting note delivery, this window is entirely acceptable: your users expect summaries after the meeting ends, not during it.
| Feature | Async API | Real-time WebSocket API |
|---|---|---|
| Processing latency | ~60 sec per hour of audio | ~300ms final transcript |
| Diarization (pyannoteAI Precision-2) | Yes | Not available |
| Best for | Meeting notes, post-call analysis | Live voice agents, live captions |
| File input | URL or uploaded file | Streaming audio chunks |
| Summarization, NER | Yes | Yes (as real-time add-ons) |
| Pricing (Starter) | $0.61/hr | $0.75/hr |
Forecasting transcription costs
We bill per hour of audio duration. All audio intelligence features (diarization, NER, translation, summarization) are included in the base rate on Starter and Growth plans, with no feature add-ons to budget for separately.
- Starter: $0.61/hr async, $0.75/hr real-time. Pay-as-you-go, 10 free hours per month. Customer data can be used for model training by default on this tier.
- Growth: as low as $0.20/hr async, as low as $0.25/hr real-time. Upfront commitment reduces per-hour cost. Customer data is never used for model training, and no opt-out is required.
- Enterprise: custom pricing with fine-tuning, debundled options, and on-premises deployment.
At 1,000 hours per month, the Starter plan runs $610. At Growth rates, that same volume drops to $200.
Compliance for AI note-taker data
We hold SOC 2 Type II, ISO 27001, HIPAA, and GDPR certifications. Full details are on the Gladia compliance hub. The data training policy deserves explicit attention before you sign anything:
- Starter plan: Customer audio can be used for model training by default.
- Growth and Enterprise plans: Customer audio is never used for model training. No opt-out action is required.
If your note-taker handles regulated conversations (financial services, healthcare, legal), Growth or above is the correct default. The no-retraining commitment is built into the plan structure, not buried in an enterprise contract addendum.
Resolving common integration hurdles
Supported audio formats and file limits
We accept WAV, M4A, FLAC, and AAC. Files must be under 1000 MB and 135 minutes in duration. For meetings exceeding 135 minutes, split the recording at a natural break and submit two jobs. Apply a time offset to the second payload's timestamps equal to the duration of the first file before merging the utterance arrays.
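A sketch of that merge step, assuming the utterance shape shown earlier:

```python
def merge_split_recordings(first_utterances, second_utterances, first_duration_s):
    # Shift the second file's timestamps by the first file's duration, then concatenate.
    shifted = []
    for utt in second_utterances:
        utt = dict(utt)
        utt["start_time"] += first_duration_s
        utt["end_time"] += first_duration_s
        # Word-level timestamps need the same offset.
        utt["words"] = [
            {**w, "start": w["start"] + first_duration_s, "end": w["end"] + first_duration_s}
            for w in utt.get("words", [])
        ]
        shifted.append(utt)
    return first_utterances + shifted
```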
Production accuracy: what to expect and how to test
Our Solaria-1 model achieves on average 29% lower WER than alternatives on conversational speech, benchmarked across 8 providers, 7 datasets, and 74+ hours of audio. In production, Claap achieved 1-3% WER across a multilingual international user base, while teams running self-hosted open-source ASR models typically report over 10% WER on real-world meeting audio without the expected infrastructure savings.
Do not evaluate on clean benchmark audio. Pull samples from your actual production distribution: cross-talk, background noise, accented speakers, and the language pairs your users actually speak.
Managing Gladia API rate limits
Concurrent job limits are tier-dependent. Check the concurrency documentation for your specific ceiling. For teams processing bursts of post-meeting recordings at end-of-day or after all-hands events, architect a queue on the client side. A Redis-backed Celery queue works well: jobs land in the queue as meetings finish, and workers drain the queue within your concurrency budget without hitting a 429.
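A minimal sketch of that shape with Celery (broker URL and retry settings are placeholders; the task reuses the retry wrapper and job store sketched earlier):

```python
from celery import Celery

celery_app = Celery("notetaker", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, max_retries=5, default_retry_delay=30)
def transcribe_meeting(self, audio_url):
    try:
        job = transcribe_with_retry(audio_url, GLADIA_API_KEY)
        persist_job(job, audio_url)
        return job["id"]
    except Exception as exc:
        # Celery's backed-off retry keeps end-of-day bursts inside the concurrency budget.
        raise self.retry(exc=exc)
```

Start the workers with a concurrency value at or below your tier's concurrent-job ceiling so the queue drains without ever triggering a 429.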
Multilingual audio handling with automatic detection
Solaria-1 detects language automatically when detect_language: true is set. When enable_code_switching: true is also set, the model tracks mid-conversation language transitions across all 100+ supported languages without resetting the session. No manual language specification is required. The model covers 42 languages not supported by any other API-level STT provider, including Tagalog, Bengali, Punjabi, Tamil, and Urdu, which matters for Contact Center as a Service (CCaaS) platforms and meeting tools serving Southeast Asia, South Asia, or multilingual EU markets. The code-switching documentation covers how to configure this correctly before your multilingual test run.
Start with 10 free hours and have your integration in production in less than a day. Get your API key and push your first proof-of-concept to staging this week.
FAQs
Can I use the Gladia async API for real-time meeting transcription?
No. The /v2/pre-recorded endpoint is asynchronous and designed for post-meeting audio. For live captions or voice agent pipelines, use the real-time WebSocket API, which targets approximately 300ms final transcript latency.
Is speaker diarization included in the base hourly rate?
Yes, on Starter and Growth plans. Diarization powered by pyannoteAI Precision-2 is included in the base hourly rate with no add-on charge, and it is available in async workflows only.
What happens to my audio data on the Starter plan?
On the Starter plan, customer audio can be used for model training by default. On Growth and Enterprise plans, your audio is never used for model training and no opt-out action is required.
What are Gladia's file size and duration limits for async transcription?
Files must be under 1000 MB and 135 minutes in duration. Enterprise users can request extended limits beyond these defaults.
Does Gladia require me to specify the language in the API request?
No. When detect_language: true is set, Solaria-1 detects language automatically. Adding enable_code_switching: true enables the model to track mid-conversation language changes across all 100+ supported languages without any manual specification.
What is the all-in cost for 10,000 hours per month with diarization enabled?
On Starter at $0.61/hr, 10,000 hours costs $6,100 with diarization included at no extra charge. On Growth at as low as $0.20/hr, the same volume costs $2,000.
How does Gladia's async WER compare to self-hosted open-source ASR models on meeting audio?
Teams running self-hosted open-source ASR models typically report over 10% WER on real-world meeting audio, while Claap achieved 1-3% WER in production using our async API. Our benchmarks show on average 29% lower WER than alternatives across 7 datasets and 74+ hours of conversational audio.
Key terms glossary
WER (word error rate): The percentage of words in the transcript that differ from the reference ground truth, calculated as (substitutions + deletions + insertions) / total reference words. Lower is better.
DER (diarization error rate): The percentage of audio duration incorrectly attributed to the wrong speaker or left unlabeled. Our async benchmark shows on average 3x lower DER than alternatives.
Diarization: The process of segmenting audio by speaker identity and labeling each utterance with a speaker ID. In our API, diarization runs on pyannoteAI Precision-2 and is available in async workflows only.
Code-switching: Mid-conversation language transitions where a speaker switches from one language to another. Solaria-1 detects and transcribes these transitions automatically across 100+ supported languages.
Exponential backoff: A retry strategy where each successive retry waits longer than the previous one, preventing a cascade of requests from overwhelming a rate-limited API endpoint.
Async transcription: A non-blocking transcription workflow where audio is submitted for processing and the result is retrieved later via polling or webhook, typically completing in approximately 60 seconds per hour of audio.
Idempotency: The property of an operation where submitting the same request multiple times produces the same result without unintended side effects, such as duplicate transcription charges.
SOC 2 Type II: A third-party audit certification verifying that a service provider's security controls have been in place and operating effectively over a defined period, typically six to twelve months.
NER (named entity recognition): Automated identification and classification of named entities (people, organizations, locations, product names) within a transcript, returned as structured metadata alongside the full transcript text.