TL;DR: Your LLM outputs are only as good as the transcript feeding them. Asynchronous transcription delivers higher accuracy, simpler infrastructure, and more predictable costs than real-time pipelines, making it the right architectural foundation for summaries, action items, and analytics. Feature-metered STT pricing collapses your unit economics at scale: on most platforms, diarization, sentiment analysis, and entity extraction each add a separate line item to your bill. An STT layer that bundles those features from the Starter tier and handles code-switching across 100+ languages removes the per-feature variables from your cost model.
Product teams routinely allocate engineering cycles to LLM prompt iteration while treating their STT layer as a settled decision. STT infrastructure choices compound: the billing model, feature bundling, language coverage, and accuracy baseline you lock in at 1,000 hours determine your unit economics and your engineering overhead at 10,000 hours. Getting the STT decision wrong means absorbing cost surprises and accuracy regressions that are structurally harder to unwind than any prompt change.
This guide walks through the end-to-end architecture of a production-ready meeting assistant: from audio ingestion and asynchronous transcription through webhook handling, LLM intelligence layers, and scaling considerations. Every architectural decision here has a cost implication, and we will model those costs explicitly.
The architecture of a modern meeting intelligence platform
A meeting assistant is fundamentally a data pipeline with an intelligence layer on top. The audio is the input, the structured transcript is the intermediate artifact, and summaries, action items, and Q&A are the outputs. Getting that intermediate artifact right is where most products either win or leak engineering time indefinitely.
Core components and data flow
The eight stages of a production async meeting assistant pipeline follow a clear sequence.
Each stage maps to actual infrastructure decisions:
- Audio capture: Recording bots or platform APIs (Zoom, Teams, Google Meet) produce audio files, typically in MP4 or M4A format, which you stage and pass to the transcription API.
- Temporary storage: Files are staged in object storage (S3, GCS) and a URL is generated. Our async API accepts audio URLs rather than direct file uploads to the transcription endpoint, so generating a pre-signed URL at this stage is required.
- POST to async API: A single POST request to the async transcription endpoint with your configuration payload kicks off transcription, with parameters for diarization, language detection, code-switching, custom vocabulary, and a callback URL.
- Immediate job ID response: The API returns instantly with a transcription id and a result_url. You do not block your application thread waiting for the transcript.
- Webhook notification: Configure a webhook endpoint in your Gladia account settings to receive a POST request when transcription completes. Alternatively, include a callback_url in the request body and we deliver the result directly to your endpoint.
- Transcript retrieval: Poll result_url until status: done, or receive the result via webhook. The response includes the full transcript, word-level timestamps, speaker labels, and detected languages.
- Text parsing and structuring: Parse the JSON response to extract diarized utterances. Each utterance carries a speaker ID, language tag, start and end timestamps, and per-word confidence scores, giving your LLM the context it needs to attribute action items accurately.
- LLM processing and storage: Pass the structured transcript to your LLM pipeline for summarization, action item extraction, or semantic search indexing, then write the outputs to your database and surface them in your UI.
The full Gladia transcription flow documentation covers each API stage in detail.
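Stage two above, staging audio and generating a pre-signed URL, can be sketched with boto3. The bucket name and key layout here are illustrative assumptions, and the presign call requires configured AWS credentials:

```python
import datetime


def make_object_key(meeting_id: str, ext: str = "mp4") -> str:
    """Build a date-partitioned object key for staged meeting audio."""
    today = datetime.date.today().isoformat()
    return f"meetings/{today}/{meeting_id}.{ext}"


def presign_audio_url(bucket: str, key: str, expires_s: int = 3600) -> str:
    """Generate a pre-signed GET URL the transcription API can fetch.

    Requires boto3 and AWS credentials; imported here so the pure
    key-builder above works without AWS dependencies installed.
    """
    import boto3

    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_s,
    )


# Usage (assumes a hypothetical bucket named "meeting-audio-staging"):
# url = presign_audio_url("meeting-audio-staging", make_object_key("abc123"))
```

Generate the URL as close to submission time as possible so it does not expire while the job sits in your submission queue.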
Real-time vs. asynchronous processing trade-offs
For meeting assistants, post-call analytics tools, and compliance recording workflows, async transcription is the right architectural default. Async systems have access to the full audio context before generating output, which improves punctuation accuracy, word disambiguation, and diarization quality.
Real-time systems must generate output within a few hundred milliseconds with limited context, which creates a structural accuracy ceiling regardless of model quality.
The infrastructure cost difference is also significant. A real-time WebSocket connection requires infrastructure running continuously, with bidirectional state management and strict latency budgets. Async processing handles audio after the fact, meaning no persistent connection management and lower compute cost per audio hour.
| Factor | Async (batch) | Real-time (WebSocket) |
|---|---|---|
| Accuracy | Higher, full-context processing | Lower, limited-context window |
| Latency | ~60 seconds per hour of audio | Sub-300ms partial transcripts |
| Infrastructure complexity | Low, stateless API calls | High, persistent connection management |
| Best for | Meeting summaries, analytics, compliance | Live captions, voice agents |
| Cost | Lower per audio hour | Higher per audio hour |
The latency budget for a meeting summary is measured in seconds, not milliseconds. A user who records a 45-minute meeting is willing to wait a few seconds for a summary, so the architectural overhead of a real-time WebSocket connection buys nothing here.
Reserve real-time for use cases where it genuinely matters: live agent assist, voice bots, and real-time captioning. The Real-Time Webinar covers those patterns in detail.
Build vs. buy: evaluating STT infrastructure for meeting assistants
For most product teams, the build-versus-buy question for STT infrastructure comes down to total cost of ownership (TCO) once you account for GPU provisioning, model maintenance, and missing features like diarization and code-switching.
Self-hosting Whisper looks cheap until you model the full cost:
- GPU infrastructure costs plus the engineering time to provision, scale, and maintain it
- Engineering cycles for model updates, hallucination mitigation, and uptime
- File size limits: OpenAI's hosted Whisper API caps at 25MB per request, and self-hosted inference imposes its own chunking limits based on GPU memory
- No native diarization (you build and maintain it separately)
- No code-switching detection out of the box
- Ongoing hallucination debugging on silence or low-signal audio
Self-hosting Whisper still leaves teams responsible for hallucination mitigation, multilingual robustness, diarization quality, and infrastructure reliability. At production scale, those do not stay edge cases. They become ongoing engineering work.
Build vs. buy comparison:
| Factor | Self-hosted Whisper | Managed API |
|---|---|---|
| Setup time | Weeks to months | Sub-24 hours |
| Diarization | Build and maintain separately | Included, powered by pyannoteAI |
| Code-switching | Not supported natively | Supported across 100+ languages |
| Hallucination mitigation | Manual model tuning | Built-in mitigation and full managed pipeline |
| File limits | Depends on GPU memory | 135 minutes, up to 500MB per file |
| Pricing | GPU + engineering overhead | From $0.20/hr (Growth) to $0.61/hr (Starter), async; all audio intelligence features included at the Starter tier and above |
| Scaling | You manage GPU provisioning | API handles concurrency |
Total cost of ownership and pricing models at scale
Feature-metered STT pricing compounds like interest: when diarization, sentiment analysis, NER, summarization, translation, and code-switching are each billed as separate line items, projecting your monthly total at scale requires modeling six independent cost variables rather than one. At the Starter tier and above, our pricing bundles those features into the base hourly rate, with per-second billing rather than 15-second rounding blocks.
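The impact of billing granularity is easy to model. A sketch of the arithmetic, assuming a hypothetical provider that rounds each file up to 15-second blocks (an illustrative policy, not any specific vendor's documented behavior):

```python
import math


def billed_seconds(duration_s: float, block_s: int = 1) -> int:
    """Round a clip's duration up to the nearest billing block."""
    return math.ceil(duration_s / block_s) * block_s


# 1,000 short clips of 61 seconds each, e.g. brief huddle recordings
clips = [61] * 1000
per_second = sum(billed_seconds(d, block_s=1) for d in clips)      # 61,000s billed
per_15s_block = sum(billed_seconds(d, block_s=15) for d in clips)  # 75,000s billed

overbilling = per_15s_block / per_second - 1
print(f"15-second rounding bills {overbilling:.0%} more audio")  # 23% more
```

Short files amplify the effect: the shorter the average clip, the larger the share of each billing block that is padding rather than audio.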
Cost model at 1,000 hours/month:
| Provider type | Base rate | Diarization in async | Sentiment analysis | Monthly total |
|---|---|---|---|---|
| Gladia Starter | $0.61/hr | Included | Included | $610 |
| Gladia Growth | As low as $0.20/hr | Included | Included | As low as $200 |
| Feature-metered (illustrative) | $0.15/hr | Add-on | Add-on | $450+ before NER or translation |
Cost model at 10,000 hours/month:
| Provider type | Base rate | Full audio intelligence stack | Monthly total |
|---|---|---|---|
| Gladia Growth | As low as $0.20/hr | Included | As low as $2,000 |
| Typical feature-metered API | $0.15/hr | Add-ons stack per feature | ~$4,500+ estimated before NER and translation add-ons (illustrative) |
Feature-metered estimates are illustrative, modeled on the add-on pricing structures published by providers like AssemblyAI and Deepgram. Actual costs vary by plan tier, currency, and which audio intelligence features are enabled; the figures above reflect a partial feature stack and will be higher once every feature a meeting assistant typically requires is priced in. Check each provider's current pricing page directly before modeling your own TCO. To be fair, feature-metered pricing can be the lower-cost option for teams that need basic transcription without diarization, NER, or sentiment analysis; the differential shifts once you enable the full audio intelligence stack a meeting assistant typically requires.
The feature-metered model looks cheaper at the base rate until you price in every feature a meeting assistant actually needs. Gladia's AssemblyAI pricing comparison and Deepgram pricing comparison walk through the add-on math in detail. When you enable the full audio intelligence stack, the all-inclusive model at the Starter tier and above means your cost at 10,000 hours is the hourly rate multiplied by hours, with no per-feature fee variables to account for separately. See Gladia's pricing page for the full tier breakdown.
Data privacy, SOC 2, and compliance requirements
Before you ship a meeting assistant to enterprise customers, answer these four security review questions with precision:
- Model retraining: Does the STT provider use your audio to retrain their models by default? On Gladia’s paid plans, customer data is not used for model training by default, and no manual opt-out is required. On the free plan, customer data may be used for model training by default.
- Certifications: We're SOC 2 Type II certified. Type I confirms controls are designed appropriately at a single point in time, while Type II confirms those controls are operating effectively over a sustained audit period, typically 6-12 months. That distinction matters in enterprise security reviews because Type II gives your legal and compliance teams evidence of ongoing operational effectiveness, not just policy documentation. We also hold ISO 27001 certification, covering the information security management system underpinning our infrastructure. We're also fully GDPR compliant (EU data privacy regulation). HIPAA alignment (US healthcare data protection standard) is also documented for regulated industries.
- Data residency: We support both EU-west and US-west infrastructure, allowing teams to align deployment with their data residency and customer requirements. On-premises and air-gapped hosting are available for organizations with strict data sovereignty requirements.
- Encryption: We encrypt all data transmitted to and from our infrastructure in transit via TLS. Audio stored temporarily during processing is encrypted at rest.
Our compliance guide covers GDPR, SOC 2, HIPAA, and ISO 27001 requirements in practical terms for STT vendor contract reviews.
Step-by-step tutorial: building the async transcription pipeline
Handling diverse audio, accents, and code-switching
Global meeting audio does not arrive clean. A 45-minute product review between a team based in Singapore, Paris, and Austin will contain background noise, overlapping speech, accented English, and mid-sentence switches between English and French or Tamil. Most STT APIs handle clean English audio well and degrade noticeably outside that condition.
Code-switching is the clearest differentiator. When a speaker shifts from English to French mid-sentence, most APIs either fail silently or return garbled output for the non-primary language segment.
Solaria-1 detects mid-utterance language transitions across 100+ languages, tags each segment with its identified language, and preserves speaker attribution across the switch without requiring manual configuration.
In async pipelines specifically, where the full audio file is available before output is generated, code-switching detection benefits from full-utterance context rather than the limited window available to real-time systems, which further improves language tag accuracy per segment.
Here's a simplified example of what this looks like in the API response (see the official API reference for the exact schema):
{
  "id": "45463597-20b7-4af7-b3b3-f5fb778203ab",
  "status": "done",
  "result": {
    "transcription": {
      "full_transcript": "Okay, that's a good point. Alors, on commence quand?",
      "languages": ["en", "fr"],
      "utterances": [
        {
          "text": "Okay, that's a good point.",
          "language": "en",
          "speaker": 0,
          "start": 0.21,
          "end": 2.35,
          "confidence": 0.98,
          "words": [
            {"word": "Okay", "start": 0.21, "end": 0.69, "confidence": 1.0},
            {"word": "that's", "start": 0.75, "end": 1.02, "confidence": 0.99},
            {"word": "a", "start": 1.05, "end": 1.12, "confidence": 0.98},
            {"word": "good", "start": 1.15, "end": 1.45, "confidence": 0.97},
            {"word": "point", "start": 1.50, "end": 2.35, "confidence": 0.99}
          ]
        },
        {
          "text": "Alors, on commence quand?",
          "language": "fr",
          "speaker": 1,
          "start": 3.10,
          "end": 5.45,
          "confidence": 0.96,
          "words": [
            {"word": "Alors", "start": 3.10, "end": 3.62, "confidence": 0.98},
            {"word": "on", "start": 3.70, "end": 3.95, "confidence": 0.95},
            {"word": "commence", "start": 4.02, "end": 4.68, "confidence": 0.97},
            {"word": "quand", "start": 4.75, "end": 5.45, "confidence": 0.94}
          ]
        }
      ]
    }
  }
}
Each utterance carries a speaker ID, a language tag, word-level timestamps, and confidence scores. This structured output gives downstream LLMs the per-utterance context required to attribute action items to the correct speaker and language without additional parsing or disambiguation. The pyannoteAI diarization webinar covers how pyannoteAI's Precision-2 model powers the diarization layer, and the diarization documentation details the configuration options available in the API.
"We have tested it across many many languages (we work with commentators in pro sports around the world) and have found great accuracy even with custom fields such as team names, player names, etc. We have never come across any sort of hallucination." - Xavier G., on G2
In a published customer case study, Claap, a meeting platform serving companies including Revolut, Kavak, and Qonto, reported achieving 1-3% WER in production using Gladia's async transcription API, with one hour of video transcribed in under 60 seconds. That result was measured across a multilingual international user base with varied accents and language code-switching, conditions that matter for global meeting assistants.
That production result reflects the async transcription architecture. Because the full audio file is available before output is generated, Gladia's pipeline can apply full-utterance context to language detection, accent handling, and speaker attribution in a single pass rather than processing incomplete windows.
The Python integration below follows that same async pattern: submit audio, poll for completion, retrieve structured output. For meeting intelligence products where transcription happens after recording, which covers the majority of meeting assistant, QA, and analytics use cases, this is the right architectural default. Real-time streaming remains available for use cases where it genuinely matters, such as live agent assist or captioning, but the async path is simpler to operate, more resilient under variable network conditions, and produces more consistent diarization results across multilingual audio.
Integrating the Gladia async API
This Python implementation submits audio and retrieves the structured transcript. It follows the official API reference and is suitable as a starting point for a production integration.
import requests
import time

GLADIA_API_KEY = "YOUR_GLADIA_API_TOKEN"
AUDIO_URL = "YOUR_AUDIO_FILE_URL"  # Pre-signed S3 or GCS URL
GLADIA_URL = "https://api.gladia.io/v2/pre-recorded"

HEADERS = {
    "x-gladia-key": GLADIA_API_KEY,
    "Content-Type": "application/json"
}

REQUEST_PAYLOAD = {
    "audio_url": AUDIO_URL,
    "diarization": True,
    "diarization_config": {
        "number_of_speakers": 2,  # or None for auto-detect
        "min_speakers": 1,
        "max_speakers": 6
    },
    "detect_language": True,
    "enable_code_switching": True,
    "sentiment_analysis": True,
    "summarization": True,
    "named_entity_recognition": True,
    "callback_url": "https://example.com/webhook/transcription"
}

def submit_transcription():
    response = requests.post(GLADIA_URL, headers=HEADERS, json=REQUEST_PAYLOAD)
    if response.status_code not in (200, 201):  # job creation may return 201
        raise Exception(f"Request failed: {response.status_code} - {response.text}")
    data = response.json()
    return data.get("id"), data.get("result_url")

def poll_for_result(result_url, poll_interval=5, max_attempts=120):
    for attempt in range(max_attempts):
        result = requests.get(result_url, headers=HEADERS).json()
        status = result.get("status")
        if status == "done":
            return result
        elif status in ["queued", "processing"]:
            time.sleep(poll_interval)
        else:
            raise Exception(f"Unexpected status: {status}. Resubmit if this is a server error.")
    raise TimeoutError("Transcription did not complete within the polling window.")

def extract_diarized_transcript(result):
    utterances = result.get("result", {}).get("transcription", {}).get("utterances", [])
    lines = []
    for u in utterances:
        speaker = f"Speaker {u.get('speaker', 'Unknown')}"
        lang = u.get("language", "")
        text = u.get("text", "")
        start = round(u.get("start", 0), 1)
        lines.append(f"[{start}s] {speaker} ({lang}): {text}")
    return "\n".join(lines)

job_id, result_url = submit_transcription()
result = poll_for_result(result_url)
diarized_transcript = extract_diarized_transcript(result)
print(diarized_transcript)
The API accepts audio files up to 135 minutes and 500MB in size. Check the official documentation for current limits, as these may change with product updates. Gladia's async transcription product page lists the full set of supported formats, including WAV, M4A, FLAC, AAC, MP3, and MP4, along with guidance on handling files that approach these thresholds.
"Gladia deliver real time highly accurate transcription with minimal latency, even accross multiple languages and ascents, The API is straightforward and well documented, Making integration into our internal tools quick and easy." - Faes W., on G2
Designing the LLM intelligence layer
With a clean, diarized, language-tagged transcript from the pipeline above, your LLM layer becomes significantly more tractable. The accuracy of speaker attribution and language detection at the STT stage directly determines the quality of the LLM outputs. An LLM cannot extract "action item assigned to Speaker A" from a transcript that does not reliably identify who Speaker A is.
The audio-to-LLM documentation covers how the API's built-in summarization and NER features work within the API response, which you may choose to use as a baseline before adding a custom LLM layer for more specific extraction tasks.
Prompt engineering for summaries and action items
The diarized transcript format from the extraction function above gives your LLM speaker attribution at every utterance. Structure your prompt to make that attribution explicit.
SYSTEM_PROMPT = """
You are an expert meeting analyst. You receive meeting transcripts with speaker labels,
timestamps, and detected languages. Extract structured meeting intelligence from the transcript.
"""
ACTION_ITEM_PROMPT = """
Analyze the following meeting transcript and extract all action items, decisions, and commitments.
For each action item:
- Identify the speaker label it was assigned to (e.g., Speaker 0, Speaker 1)
- Capture the specific task description verbatim where possible
- Note any deadline mentioned, or mark as "Not specified"
- Note any dependency on another action item
TRANSCRIPT:
{diarized_transcript}
Return a JSON object with this structure:
{{
"action_items": [
{{
"assigned_to": "Speaker 0",
"task": "Description of the action item",
"deadline": "Friday EOD or Not specified",
"context": "Brief quote or context from the meeting"
}}
],
"key_decisions": [
{{
"decision": "What was decided",
"made_by": "Speaker label or name if mentioned",
"timestamp": "Approximate time in seconds"
}}
],
"summary": "2-3 sentence meeting summary"
}}
"""
def run_llm_pipeline(diarized_transcript, llm_client):
    prompt = ACTION_ITEM_PROMPT.format(diarized_transcript=diarized_transcript)
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content
Follow these prompt engineering principles:
- Use the speaker label format exactly as it appears in the transcript. If you normalize "Speaker 0" to "Speaker A" in your extraction step, use that normalized form in the prompt as well, so the LLM can match labels consistently.
- Request JSON output with response_format: json_object to make downstream parsing deterministic.
- Include a timestamp field in your action item schema. Cross-referencing action items to their moment in the meeting is a high-value feature for users reviewing long calls.
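Even with response_format: json_object, the returned structure can drift between runs, so validate before writing to your database. A minimal sketch, where parse_action_items is an illustrative helper matching the schema in the prompt above, not part of any SDK:

```python
import json

# Keys every action item must carry, mirroring the prompt's schema
REQUIRED_ITEM_KEYS = {"assigned_to", "task", "deadline", "context"}


def parse_action_items(llm_output: str) -> list:
    """Parse and validate the LLM's JSON response, raising on schema drift."""
    data = json.loads(llm_output)
    items = data.get("action_items")
    if not isinstance(items, list):
        raise ValueError("Missing or malformed 'action_items' array")
    for item in items:
        missing = REQUIRED_ITEM_KEYS - item.keys()
        if missing:
            raise ValueError(f"Action item missing keys: {sorted(missing)}")
    return items


sample = (
    '{"action_items": [{"assigned_to": "Speaker 0", "task": "Send the deck",'
    ' "deadline": "Friday EOD", "context": "Discussed at 12m"}]}'
)
print(parse_action_items(sample)[0]["assigned_to"])  # Speaker 0
```

Failing loudly here is preferable to silently persisting a malformed record that breaks your UI later.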
Managing context windows and token limits
A 60-minute meeting with multiple speakers produces roughly 8,000-12,000 words of transcript text, which fits within the context windows of models like GPT-4o (128k tokens) or Claude 3.5 Sonnet (200k tokens). A two-hour all-hands with a 20-person team can push past 25,000 words, and you will want a chunking strategy.
Two approaches work well in production:
- Sliding window chunking: Split the transcript into 20-minute segments with a 2-minute overlap to preserve context at segment boundaries. Run the LLM over each chunk independently and merge the action item arrays at the end. To handle duplicates from the overlap zone, compute cosine similarity between action item embeddings (a lightweight sentence-transformer like all-MiniLM-L6-v2 works well here) and merge any pair scoring above 0.85, keeping the version from the chunk where the surrounding context is more complete. Exact string matches are rare in practice because the LLM paraphrases slightly across runs, so embedding similarity catches near-duplicates that naive string comparison misses.
- Hierarchical summarization: Summarize each 20-minute chunk into a 200-word intermediate summary, then run action item extraction over the concatenated summaries. Hierarchical summarization reduces tokens but loses detail on specific commitments; the sliding window approach preserves action item recall and is worth the additional LLM cost for most products.
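The sliding-window split maps directly onto the timestamped lines produced by the extraction function earlier. A sketch assuming each line starts with a [12.3s]-style timestamp, as that helper emits:

```python
import re


def chunk_transcript(lines: list, window_s: float = 1200, overlap_s: float = 120) -> list:
    """Group timestamped transcript lines into overlapping windows.

    Each line is expected to start with "[<seconds>s]", as produced by
    the extraction helper earlier in this guide.
    """
    def start_time(line: str) -> float:
        m = re.match(r"\[(\d+(?:\.\d+)?)s\]", line)
        return float(m.group(1)) if m else 0.0

    chunks, window_start = [], 0.0
    last = start_time(lines[-1]) if lines else 0.0
    while window_start <= last:
        window_end = window_start + window_s
        chunk = [l for l in lines if window_start <= start_time(l) < window_end]
        if chunk:
            chunks.append(chunk)
        # advance by the window minus the overlap so each boundary is covered twice
        window_start += window_s - overlap_s
    return chunks


# One line every 5 minutes across a 60-minute meeting
lines = [f"[{t}s] Speaker 0 (en): line at {t}" for t in range(0, 3600, 300)]
print(len(chunk_transcript(lines)))  # 4
```

Each chunk then goes through the LLM prompt independently; the 2-minute overlap means action items stated near a boundary appear in two chunks, which is what the embedding-based dedup step resolves.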
Production deployment and scaling considerations
Error handling, rate limits, and failover mechanisms
Moving from a working prototype to a production pipeline requires explicit handling for four failure modes.
Webhook timeouts: As a general best practice for webhook reliability, your webhook endpoint should respond with a 200 status as quickly as possible after receiving the transcription completion event. Acknowledge the webhook immediately and hand off processing asynchronously to a background worker or queue. This example uses an in-memory queue for simplicity. In production, use a durable message queue such as SQS or RabbitMQ to decouple webhook receipt from LLM execution.
from flask import Flask, request, jsonify
import queue

app = Flask(__name__)
processing_queue = queue.Queue()

@app.route("/webhook/transcription", methods=["POST"])
def handle_transcription_webhook():
    transcription_id = request.json.get("id")
    processing_queue.put(transcription_id)
    return jsonify({"status": "acknowledged"}), 200
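A minimal consumer for that queue looks like the sketch below. It is self-contained (it redefines the queue), and fetch_and_process is a hypothetical stand-in for your transcript retrieval and LLM steps; in production the worker would be a separate process reading from a durable queue:

```python
import queue
import threading

processing_queue = queue.Queue()
processed = []


def fetch_and_process(transcription_id: str) -> None:
    """Hypothetical handler: fetch the result_url, run the LLM pipeline, persist."""
    processed.append(transcription_id)


def worker() -> None:
    while True:
        transcription_id = processing_queue.get()  # blocks until a job arrives
        try:
            fetch_and_process(transcription_id)
        finally:
            processing_queue.task_done()  # lets join() know this job finished


threading.Thread(target=worker, daemon=True).start()

processing_queue.put("45463597-20b7-4af7-b3b3-f5fb778203ab")
processing_queue.join()  # wait for the queue to drain
print(processed)
```

The webhook handler stays fast because all it does is enqueue; the worker absorbs the variable latency of transcript retrieval and LLM calls.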
API rate limits: The API enforces rate limits on concurrent requests per plan tier. The Starter plan allows 25 async concurrent requests and 30 real-time concurrent requests. The Growth plan offers flexible concurrent requests. Build a queue-based submission layer that respects these limits rather than handling 429 errors reactively.
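One way to respect the 25-concurrent-request async limit on Starter is a semaphore-gated submission wrapper. Here submit_one is a hypothetical stand-in for the POST call from the integration section:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 25  # Starter-tier async concurrency limit
slots = threading.Semaphore(MAX_CONCURRENT)


def submit_one(audio_url: str) -> str:
    """Hypothetical wrapper around the async POST; returns a job id."""
    return f"job-for-{audio_url}"


def submit_with_limit(audio_url: str) -> str:
    # never hold more than MAX_CONCURRENT in-flight submissions
    with slots:
        return submit_one(audio_url)


with ThreadPoolExecutor(max_workers=50) as pool:
    urls = [f"https://example.com/a{i}.mp3" for i in range(100)]
    job_ids = list(pool.map(submit_with_limit, urls))
print(len(job_ids))  # 100
```

Gating on the client side keeps you under the limit by construction, instead of discovering it through 429 responses under load.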
Audio submission failures: Implement pre-flight validation before submitting to the API, and handle each failure case explicitly:
- File size (under 500MB): If the file exceeds this limit, return a clear error to the caller with the actual file size and the maximum allowed. For audio that approaches or exceeds 500MB, split it into smaller segments before submission; tools like ffmpeg -f segment handle this reliably. Ensure each segment is independently decodable rather than a raw byte split.
- Duration (under 135 minutes): Reject files that exceed the maximum duration and surface the detected duration in the error response. For longer recordings, split at natural silence boundaries using voice activity detection, or at a fixed interval with overlap to avoid cutting mid-utterance.
- Format (WAV, M4A, FLAC, AAC, MP3, or MP4): If the input format is unsupported, transcode it before submission rather than passing it through and handling the API error. A simple ffmpeg -i input.ogg -c:a libmp3lame output.mp3 conversion covers most cases. Log the original format so you can track which sources consistently require transcoding.
- Audio URL accessibility: Confirm the URL is publicly accessible or that the pre-signed URL has not expired before submitting. If the URL returns a non-200 status, surface that to the caller immediately; a common failure mode is pre-signed URLs expiring between generation and submission, so generate them as close to submission time as possible.
For files approaching any of these limits, consult the Gladia documentation for current thresholds, as these may vary by plan tier.
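The size and format checks can be sketched in a few lines; the duration check needs a media probe such as ffprobe, so it is left as a commented hook. The limits below are the ones quoted above; confirm current values in the documentation:

```python
import os

MAX_BYTES = 500 * 1024 * 1024  # 500MB
ALLOWED_EXTS = {".wav", ".m4a", ".flac", ".aac", ".mp3", ".mp4"}


def preflight(path: str) -> list:
    """Return a list of validation errors; an empty list means safe to submit."""
    errors = []
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        errors.append(f"File is {size} bytes; max allowed is {MAX_BYTES}")
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTS:
        errors.append(f"Unsupported format {ext!r}; transcode to MP3 first")
    # Duration check would go here, e.g. via ffprobe:
    #   ffprobe -v error -show_entries format=duration -of csv=p=0 <path>
    return errors
```

Returning a list of errors rather than raising on the first one lets you surface every problem with a file to the caller in a single response.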
Status polling fallback: Store the result_url in your database immediately after job submission. If webhook delivery fails due to a network issue or server restart, your polling loop acts as a fallback. If you receive a failure status, resubmit the audio, as this typically reflects a transient server error.
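The retry logic for transient failures can be sketched generically; the retryable status set and backoff schedule below are illustrative defaults, not documented API behavior:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}


def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Run call() (returning (status_code, body)), retrying transient failures."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        # exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 250ms
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
    raise RuntimeError(f"Gave up after {max_attempts} attempts (last status {status})")


# Simulated flaky endpoint: fails twice with 503, then succeeds
responses = iter([(503, ""), (503, ""), (200, "ok")])
status, body = with_backoff(lambda: next(responses), base_delay=0.01)
print(status, body)  # 200 ok
```

The jitter term spreads retries from concurrent workers over time, so a burst of failures does not turn into a synchronized retry stampede.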
Production checklist before going live:
- Webhook endpoint responds with 200 promptly and enqueues processing async (a common best practice to avoid webhook timeouts)
- result_url persisted in database on job creation
- Retry logic with exponential backoff on 429 and 5xx responses
- Audio pre-flight validation (size, duration, format) before submission
- enable_code_switching set to True for any multilingual user base
- Concurrency limits monitored against plan tier
- LLM token usage tracked per meeting to model cost at scale
- SOC 2 and GDPR documentation collected and included in your vendor security review
Our benchmarks page documents the latest evaluation across 8 STT providers, 7 datasets, and 74+ hours of audio. It includes conversational speech, accented speakers, and multilingual input, giving you a more practical reference point for validating accuracy claims against your own production audio.
Start with 10 free hours of async transcription and test the API on your own multilingual meeting audio to measure WER directly. Diarization, automatic language detection, and code-switching support are included from the Starter tier.
FAQs
How long does async transcription take to process one hour of pre-recorded audio?
In the Claap case study, one hour of audio reached status: done in under 60 seconds of wall-clock time from the initial POST submission to transcript availability.
What happens if my webhook endpoint does not receive the completion event?
Store the result_url returned by the initial POST response in your database immediately after job submission. If webhook delivery fails, your polling loop on result_url acts as an automatic fallback, polling every five seconds until status: done is returned.
Does enabling diarization, sentiment analysis, or NER add to the per-hour cost?
No. Diarization (powered by pyannoteAI's Precision-2 model), automatic language detection, and code-switching are included in the Starter tier base hourly rate. For the full feature availability breakdown by tier, including sentiment analysis, summarization, and NER, see the pricing page, as feature coverage varies across Starter, Growth, and Enterprise plans.
How many speakers can diarization handle accurately in a single meeting?
The diarization configuration accepts min_speakers and max_speakers parameters. Set max_speakers based on your typical meeting size distribution. For open-ended configurations, leave number_of_speakers unset and let automatic detection determine the count.
Key terms
Word error rate (WER): The standard accuracy metric for speech-to-text systems, calculated as (substitutions + deletions + insertions) / total reference words. WER should always be reported with a specific language, audio condition, and benchmark dataset, such as the public datasets used in Gladia’s latest benchmarks, to be comparable across providers.
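The WER formula maps directly onto word-level edit distance. A minimal sketch in pure Python, with no ASR tooling assumed:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    return dp[len(ref)][len(hyp)] / len(ref)


# One deletion ("the") and one substitution (friday -> thursday): 2 / 6 words
print(wer("we ship the summary on friday", "we ship summary on thursday"))
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is why it is an error rate rather than an accuracy percentage.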
Speaker diarization: The process of segmenting an audio recording and assigning each segment to the correct speaker. In a meeting context, diarization produces the "Speaker 0 said X, Speaker 1 said Y" structure that makes action item attribution possible. Our diarization is powered by pyannoteAI's Precision-2 model and runs as part of the async transcription pipeline.
Code-switching: The practice of alternating between two or more languages within a single conversation or even a single sentence. For global meeting assistants, handling code-switching accurately is a baseline requirement: bilingual teams and international meetings produce it at significant rates in production, and many APIs still handle it inconsistently.
Async transcription: A processing model where audio is submitted to an API and the transcript is returned asynchronously via webhook or polling, rather than streamed in real time. Async transcription provides higher accuracy, simpler infrastructure, and lower cost per audio hour compared to real-time WebSocket transcription, making it the appropriate architecture for post-meeting intelligence workloads.