
Building a meeting summarization pipeline: async STT + LLM in 5 steps

Published on Sep 25, 2026 by Ani Ghazaryan

Building a meeting summarization pipeline with async STT and LLM in 5 steps: audio ingestion, API integration, and prompt engineering.

TL;DR: Building a meeting summarization pipeline means pairing an async STT API with an LLM in a sequence where the transcription layer sets the accuracy ceiling for everything downstream. Async STT gives the model full audio context before producing output. This guide covers 5 concrete steps: configuring audio ingestion, integrating an async STT API via webhooks, validating diarized speech data, engineering LLM prompts, and formatting output. We built Solaria-1 to handle the STT layer with 100+ languages, true code-switching, and predictable per-hour pricing, with no model retraining on Growth and Enterprise plans, so you can ship to production in under a day.

Most engineering teams spend weeks tuning large language model (LLM) prompts for meeting summaries, only to discover that the real bottleneck is the transcription layer. If your STT fails on an accented speaker or drops a mid-sentence language switch, no amount of prompt engineering will fix the hallucinations that propagate into summaries, action items, and CRM entries. You'll find out when users start complaining about inaccurate meeting notes.

This guide breaks down how to build a production-ready meeting summarization pipeline using asynchronous STT and LLMs. We cover audio ingestion, transcript processing, prompt engineering, and output formatting, showing how to replace a fragmented infrastructure stack with a single API call.

Why async transcription matters for production summarization

When you're choosing between real-time and batch transcription for meeting summaries, the decision isn't primarily about user-facing latency. For post-meeting summaries, users accept a delay of seconds or minutes in exchange for output that's reliable. The real choice is about accuracy, diarization quality, and infrastructure TCO.

STT mode selection for your pipeline

Real-time streaming is the right tool for live captions and voice agent turn-detection. For meeting summaries, it forces a trade-off you don't need to make: partial transcripts lack the full audio context needed for accurate punctuation, word disambiguation, and speaker boundary detection.

Batch processing analyzes the complete recording before generating output, which directly improves all three. Here's how the two modes compare for this use case:

| Mode | Latency | Accuracy ceiling | Diarization | Best for |
| --- | --- | --- | --- | --- |
| Async (batch) | ~60s per hour of audio | Higher (full context) | pyannoteAI Precision-2 (async only) | Meeting summaries, post-call analytics, note-takers |
| Real-time (streaming) | ~300ms final transcript | Lower (partial context) | Not available (diarization is async-only) | Live captions, voice agent assist |

For meeting summarization, latency matters less than three other considerations:

  • Latency tolerance: A summary generated 60 seconds after the meeting ends is indistinguishable, from a user-experience standpoint, from one generated 10 minutes after.
  • WER and DER impact: Summary quality depends on transcription accuracy, not on how fast the transcription arrives.
  • User expectations: Post-meeting workflows naturally accommodate seconds or minutes of processing delay.

Async STT: cost and latency impact

Maintaining persistent WebSocket connections at scale adds infrastructure overhead that batch workflows don't require. With async, you POST audio to an endpoint, receive a webhook callback when processing completes, and handle the result. No connection management, no reconnection logic, no partial transcript stitching.

Pricing reflects this architectural simplicity. On our Growth plan, async runs as low as $0.20/hr versus $0.25/hr for real-time streaming. At 10,000 hours per month, that's $2,000 versus $2,500, before accounting for the infrastructure savings from webhook-driven over connection-held architectures.

Customers report saving 20%+ of DevOps sprint capacity after migrating off self-hosted models, capacity previously consumed by GPU provisioning, version management, and stability issues.

Selecting async for meeting summaries

The Claap case study puts this in concrete terms: one hour of video transcribed in under 60 seconds, at 1-3% WER in production across a multilingual international user base with varied accents and code-switching. That's the async advantage. Batch processing provides the full audio context before output, which is exactly what downstream LLM components depend on for reliable summaries.

Data flow in a meeting summarization pipeline

Before writing a line of integration code, map the data flow. A production pipeline has four stages, each of which can silently degrade the stages that follow it.

Preparing audio for transcription

Audio arrives from one of three sources in most meeting assistant architectures: a meeting bot SDK (bot-based platforms such as Recall or MeetingBaaS that join and record calendar meetings), a native recording integration, or a direct upload from the end user. The format, sample rate, and file size vary across all three, and your ingestion layer needs to normalize them before they reach the STT API.

Async STT API integration

The STT layer determines WER and DER for everything downstream. Your LLM prompts, CRM syncs, and coaching scorecards all inherit the accuracy of this layer, so getting it right matters. A wrong name in the transcript becomes a wrong name in the CRM entry. A misattributed speaker produces a coaching score built on the wrong agent's words.

We built Solaria-1 to cover 100+ languages with robust code-switching, and 42 of those languages aren't covered by any other API-level STT provider.

For meeting assistants serving multilingual teams, that's the difference between transcripts that hold up across your entire user base and ones that degrade quietly for non-English speakers. Our supported languages documentation lists the full coverage.

LLM for meeting summaries

The LLM stage consumes the structured JSON from the STT layer, specifically the diarized utterances with speaker labels and word-level timestamps. If those are accurate, the LLM can reliably extract action items, decisions, and attributions, but if they're not, the LLM will hallucinate in ways that look plausible enough to be worse than an obvious failure.

Storing pipeline outputs effectively

Store both the raw transcript JSON and the formatted LLM summary. Raw transcripts are your ground truth for debugging and reprocessing. Formatted summaries are what your frontend consumes. Use a structured store (Postgres, DynamoDB) for the JSON and a document store or object storage for the formatted output.

Step 1: Configure meeting audio source

Supported audio formats and conversion

Meeting bots and native recorders deliver audio in inconsistent formats. Before calling any STT API, you need to normalize them. The Gladia async API accepts:

  • Supported formats: WAV, M4A, FLAC, AAC
  • URL ingestion: Direct URLs pointing to hosted audio
  • Format conversion: If your meeting bot delivers audio in an unsupported format like WebM, convert using ffmpeg before ingestion

For example, converting WebM to WAV:

ffmpeg -i input.webm -ar 16000 -ac 1 output.wav

Mono 16kHz is the standard target for STT APIs and reduces file size without meaningful accuracy impact.
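For programmatic ingestion, the same conversion can be wrapped in a small helper. This is a sketch; the helper names and subprocess wrapper are ours, not part of any Gladia SDK:

```python
import subprocess

def build_ffmpeg_cmd(input_path: str, output_path: str,
                     sample_rate: int = 16000, channels: int = 1) -> list[str]:
    """Build the ffmpeg argument list for the mono 16kHz WAV conversion above."""
    return [
        "ffmpeg", "-y",            # overwrite output if it already exists
        "-i", input_path,          # source file (e.g. WebM from a meeting bot)
        "-ar", str(sample_rate),   # resample to 16 kHz
        "-ac", str(channels),      # downmix to mono
        output_path,
    ]

def convert_to_wav(input_path: str, output_path: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(input_path, output_path), check=True)
```

Keeping the command construction separate from execution makes the normalization step unit-testable without ffmpeg installed.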

File size and duration validation

Before sending audio to the API, validate against our limits: files must not exceed 1,000MB in size or 135 minutes in duration. Validate early and fail loudly so errors don't surface downstream at the LLM stage:

import os

MAX_FILE_SIZE_MB = 1000
MAX_DURATION_MINUTES = 135

def validate_audio(file_path: str, duration_minutes: float) -> None:
    """Fail loudly before the file ever reaches the STT API."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    if file_size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(f"File size {file_size_mb:.1f}MB exceeds 1000MB limit")
    if duration_minutes > MAX_DURATION_MINUTES:
        raise ValueError(f"Duration {duration_minutes}min exceeds 135-minute limit")

Set up storage and webhook events

Async workflows depend on webhooks for result delivery. Follow this pattern:

  1. Configure a webhook endpoint that can receive POST callbacks from the Gladia API.
  2. Store audio in object storage (S3, GCS) with a pre-signed URL you'll pass to the transcription request.
  3. Never poll for results: polling creates unnecessary load and delays result handling by reintroducing latency proportional to your polling interval.

Set up your audio upload endpoint

Your ingestion endpoint receives the audio file, validates it, stores it, and queues a transcription job. Keep it lightweight: validate the file, store it in object storage, enqueue the transcription job, and return 202 Accepted. The actual transcription work happens asynchronously.
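A minimal, framework-agnostic sketch of that pattern; the `store` and `queue` objects here are hypothetical stand-ins for your object-storage and job-queue clients:

```python
import uuid

MAX_FILE_SIZE_MB = 1000
MAX_DURATION_MINUTES = 135

def handle_upload(filename: str, size_mb: float, duration_min: float,
                  store, queue) -> tuple[int, dict]:
    """Validate, store, enqueue, and return (HTTP status, response body)."""
    # 1. Validate before doing any work
    if size_mb > MAX_FILE_SIZE_MB or duration_min > MAX_DURATION_MINUTES:
        return 400, {"error": "file exceeds 1,000MB / 135-minute limits"}
    # 2. Store in object storage (store wraps your S3/GCS client)
    audio_id = str(uuid.uuid4())
    audio_url = store.put(audio_id, filename)
    # 3. Enqueue the transcription job for async processing
    queue.enqueue({"audio_id": audio_id, "audio_url": audio_url})
    # 4. Return 202 Accepted -- the heavy work happens later
    return 202, {"job": audio_id, "status": "queued"}
```

Wire this into whatever web framework you already run; only the four-step shape matters.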

Step 2: Enable fast STT for meeting summaries

API key authentication for STT

API key handling is where most production integrations leak credentials. Pass your Gladia API key in the x-gladia-key header on every request, and never embed keys in client-side code or commit them to version control. Use environment variables or a secrets manager instead.

Initiating async transcription requests

Structure the payload to enable diarization and set realistic speaker count bounds. The audio intelligence features documentation covers all available parameters.

Async result handling: polling vs. webhooks

Webhooks are the correct pattern for async workflows because they eliminate the polling loop that would otherwise run every few seconds checking whether transcription is complete. Configure callback_url in your request payload, and your webhook handler processes the result when it arrives. This keeps your infrastructure costs proportional to actual usage and removes polling latency from your end-to-end time.

Sending async STT API requests

Here's a Python example that sends an async transcription request with diarization and a webhook callback enabled:

import requests
import os

GLADIA_API_KEY = os.environ["GLADIA_API_KEY"]
GLADIA_API_URL = "https://api.gladia.io/v2/pre-recorded"

def initiate_transcription(audio_url: str, webhook_url: str) -> str:
    payload = {
        "audio_url": audio_url,
        "diarization": True,           # Enable speaker attribution
        "diarization_config": {
            "min_speakers": 1,
            "max_speakers": 8          # Adjust based on your meeting size
        },
        "detect_language": True,       # Auto-detect and handle code-switching
        "callback_url": webhook_url,
        "callback_config": {
            "on_complete": True        # Webhook fires when transcription finishes
        }
    }

    headers = {
        "x-gladia-key": GLADIA_API_KEY,
        "Content-Type": "application/json"
    }

    response = requests.post(GLADIA_API_URL, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()["id"]  # Store this for result correlation

The API returns a job ID immediately. Your webhook receives the full transcript JSON when processing completes, typically within 60 seconds for a one-hour recording.
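The receiving side of that callback can be sketched as follows. The payload field names here are illustrative, so check the API reference for the exact callback schema; `job_store` and `summarize` are hypothetical stand-ins for your persistence layer and downstream LLM step:

```python
def handle_transcription_webhook(payload: dict, job_store, summarize) -> int:
    """Process a completed-transcription callback; returns an HTTP status code."""
    job_id = payload.get("id")
    if not job_id:
        return 400  # malformed callback
    if job_store.already_processed(job_id):
        return 200  # idempotent: duplicate delivery, acknowledged without rework
    # Field names below are assumptions for illustration, not the exact schema
    transcript = payload.get("payload", {}).get("transcription", {})
    job_store.save_raw(job_id, transcript)   # raw JSON is your ground truth
    summarize(job_id, transcript)            # hand off to the LLM stage
    job_store.mark_processed(job_id)
    return 200
```

Returning 200 on duplicates matters: webhook senders typically retry non-2xx responses, so a duplicate must be acknowledged, not rejected.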

Step 3: Validate speech data for LLM accuracy

Validating speaker attribution

Diarization is the process of identifying which speaker said what and when. For meeting summaries, it's what makes the difference between "someone said we need to move the deadline" and "Sarah said we need to move the deadline, and James agreed."

We run diarization on pyannoteAI's Precision-2 model, which delivers up to 3x lower DER than alternatives, better overlap handling, and cross-language consistency. This is an async-only feature, which is another reason async is the correct architecture for post-meeting summarization.

Precise timestamps for LLM input

Word-level timestamps serve two purposes in a meeting pipeline. First, they enable chapterization (breaking the transcript into logical segments by topic).

Second, they allow the LLM to generate citations that link back to the exact moment in the recording when a decision was made. Extract utterances with their speaker labels and time boundaries before passing to the LLM:

def extract_diarized_utterances(transcript_response: dict) -> list[dict]:
    utterances = []
    for utterance in transcript_response.get("utterances", []):
        utterances.append({
            "speaker": utterance["speaker"],
            "start": utterance["start"],
            "end": utterance["end"],
            "text": utterance["transcript"]
        })
    return utterances

Implementing robust retry strategies

Network failures and API timeouts happen in production. Build your webhook handler to be idempotent, so processing the same result twice doesn't create duplicate summaries, and configure your job queue for at-least-once delivery with exponential backoff. Log the job ID we return in the initial transcription response so your retry logic can check whether the original job completed before re-submitting, rather than re-queuing jobs that already finished.
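Both pieces, a capped exponential backoff schedule and an idempotent wrapper around the result handler, can be sketched in a few lines (the class and function names are ours):

```python
def backoff_schedule(base_s: float = 1.0, factor: float = 2.0,
                     max_retries: int = 5, cap_s: float = 60.0) -> list[float]:
    """Capped exponential backoff delays for at-least-once job retries."""
    return [min(base_s * factor ** i, cap_s) for i in range(max_retries)]

class IdempotentProcessor:
    """Wraps a result handler so duplicate webhook deliveries become no-ops."""

    def __init__(self, handler):
        self.handler = handler
        self.seen: set[str] = set()  # in production, back this with your database

    def process(self, job_id: str, payload: dict) -> bool:
        if job_id in self.seen:
            return False  # duplicate delivery: already handled, skip safely
        self.handler(job_id, payload)
        self.seen.add(job_id)
        return True
```

The in-memory set is only for illustration; a real deployment records processed job IDs in durable storage so idempotency survives restarts.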

Core transcript processing code

Parse the webhook payload and build the formatted string your LLM prompt expects:

def format_transcript_for_llm(utterances: list[dict]) -> str:
    lines = []
    for u in utterances:
        timestamp = f"[{u['start']:.1f}s - {u['end']:.1f}s]"
        lines.append(f"[{u['speaker']}] {timestamp}: {u['text']}")
    return "\n".join(lines)

Step 4: Engineer prompts for meeting summaries

LLM context for accurate summaries

The diarized transcript is your LLM's primary input. Inject it as the user message and constrain the model's output structure via the system prompt. Do not ask the LLM to infer speaker identity from context. If diarization did its job, speaker attribution is already in the transcript, and the LLM should use it directly rather than guessing.

Reusable meeting summary prompts

Design your system prompt to constrain the LLM's output structure so you can parse it reliably. Here's a concrete example that produces consistent, structured JSON output:

SYSTEM_PROMPT = """
You are a meeting analyst. You receive transcripts with speaker labels and timestamps.
Extract the following from each transcript:
- A 3-5 sentence summary of the meeting
- All action items, each with an assignee if identifiable and a due date if mentioned
- All decisions made
- Open questions that were raised but not resolved

Return JSON only. Use this schema:
{
  "summary": "string",
  "action_items": [{"assignee": "string", "task": "string", "due": "string or null"}],
  "decisions": ["string"],
  "open_questions": ["string"]
}
"""

This schema gives you typed data structures you can insert directly into a database without fragile string parsing.

Handling long transcripts with chunking

Most LLM context windows handle 90-minute meetings comfortably in 2026, but for longer recordings or higher speaker counts, chunk by topic using the chapterization timestamps from our summarization feature.

Process each chapter independently and merge the outputs, which also improves summary granularity because the LLM operates on a focused segment rather than a 10,000-token wall of text.
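The merge step can be as simple as concatenating the chapter summaries and pooling the extracted lists. A minimal sketch, assuming each chapter was summarized with the JSON schema shown earlier:

```python
def merge_chapter_summaries(chapters: list[dict]) -> dict:
    """Merge per-chapter LLM outputs into a single meeting-level summary dict."""
    merged = {"summary": "", "action_items": [],
              "decisions": [], "open_questions": []}
    parts = []
    for ch in chapters:
        parts.append(ch["summary"])                       # keep chapter order
        merged["action_items"].extend(ch["action_items"])
        merged["decisions"].extend(ch["decisions"])
        merged["open_questions"].extend(ch["open_questions"])
    merged["summary"] = " ".join(parts)
    return merged
```

For long meetings you might add a final LLM pass that condenses the concatenated chapter summaries, but simple concatenation preserves every extracted item.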

LLM summary API call example

import anthropic
import json

def generate_meeting_summary(transcript: str) -> dict:
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-opus-4-6",  # Released February 2026, recommended for complex extraction
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": transcript}]
    )

    # Assumes the model returns bare JSON per the system prompt; in production,
    # catch json.JSONDecodeError and retry or strip stray formatting first.
    return json.loads(response.content[0].text)

Step 5: Deliver actionable meeting insights

Organizing LLM summary output

The JSON output from your LLM call is your internal representation. Don't expose raw JSON to the frontend. Translate it to the format each consumer expects: Markdown for in-app note views, a structured payload for CRM webhooks, and a summary string for email notifications.
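A minimal renderer for the in-app Markdown view might look like this sketch (the heading layout is an assumption; adapt it to your frontend):

```python
def summary_to_markdown(summary: dict) -> str:
    """Render the internal summary dict as Markdown for in-app note views."""
    lines = ["## Meeting summary", "", summary["summary"], "", "### Action items"]
    for item in summary["action_items"]:
        due = f" (due {item['due']})" if item.get("due") else ""
        lines.append(f"- **{item['assignee']}**: {item['task']}{due}")
    lines.append("")
    lines.append("### Decisions")
    lines.extend(f"- {d}" for d in summary["decisions"])
    lines.append("")
    lines.append("### Open questions")
    lines.extend(f"- {q}" for q in summary["open_questions"])
    return "\n".join(lines)
```

The same internal dict feeds a different renderer per consumer: this one for notes, a flat payload for CRM webhooks, a one-line string for email subjects.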

Pinpointing meeting action items

Our Named Entity Recognition (NER) runs on the transcript and tags entities like names, organizations, and dates before the LLM stage. Use these tags to cross-reference assignees in your action item extraction. If NER identifies "Marcus" as a person entity and the LLM extracts an action item for "Marcus," you have a high-confidence match to a specific contact record.
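The cross-reference itself can be a simple set lookup. This sketch assumes your NER stage hands you a flat list of person names; the function name and exact-match heuristic are ours, and production matching would handle nicknames and partial names:

```python
def match_assignees(action_items: list[dict],
                    person_entities: list[str]) -> list[dict]:
    """Flag action items whose assignee matches an NER-tagged person entity."""
    known = {name.lower() for name in person_entities}
    results = []
    for item in action_items:
        assignee = (item.get("assignee") or "").lower()
        # high_confidence means the LLM's assignee matches a tagged person
        results.append({**item, "high_confidence": assignee in known})
    return results
```

Low-confidence items are the ones worth surfacing for human review before they sync to a contact record.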

Output format options: JSON, Markdown, PDF

Match your output format to the consuming system:

  • JSON: CRM webhooks, internal APIs, database storage
  • Markdown: In-app note views, Notion, Confluence
  • PDF: Compliance archives, regulated industries requiring immutable records

Code for output delivery pipeline

from datetime import datetime, timezone
import json

def save_meeting_summary(meeting_id: str, summary: dict, db_client) -> None:
    db_client.table("meeting_summaries").upsert({
        "meeting_id": meeting_id,
        "summary": summary["summary"],
        "action_items": json.dumps(summary["action_items"]),
        "decisions": json.dumps(summary["decisions"]),
        "open_questions": json.dumps(summary["open_questions"]),
        "processed_at": datetime.now(timezone.utc).isoformat()  # utcnow() is deprecated
    })

Operationalizing your meeting AI solution

Choosing queue vs. serverless for your pipeline

Meetings tend to end at the top of the hour, which can produce large bursts of simultaneous transcription jobs. A job queue (SQS, Pub/Sub, RabbitMQ) absorbs these spikes without over-provisioning. Serverless execution (Lambda, Cloud Run) works well for the LLM stage, where execution time is bounded and predictable. The two patterns complement each other: queue for ingestion and transcription job management, serverless for summary generation on each completed transcript.

Production latency for STT/LLM pipeline

Set user-facing expectations based on realistic end-to-end timings. For a one-hour meeting:

  • Async transcription: ~60 seconds
  • LLM summary generation: Varies by model and transcript length
  • Total end-to-end: Typically under 2 minutes from meeting end to available summary

That's well within what users accept for post-meeting notes, where the expectation is "summary appears before I switch contexts" rather than "summary appears instantly." For common mistakes that affect this latency, see our implementation guide.

Optimizing summarization pipeline costs

Model your costs at realistic volume before committing to a plan. On our Growth tier:

  • 10,000 hours/month of async transcription at $0.20/hr = $2,000/month
  • All audio intelligence features included: diarization, NER, translation, sentiment, custom vocabulary
  • No add-on fees for individual features, which eliminates the invoice surprises common with per-feature metering

Compare this to stacking providers: one for transcription, one for diarization, one for NER. Each adds per-feature cost, a new API contract, and another integration point where data can degrade. Our pricing page shows current per-hour rates for each plan.

Retry strategies for pipeline reliability

Every transcription job should have a dead-letter queue for failed or timed-out jobs. Log the job ID we return at initiation so your retry logic can check whether the original job completed before re-submitting.

Our infrastructure runs at 99.9%+ uptime, which you can verify on our status page, but your retry logic should be robust regardless of provider.

Addressing common pipeline implementation challenges

Optimizing async STT latency

We process approximately one hour of audio in under a minute in batch mode. For a 30-minute team standup, expect results in under 30 seconds. For a 90-minute design review, under 90 seconds.

These numbers hold at scale because we spin up parallel processing capacity without pre-provisioning on your side, so there's no capacity planning or burst quotas to manage.

Forecasting pipeline costs per hour

Build your cost model at three volume points: current, 5x, and 10x. Use this as a starting framework:

Growth plan cost model: async transcription at $0.20/hr

| Volume | Async hours/month | Cost at $0.20/hr | Features included |
| --- | --- | --- | --- |
| 1,000 hrs/mo | 1,000 | $200 | Diarization, NER, translation, sentiment, custom vocabulary |
| 5,000 hrs/mo | 5,000 | $1,000 | All above |
| 10,000 hrs/mo | 10,000 | $2,000 | All above |

The STT layer stays linear because pricing is per hour of audio. Add your LLM costs separately based on the token volume your transcript density generates and the model you choose.

Building multilingual STT pipelines

For teams serving multilingual users, the transcription layer is where language support either holds or breaks. We built Solaria-1 to cover 100+ languages, with up to 29% lower WER than alternatives on conversational speech (a figure validated across 8 providers, 7 datasets, and 74+ hours of audio, as detailed in our benchmark methodology) and true code-switching support for mid-conversation language changes.

That 42-language coverage advantage over competing APIs matters specifically for CCaaS and meeting assistant teams with users in Southeast Asia, South Asia, and Latin America.

On-premise pipeline: build vs. managed?

The honest TCO calculation for self-hosting an open-source STT model at production scale includes GPU compute, DevOps engineering time for provisioning and version management, a separate diarization provider (since most open-source STT models don't include production-grade speaker attribution), and the WER gap.

Based on our benchmark testing across 74+ hours of audio, open-source STT models in our comparison averaged over 10% WER on conversational speech, which means transcription errors compound into every downstream system your LLM touches.

Here's how the trade-offs break down:

| Factor | Self-hosted open-source STT | Gladia managed API (Growth) |
| --- | --- | --- |
| Licensing cost | $0 | $0.20/hr (async) |
| GPU compute (1,000 hrs/mo) | Variable, dedicated GPU required | Included |
| DevOps engineering overhead | 20%+ of sprint capacity | $0 |
| Diarization | Separate provider or none | Included (pyannoteAI Precision-2) |
| WER on conversational speech | 10%+ (benchmark avg.) | Up to 29% lower than alternatives |
| SOC 2 Type II, GDPR compliance | Build and certify yourself | Included |
| Data retraining risk | Depends on provider | Never on Growth/Enterprise |

The compliance column is where self-hosting tends to be underestimated. If your product handles regulated conversations, achieving SOC 2 Type II certification for a self-hosted audio pipeline requires significant engineering investment that rarely appears in initial build-vs-buy estimates.

On Growth and Enterprise plans, your audio data is never used to retrain our models. No opt-out is required, and no contract clause needs to be found. On the Starter plan, data is used for training by default.

For teams evaluating this decision, the AI note-taker architecture guide walks through the full stack comparison with production configurations.

And for teams using async transcription to power downstream sales intelligence workflows, the Attention x Gladia integration webinar shows how the same async pipeline powers CRM population, coaching scorecards, and conversation analytics from a single transcription call.

The async STT layer is the foundation your meeting AI product stands on. Get it right and every downstream component inherits that accuracy: summaries, action items, CRM entries, coaching scores. Get it wrong and you're debugging LLM hallucinations that originate in the transcription layer, which is the harder problem to track down because the failure mode looks like a prompt engineering issue rather than a data quality issue.

Test Gladia on your own multilingual audio to see how it handles language detection, accent-heavy speech, and code-switching. Start with 10 free hours and have your integration in production in less than a day.

FAQs

How long does async transcription take for a 60-minute meeting?

We process approximately one hour of audio in under a minute in batch mode, so a 60-minute recording typically completes in under 60 seconds. Processing time scales efficiently with audio duration rather than increasing disproportionately for longer files.

Does Gladia use my meeting audio to train its models?

On Growth and Enterprise plans, your audio data is never used for model training and no opt-out action is required. On the Starter plan, customer data can be used for model training by default.

What is the maximum file size for the async STT API?

The API accepts files up to 1,000MB in size with a maximum supported audio duration of 135 minutes per file.

What diarization technology does Gladia use, and is it available in real-time mode?

We run diarization on pyannoteAI's Precision-2 model, which delivers up to 3x lower DER than alternatives and better overlap handling. Diarization is only available in async (batch) workflows, not in real-time streaming mode, which is one of the primary architectural reasons to use async for post-meeting summarization.

How does WER at the STT layer affect LLM summary quality?

WER at the transcription layer sets a hard accuracy ceiling for every downstream component. A wrong word in the transcript becomes a wrong word in the LLM's input, and LLMs confidently generate plausible-sounding output from incorrect premises, producing hallucinations that are specific enough to look credible. High WER on conversational audio leads to misattributed action items, wrong names in CRM entries, and decisions in summaries that were never made. Our benchmark methodology shows Solaria-1 achieving up to 29% lower WER than alternatives on conversational speech, which directly narrows the window for downstream errors.

What is the all-in cost for 10,000 hours per month of async transcription?

On our Growth plan at $0.20/hr, 10,000 hours per month totals $2,000. This includes diarization (pyannoteAI Precision-2), translation, sentiment analysis, NER, summarization, custom vocabulary, and code-switching detection with no add-on fees.

What programming languages does the Gladia SDK support?

We provide official SDKs for Python and JavaScript/TypeScript. You can also integrate directly via REST API from any language that supports HTTP requests, since the API surface is standard JSON over HTTPS.

Can I process multiple audio files concurrently?

Yes, our API handles concurrent requests without pre-provisioning on your side. You can submit multiple transcription jobs simultaneously and we'll process them in parallel, which is how Aircall runs 1M+ calls/week through the same async pipeline.

Key terms glossary

Word error rate (WER): The standard metric for measuring STT accuracy, calculated by adding substitutions, deletions, and insertions, then dividing by the total reference word count. Lower WER directly correlates with fewer LLM hallucinations in downstream summaries.
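As an illustration of the formula, WER can be computed with a word-level Levenshtein distance; a minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Production WER tooling also normalizes casing, punctuation, and number formatting before scoring; this sketch compares raw whitespace-split words.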

Diarization error rate (DER): The metric measuring speaker attribution accuracy in audio, calculated as the percentage of audio time assigned to the wrong speaker. Accurate DER is critical for generating correct action item attributions in meeting summaries.

Code-switching: Mid-conversation alternation between two or more languages or dialects. Robust STT models detect and transcribe these shifts automatically without breaking the session or producing garbled output.

Async (batch) transcription: A transcription mode where the full audio file is uploaded and processed before output is returned. Batch mode provides full audio context, which improves accuracy, diarization, and multilingual consistency compared to real-time streaming.

Diarization: The process of segmenting audio by speaker identity, producing a transcript where each utterance is attributed to a specific speaker label. Required for accurate action item extraction and meeting intelligence in multi-speaker recordings.

Data residency: The geographic constraint on where audio data is stored and processed. Relevant for GDPR compliance and enterprise customer contracts that require data to remain within specific jurisdictions.
