
Building a meeting summarization pipeline: async STT + LLM in 5 steps

Published on Sep 25, 2026 by Ani Ghazaryan

Building a meeting summarization pipeline with async STT and LLM in 5 steps: audio ingestion, API integration, and prompt engineering.

TL;DR: Building a meeting summarization pipeline means pairing an async STT API with an LLM in a sequence where the transcription layer sets the accuracy ceiling for everything downstream. Async STT gives the model full audio context before producing output. This guide covers 5 concrete steps: configuring audio ingestion, integrating an async STT API via webhooks, validating diarized speech data, engineering LLM prompts, and formatting output. We built Solaria-1 to handle the STT layer with 100+ languages, true code-switching, and predictable per-hour pricing, with no model retraining on Growth and Enterprise plans, so you can ship to production in under a day.

Most engineering teams spend weeks tuning large language model (LLM) prompts for meeting summaries, only to discover that the real bottleneck is the transcription layer. If your STT fails on an accented speaker or drops a mid-sentence language switch, no amount of prompt engineering will fix the hallucinations that propagate into summaries, action items, and CRM entries. You'll find out when users start complaining about inaccurate meeting notes.

This guide breaks down how to build a production-ready meeting summarization pipeline using asynchronous STT and LLMs. We cover audio ingestion, transcript processing, prompt engineering, and output formatting, showing how to replace a fragmented infrastructure stack with a single API call.

Why async transcription matters for production summarization

When you're choosing between real-time and batch transcription for meeting summaries, the decision isn't primarily about user-facing latency. For post-meeting summaries, users accept a delay of seconds or minutes in exchange for output that's reliable. The real choice is about accuracy, diarization quality, and infrastructure TCO.

STT mode selection for your pipeline

Real-time streaming is the right tool for live captions and voice agent turn-detection. For meeting summaries, it forces a trade-off you don't need to make: partial transcripts lack the full audio context needed for accurate punctuation, word disambiguation, and speaker boundary detection.

Batch processing analyzes the complete recording before generating output, which directly improves all three. Here's how the two modes compare for this use case:

| Mode | Latency | Accuracy ceiling | Diarization | Best for |
| --- | --- | --- | --- | --- |
| Async (batch) | ~60s per hour of audio | Higher (full context) | pyannoteAI Precision-2 (async only) | Meeting summaries, post-call analytics, note-takers |
| Real-time (streaming) | ~300ms final transcript | Lower (partial context) | Not available (diarization is async-only) | Live captions, voice agent assist |

For meeting summarization, latency matters less than three other considerations:

  • Latency tolerance: A summary generated 60 seconds after the meeting ends is indistinguishable, from a user-experience standpoint, from one generated 10 minutes after.
  • WER and DER impact: Summary quality depends on transcription accuracy, not on how fast the transcription arrives.
  • User expectations: Post-meeting workflows naturally accommodate seconds or minutes of processing delay.

Async STT: cost and latency impact

Maintaining persistent WebSocket connections at scale adds infrastructure overhead that batch workflows don't require. With async, you POST audio to an endpoint, receive a webhook callback when processing completes, and handle the result. No connection management, no reconnection logic, no partial transcript stitching.

Pricing reflects this architectural simplicity. On our Growth plan, async runs as low as $0.20/hr versus $0.25/hr for real-time streaming. At 10,000 hours per month, that's $2,000 versus $2,500, before accounting for the infrastructure savings from webhook-driven over connection-held architectures.

Customers report saving 20%+ of DevOps sprint capacity after migrating off self-hosted models, capacity previously consumed by GPU provisioning, version management, and stability issues.

Selecting async for meeting summaries

The Claap case study puts this in concrete terms: one hour of video transcribed in under 60 seconds, at 1-3% WER in production across a multilingual international user base with varied accents and code-switching. That's the async advantage. Batch processing provides the full audio context before output, which is exactly what downstream LLM components depend on for reliable summaries.

Data flow in a meeting summarization pipeline

Before writing a line of integration code, map the data flow. A production pipeline has four stages, each of which can silently degrade the stages that follow it.

Preparing audio for transcription

Audio arrives from one of three sources in most meeting assistant architectures: a meeting bot SDK (bot-based platforms such as Recall or MeetingBaaS that join and record calendar meetings), a native recording integration, or a direct upload from the end user. The format, sample rate, and file size vary across all three, and your ingestion layer needs to normalize them before they reach the STT API.

Async STT API integration

The STT layer determines WER and DER for everything downstream. Your LLM prompts, CRM syncs, and coaching scorecards all inherit the accuracy of this layer, so getting it right matters. A wrong name in the transcript becomes a wrong name in the CRM entry. A misattributed speaker produces a coaching score built on the wrong agent's words.

We built Solaria-1 to cover 100+ languages with robust code-switching, and 42 of those languages aren't covered by any other API-level STT provider.

For meeting assistants serving multilingual teams, that's the difference between transcripts that hold up across your entire user base and ones that degrade quietly for non-English speakers. Our supported languages documentation lists the full coverage.

LLM for meeting summaries

The LLM stage consumes the structured JSON from the STT layer, specifically the diarized utterances with speaker labels and word-level timestamps. If those are accurate, the LLM can reliably extract action items, decisions, and attributions, but if they're not, the LLM will hallucinate in ways that look plausible enough to be worse than an obvious failure.

Storing pipeline outputs effectively

Store both the raw transcript JSON and the formatted LLM summary. Raw transcripts are your ground truth for debugging and reprocessing. Formatted summaries are what your frontend consumes. Use a structured store (Postgres, DynamoDB) for the JSON and a document store or object storage for the formatted output.

Step 1: Configure meeting audio source

Supported audio formats and conversion

Meeting bots and native recorders deliver audio in inconsistent formats. Before calling any STT API, you need to normalize them. The Gladia async API accepts:

  • Supported formats: WAV, M4A, FLAC, AAC
  • URL ingestion: Direct URLs pointing to hosted audio
  • Format conversion: If your meeting bot delivers audio in an unsupported format like WebM, convert using ffmpeg before ingestion

For example, converting WebM to WAV:

ffmpeg -i input.webm -ar 16000 -ac 1 output.wav

Mono 16kHz is the standard target for STT APIs and reduces file size without meaningful accuracy impact.
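For programmatic ingestion, the same conversion can be wrapped in a small helper. This is a sketch; the helper names and subprocess wrapper are ours, not part of any Gladia SDK:

```python
import subprocess

def build_ffmpeg_cmd(input_path: str, output_path: str,
                     sample_rate: int = 16000, channels: int = 1) -> list[str]:
    """Build the ffmpeg argument list for the mono 16kHz WAV conversion above."""
    return [
        "ffmpeg", "-y",            # overwrite output if it already exists
        "-i", input_path,          # source file (e.g. WebM from a meeting bot)
        "-ar", str(sample_rate),   # resample to 16 kHz
        "-ac", str(channels),      # downmix to mono
        output_path,
    ]

def convert_to_wav(input_path: str, output_path: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(input_path, output_path), check=True)
```

Keeping the command construction separate from execution makes the normalization step unit-testable without ffmpeg installed.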

File size and duration validation

Before sending audio to the API, validate against our limits: files must not exceed 1,000MB in size or 135 minutes in duration. Validate early and fail loudly so errors don't surface downstream at the LLM stage:

import os

MAX_FILE_SIZE_MB = 1000
MAX_DURATION_MINUTES = 135

def validate_audio(file_path: str, duration_minutes: float) -> None:
    """Fail loudly before the file ever reaches the STT API."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    if file_size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(f"File size {file_size_mb:.1f}MB exceeds 1000MB limit")
    if duration_minutes > MAX_DURATION_MINUTES:
        raise ValueError(f"Duration {duration_minutes}min exceeds 135-minute limit")

Set up storage and webhook events

Async workflows depend on webhooks for result delivery. Follow this pattern:

  1. Configure a webhook endpoint that can receive POST callbacks from the Gladia API.
  2. Store audio in object storage (S3, GCS) with a pre-signed URL you'll pass to the transcription request.
  3. Never poll for results: polling creates unnecessary load and delays result handling by reintroducing latency proportional to your polling interval.

Set up your audio upload endpoint

Your ingestion endpoint receives the audio file, validates it, stores it, and queues a transcription job. Keep it lightweight: validate the file, store it in object storage, enqueue the transcription job, and return 202 Accepted. The actual transcription work happens asynchronously.
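A minimal, framework-agnostic sketch of that pattern; the `store` and `queue` objects here are hypothetical stand-ins for your object-storage and job-queue clients:

```python
import uuid

MAX_FILE_SIZE_MB = 1000
MAX_DURATION_MINUTES = 135

def handle_upload(filename: str, size_mb: float, duration_min: float,
                  store, queue) -> tuple[int, dict]:
    """Validate, store, enqueue, and return (HTTP status, response body)."""
    # 1. Validate before doing any work
    if size_mb > MAX_FILE_SIZE_MB or duration_min > MAX_DURATION_MINUTES:
        return 400, {"error": "file exceeds 1,000MB / 135-minute limits"}
    # 2. Store in object storage (store wraps your S3/GCS client)
    audio_id = str(uuid.uuid4())
    audio_url = store.put(audio_id, filename)
    # 3. Enqueue the transcription job for async processing
    queue.enqueue({"audio_id": audio_id, "audio_url": audio_url})
    # 4. Return 202 Accepted -- the heavy work happens later
    return 202, {"job": audio_id, "status": "queued"}
```

Wire this into whatever web framework you already run; only the four-step shape matters.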

Step 2: Enable fast STT for meeting summaries

API key authentication for STT

API key handling is where most production integrations leak credentials. Pass your Gladia API key in the x-gladia-key header on every request, and never embed keys in client-side code or commit them to version control. Use environment variables or a secrets manager instead.

Initiating async transcription requests

Structure the payload to enable diarization and set realistic speaker count bounds. The audio intelligence features documentation covers all available parameters.

Async result handling: polling vs. webhooks

Webhooks are the correct pattern for async workflows because they eliminate the polling loop that would otherwise run every few seconds checking whether transcription is complete. Configure callback_url in your request payload, and your webhook handler processes the result when it arrives. This keeps your infrastructure costs proportional to actual usage and removes polling latency from your end-to-end time.

Sending async STT API requests

Here's a Python example that sends an async transcription request with diarization and a webhook callback enabled:

import requests
import os

GLADIA_API_KEY = os.environ["GLADIA_API_KEY"]
GLADIA_API_URL = "https://api.gladia.io/v2/pre-recorded"

def initiate_transcription(audio_url: str, webhook_url: str) -> str:
    payload = {
        "audio_url": audio_url,
        "diarization": True,           # Enable speaker attribution
        "diarization_config": {
            "min_speakers": 1,
            "max_speakers": 8          # Adjust based on your meeting size
        },
        "detect_language": True,       # Auto-detect and handle code-switching
        "callback_url": webhook_url,
        "callback_config": {
            "on_complete": True        # Webhook fires when transcription finishes
        }
    }

    headers = {
        "x-gladia-key": GLADIA_API_KEY,
        "Content-Type": "application/json"
    }

    response = requests.post(GLADIA_API_URL, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()["id"]  # Store this for result correlation

The API returns a job ID immediately. Your webhook receives the full transcript JSON when processing completes, typically within 60 seconds for a one-hour recording.
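The receiving side of that callback can be sketched as follows. The payload field names here are illustrative, so check the API reference for the exact callback schema; `job_store` and `summarize` are hypothetical stand-ins for your persistence layer and downstream LLM step:

```python
def handle_transcription_webhook(payload: dict, job_store, summarize) -> int:
    """Process a completed-transcription callback; returns an HTTP status code."""
    job_id = payload.get("id")
    if not job_id:
        return 400  # malformed callback
    if job_store.already_processed(job_id):
        return 200  # idempotent: duplicate delivery, acknowledged without rework
    # Field names below are assumptions for illustration, not the exact schema
    transcript = payload.get("payload", {}).get("transcription", {})
    job_store.save_raw(job_id, transcript)   # raw JSON is your ground truth
    summarize(job_id, transcript)            # hand off to the LLM stage
    job_store.mark_processed(job_id)
    return 200
```

Returning 200 on duplicates matters: webhook senders typically retry non-2xx responses, so a duplicate must be acknowledged, not rejected.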

Step 3: Validate speech data for LLM accuracy

Validating speaker attribution

Diarization is the process of identifying which speaker said what and when. For meeting summaries, it's what makes the difference between "someone said we need to move the deadline" and "Sarah said we need to move the deadline, and James agreed."

We run diarization on pyannoteAI's Precision-2 model, which delivers up to 3x lower DER than alternatives, better overlap handling, and cross-language consistency. This is an async-only feature, which is another reason async is the correct architecture for post-meeting summarization.

Precise timestamps for LLM input

Word-level timestamps serve two purposes in a meeting pipeline. First, they enable chapterization (breaking the transcript into logical segments by topic).

Second, they allow the LLM to generate citations that link back to the exact moment in the recording when a decision was made. Extract utterances with their speaker labels and time boundaries before passing to the LLM:

def extract_diarized_utterances(transcript_response: dict) -> list[dict]:
    utterances = []
    for utterance in transcript_response.get("utterances", []):
        utterances.append({
            "speaker": utterance["speaker"],
            "start": utterance["start"],
            "end": utterance["end"],
            "text": utterance["transcript"]
        })
    return utterances

Implementing robust retry strategies

Network failures and API timeouts happen in production. Build your webhook handler to be idempotent, so processing the same result twice doesn't create duplicate summaries, and configure your job queue for at-least-once delivery with exponential backoff. Log the job ID we return in the initial transcription response so your retry logic can check whether the original job completed before re-submitting, rather than re-queuing jobs that already finished.
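Both pieces, a capped exponential backoff schedule and an idempotent wrapper around the result handler, can be sketched in a few lines (the class and function names are ours):

```python
def backoff_schedule(base_s: float = 1.0, factor: float = 2.0,
                     max_retries: int = 5, cap_s: float = 60.0) -> list[float]:
    """Capped exponential backoff delays for at-least-once job retries."""
    return [min(base_s * factor ** i, cap_s) for i in range(max_retries)]

class IdempotentProcessor:
    """Wraps a result handler so duplicate webhook deliveries become no-ops."""

    def __init__(self, handler):
        self.handler = handler
        self.seen: set[str] = set()  # in production, back this with your database

    def process(self, job_id: str, payload: dict) -> bool:
        if job_id in self.seen:
            return False  # duplicate delivery: already handled, skip safely
        self.handler(job_id, payload)
        self.seen.add(job_id)
        return True
```

The in-memory set is only for illustration; a real deployment records processed job IDs in durable storage so idempotency survives restarts.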

Core transcript processing code

Parse the webhook payload and build the formatted string your LLM prompt expects:

def format_transcript_for_llm(utterances: list[dict]) -> str:
    lines = []
    for u in utterances:
        timestamp = f"[{u['start']:.1f}s - {u['end']:.1f}s]"
        lines.append(f"[{u['speaker']}] {timestamp}: {u['text']}")
    return "\n".join(lines)

Step 4: Engineer prompts for meeting summaries

LLM context for accurate summaries

The diarized transcript is your LLM's primary input. Inject it as the user message and constrain the model's output structure via the system prompt. Do not ask the LLM to infer speaker identity from context. If diarization did its job, speaker attribution is already in the transcript, and the LLM should use it directly rather than guessing.

Reusable meeting summary prompts

Design your system prompt to constrain the LLM's output structure so you can parse it reliably. Here's a concrete example that produces consistent, structured JSON output:

SYSTEM_PROMPT = """
You are a meeting analyst. You receive transcripts with speaker labels and timestamps.
Extract the following from each transcript:
- A 3-5 sentence summary of the meeting
- All action items, each with an assignee if identifiable and a due date if mentioned
- All decisions made
- Open questions that were raised but not resolved

Return JSON only. Use this schema:
{
  "summary": "string",
  "action_items": [{"assignee": "string", "task": "string", "due": "string or null"}],
  "decisions": ["string"],
  "open_questions": ["string"]
}
"""

This schema gives you typed data structures you can insert directly into a database without fragile string parsing.

Handling long transcripts with chunking

Most LLM context windows handle 90-minute meetings comfortably in 2026, but for longer recordings or higher speaker counts, chunk by topic using the chapterization timestamps from our summarization feature.

Process each chapter independently and merge the outputs, which also improves summary granularity because the LLM operates on a focused segment rather than a 10,000-token wall of text.
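The merge step can be as simple as concatenating the chapter summaries and pooling the extracted lists. A minimal sketch, assuming each chapter was summarized with the JSON schema shown earlier:

```python
def merge_chapter_summaries(chapters: list[dict]) -> dict:
    """Merge per-chapter LLM outputs into a single meeting-level summary dict."""
    merged = {"summary": "", "action_items": [],
              "decisions": [], "open_questions": []}
    parts = []
    for ch in chapters:
        parts.append(ch["summary"])                       # keep chapter order
        merged["action_items"].extend(ch["action_items"])
        merged["decisions"].extend(ch["decisions"])
        merged["open_questions"].extend(ch["open_questions"])
    merged["summary"] = " ".join(parts)
    return merged
```

For long meetings you might add a final LLM pass that condenses the concatenated chapter summaries, but simple concatenation preserves every extracted item.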

LLM summary API call example

import anthropic
import json

def generate_meeting_summary(transcript: str) -> dict:
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-opus-4-6",  # Released February 2026, recommended for complex extraction
        max_tokens=2048,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": transcript}]
    )

    # Assumes the model returns bare JSON per the system prompt; in production,
    # catch json.JSONDecodeError and retry or strip stray formatting first.
    return json.loads(response.content[0].text)

Step 5: Deliver actionable meeting insights

Organizing LLM summary output

The JSON output from your LLM call is your internal representation. Don't expose raw JSON to the frontend. Translate it to the format each consumer expects: Markdown for in-app note views, a structured payload for CRM webhooks, and a summary string for email notifications.
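A minimal renderer for the in-app Markdown view might look like this sketch (the heading layout is an assumption; adapt it to your frontend):

```python
def summary_to_markdown(summary: dict) -> str:
    """Render the internal summary dict as Markdown for in-app note views."""
    lines = ["## Meeting summary", "", summary["summary"], "", "### Action items"]
    for item in summary["action_items"]:
        due = f" (due {item['due']})" if item.get("due") else ""
        lines.append(f"- **{item['assignee']}**: {item['task']}{due}")
    lines.append("")
    lines.append("### Decisions")
    lines.extend(f"- {d}" for d in summary["decisions"])
    lines.append("")
    lines.append("### Open questions")
    lines.extend(f"- {q}" for q in summary["open_questions"])
    return "\n".join(lines)
```

The same internal dict feeds a different renderer per consumer: this one for notes, a flat payload for CRM webhooks, a one-line string for email subjects.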

Pinpointing meeting action items

Our Named Entity Recognition (NER) runs on the transcript and tags entities like names, organizations, and dates before the LLM stage. Use these tags to cross-reference assignees in your action item extraction. If NER identifies "Marcus" as a person entity and the LLM extracts an action item for "Marcus," you have a high-confidence match to a specific contact record.
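The cross-reference itself can be a simple set lookup. This sketch assumes your NER stage hands you a flat list of person names; the function name and exact-match heuristic are ours, and production matching would handle nicknames and partial names:

```python
def match_assignees(action_items: list[dict],
                    person_entities: list[str]) -> list[dict]:
    """Flag action items whose assignee matches an NER-tagged person entity."""
    known = {name.lower() for name in person_entities}
    results = []
    for item in action_items:
        assignee = (item.get("assignee") or "").lower()
        # high_confidence means the LLM's assignee matches a tagged person
        results.append({**item, "high_confidence": assignee in known})
    return results
```

Low-confidence items are the ones worth surfacing for human review before they sync to a contact record.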

Output format options: JSON, Markdown, PDF

Match your output format to the consuming system:

  • JSON: CRM webhooks, internal APIs, database storage
  • Markdown: In-app note views, Notion, Confluence
  • PDF: Compliance archives, regulated industries requiring immutable records

Code for output delivery pipeline

from datetime import datetime, timezone
import json

def save_meeting_summary(meeting_id: str, summary: dict, db_client) -> None:
    db_client.table("meeting_summaries").upsert({
        "meeting_id": meeting_id,
        "summary": summary["summary"],
        "action_items": json.dumps(summary["action_items"]),
        "decisions": json.dumps(summary["decisions"]),
        "open_questions": json.dumps(summary["open_questions"]),
        "processed_at": datetime.now(timezone.utc).isoformat()  # utcnow() is deprecated
    })

Operationalizing your meeting AI solution

Choosing queue vs. serverless for your pipeline

Meetings tend to end at the top of the hour, which can produce large bursts of simultaneous transcription jobs. A job queue (SQS, Pub/Sub, RabbitMQ) absorbs these spikes without over-provisioning. Serverless execution (Lambda, Cloud Run) works well for the LLM stage, where execution time is bounded and predictable. The two patterns complement each other: queue for ingestion and transcription job management, serverless for summary generation on each completed transcript.

Production latency for STT/LLM pipeline

Set user-facing expectations based on realistic end-to-end timings. For a one-hour meeting:

  • Async transcription: ~60 seconds
  • LLM summary generation: Varies by model and transcript length
  • Total end-to-end: Typically under 2 minutes from meeting end to available summary

That's well within what users accept for post-meeting notes, where the expectation is "summary appears before I switch contexts" rather than "summary appears instantly." For common mistakes that affect this latency, see our implementation guide.

Optimizing summarization pipeline costs

Model your costs at realistic volume before committing to a plan. On our Growth tier:

  • 10,000 hours/month of async transcription at $0.20/hr = $2,000/month
  • All audio intelligence features included: diarization, NER, translation, sentiment, custom vocabulary
  • No add-on fees for individual features, which eliminates the invoice surprises common with per-feature metering

Compare this to stacking providers: one for transcription, one for diarization, one for NER. Each adds per-feature cost, a new API contract, and another integration point where data can degrade. Our pricing page shows current per-hour rates for each plan.

Retry strategies for pipeline reliability

Every transcription job should have a dead-letter queue for failed or timed-out jobs. Log the job ID we return at initiation so your retry logic can check whether the original job completed before re-submitting.

Our infrastructure runs at 99.9%+ uptime, which you can verify on our status page, but your retry logic should be robust regardless of provider.

Addressing common pipeline implementation challenges

Optimizing async STT latency

We process approximately one hour of audio in under a minute in batch mode. For a 30-minute team standup, expect results in under 30 seconds. For a 90-minute design review, under 90 seconds.

These numbers hold at scale because we spin up parallel processing capacity without pre-provisioning on your side, so there's no capacity planning or burst quotas to manage.

Forecasting pipeline costs per hour

Build your cost model at three volume points: current, 5x, and 10x. Use this as a starting framework:

Growth plan cost model: async transcription at $0.20/hr

| Volume | Async hours/month | Cost at $0.20/hr | Features included |
| --- | --- | --- | --- |
| 1,000 hrs/mo | 1,000 | $200 | Diarization, NER, translation, sentiment, custom vocabulary |
| 5,000 hrs/mo | 5,000 | $1,000 | All above |
| 10,000 hrs/mo | 10,000 | $2,000 | All above |

The STT layer stays linear because pricing is per hour of audio. Add your LLM costs separately based on the token volume your transcript density generates and the model you choose.

Building multilingual STT pipelines

For teams serving multilingual users, the transcription layer is where language support either holds or breaks. We built Solaria-1 to cover 100+ languages, with up to 29% lower WER than alternatives on conversational speech (a figure validated across 8 providers, 7 datasets, and 74+ hours of audio, as detailed in our benchmark methodology) and true code-switching support for mid-conversation language changes.

That 42-language coverage advantage over competing APIs matters specifically for CCaaS and meeting assistant teams with users in Southeast Asia, South Asia, and Latin America.

On-premise pipeline: build vs. managed?

The honest TCO calculation for self-hosting an open-source STT model at production scale includes GPU compute, DevOps engineering time for provisioning and version management, a separate diarization provider (since most open-source STT models don't include production-grade speaker attribution), and the WER gap.

Based on our benchmark testing across 74+ hours of audio, open-source STT models in our comparison averaged over 10% WER on conversational speech, which means transcription errors compound into every downstream system your LLM touches.

Here's how the trade-offs break down:

| Factor | Self-hosted open-source STT | Gladia managed API (Growth) |
| --- | --- | --- |
| Licensing cost | $0 | $0.20/hr (async) |
| GPU compute (1,000 hrs/mo) | Variable, dedicated GPU required | Included |
| DevOps engineering overhead | 20%+ of sprint capacity | $0 |
| Diarization | Separate provider or none | Included (pyannoteAI Precision-2) |
| WER on conversational speech | 10%+ (benchmark avg.) | Up to 29% lower than alternatives |
| SOC 2 Type II, GDPR compliance | Build and certify yourself | Included |
| Data retraining risk | Depends on provider | Never on Growth/Enterprise |

The compliance column is where self-hosting tends to be underestimated. If your product handles regulated conversations, achieving SOC 2 Type II certification for a self-hosted audio pipeline requires significant engineering investment that rarely appears in initial build-vs-buy estimates.

On Growth and Enterprise plans, your audio data is never used to retrain our models. No opt-out is required, and no contract clause needs to be found. On the Starter plan, data is used for training by default.

For teams evaluating this decision, the AI note-taker architecture guide walks through the full stack comparison with production configurations.

And for teams using async transcription to power downstream sales intelligence workflows, the Attention x Gladia integration webinar shows how the same async pipeline powers CRM population, coaching scorecards, and conversation analytics from a single transcription call.

The async STT layer is the foundation your meeting AI product stands on. Get it right and every downstream component inherits that accuracy: summaries, action items, CRM entries, coaching scores. Get it wrong and you're debugging LLM hallucinations that originate in the transcription layer, which is the harder problem to track down because the failure mode looks like a prompt engineering issue rather than a data quality issue.

Test Gladia on your own multilingual audio to see how it handles language detection, accent-heavy speech, and code-switching. Start with 10 free hours and have your integration in production in less than a day.

FAQs

How long does async transcription take for a 60-minute meeting?

We process approximately one hour of audio in under a minute in batch mode, so a 60-minute recording typically completes in under 60 seconds. Processing time scales efficiently with audio duration rather than increasing disproportionately for longer files.

Does Gladia use my meeting audio to train its models?

On Growth and Enterprise plans, your audio data is never used for model training and no opt-out action is required. On the Starter plan, customer data can be used for model training by default.

What is the maximum file size for the async STT API?

The API accepts files up to 1,000MB in size with a maximum supported audio duration of 135 minutes per file.

What diarization technology does Gladia use, and is it available in real-time mode?

We run diarization on pyannoteAI's Precision-2 model, which delivers up to 3x lower DER than alternatives and better overlap handling. Diarization is only available in async (batch) workflows, not in real-time streaming mode, which is one of the primary architectural reasons to use async for post-meeting summarization.

How does WER at the STT layer affect LLM summary quality?

WER at the transcription layer sets a hard accuracy ceiling for every downstream component. A wrong word in the transcript becomes a wrong word in the LLM's input, and LLMs confidently generate plausible-sounding output from incorrect premises, producing hallucinations that are specific enough to look credible. High WER on conversational audio leads to misattributed action items, wrong names in CRM entries, and decisions in summaries that were never made. Our benchmark methodology shows Solaria-1 achieving up to 29% lower WER than alternatives on conversational speech, which directly narrows the window for downstream errors.

What is the all-in cost for 10,000 hours per month of async transcription?

On our Growth plan at $0.20/hr, 10,000 hours per month totals $2,000. This includes diarization (pyannoteAI Precision-2), translation, sentiment analysis, NER, summarization, custom vocabulary, and code-switching detection with no add-on fees.

What programming languages does the Gladia SDK support?

We provide official SDKs for Python and JavaScript/TypeScript. You can also integrate directly via REST API from any language that supports HTTP requests, since the API surface is standard JSON over HTTPS.

Can I process multiple audio files concurrently?

Yes, our API handles concurrent requests without pre-provisioning on your side. You can submit multiple transcription jobs simultaneously and we'll process them in parallel, which is how Aircall runs 1M+ calls/week through the same async pipeline.

Key terms glossary

Word error rate (WER): The standard metric for measuring STT accuracy, calculated by adding substitutions, deletions, and insertions, then dividing by the total reference word count. Lower WER directly correlates with fewer LLM hallucinations in downstream summaries.
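As an illustration of the formula, WER can be computed with a word-level Levenshtein distance; a minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Production WER tooling also normalizes casing, punctuation, and number formatting before scoring; this sketch compares raw whitespace-split words.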

Diarization error rate (DER): The metric measuring speaker attribution accuracy in audio, calculated as the percentage of audio time assigned to the wrong speaker. Accurate DER is critical for generating correct action item attributions in meeting summaries.

Code-switching: Mid-conversation alternation between two or more languages or dialects. Robust STT models detect and transcribe these shifts automatically without breaking the session or producing garbled output.

Async (batch) transcription: A transcription mode where the full audio file is uploaded and processed before output is returned. Batch mode provides full audio context, which improves accuracy, diarization, and multilingual consistency compared to real-time streaming.

Diarization: The process of segmenting audio by speaker identity, producing a transcript where each utterance is attributed to a specific speaker label. Required for accurate action item extraction and meeting intelligence in multi-speaker recordings.

Data residency: The geographic constraint on where audio data is stored and processed. Relevant for GDPR compliance and enterprise customer contracts that require data to remain within specific jurisdictions.
