


Published on April 17, 2026
by Ani Ghazaryan
Real-time latency for meeting transcription: latency budgets and live note-taking requirements

Real-time latency for meeting transcription requires measuring end-to-end delays across audio chunking, network routing, and rendering.

TL;DR: For final post-meeting notes, async transcription delivers higher accuracy, reliable speaker attribution via batch diarization, and lower cost per hour than streaming. For the live UX layer during calls, end-to-end latency is not just STT inference speed; it covers audio chunking, network routing, model processing, and display rendering combined. Network lag and client-side rendering consume your transcription budget before users see a word. Keep end-to-end latency under 500ms for perceived responsiveness in live note-taking. We provide both workflows from one API, removing the integration overhead of stitching together separate providers.

A 300ms transcription budget disappears into network lag and client-side rendering before a user sees a single word. Most teams obsess over a provider's inference speed and ignore where latency accumulates: chunking strategy, network routing, and rendering delays that have nothing to do with the model.

This guide gives product and engineering teams a framework to measure the full latency budget, make deliberate trade-offs between real-time UX and async accuracy, and model the cost implications of each architectural choice.

What is transcription latency and why does it matter?

Transcription latency is the total elapsed time between a speaker finishing a phrase and text appearing on screen. Most teams measure only STT inference time, but that single number represents one stage in a five-stage pipeline where every component adds delay.

Drivers of live transcription latency

The end-to-end pipeline breaks into five quantifiable components:

  • Client audio capture and chunking: 50-250ms from buffer accumulation before audio is sent
  • Network routing (client to provider): 50-200ms depending on geographic distance to the provider cluster
  • STT model inference: 100-300ms for streaming-optimized architectures
  • Post-processing: 50-100ms for punctuation and normalization
  • Network return and client rendering: 50-100ms for final delivery to the display

Each stage adds independently. A provider with 150ms inference time sitting behind 200ms of network overhead and 100ms of client rendering produces 450ms of total delay, and that's before accounting for chunking. Understanding each component lets you diagnose where delay accumulates rather than treating the whole pipeline as a black box.
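The budget arithmetic above can be sketched as a quick sanity check. The stage values are the illustrative figures from this section, not measurements:

```javascript
// Illustrative stage delays from the example above: 150ms inference
// behind 200ms of network overhead and 100ms of client rendering.
const stages = {
  networkOverheadMs: 200,
  sttInferenceMs: 150,
  clientRenderingMs: 100,
};

// Stages add independently, so the user-visible delay is the plain sum.
const totalMs = Object.values(stages).reduce((sum, ms) => sum + ms, 0);

console.log(`Total before chunking: ${totalMs}ms`); // 450ms
// Add a typical 150ms chunking buffer and the 500ms budget is gone.
console.log(`With 150ms chunking: ${totalMs + 150}ms`); // 600ms
```

Running the same sum against your own measured stage timings tells you immediately which component to attack first.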

Table 1: Latency budget breakdown

| Component | Estimated delay (ms) | Optimization strategy |
|---|---|---|
| Client audio capture and chunking | 50-250 | Use 100-200ms chunk sizes, minimize jitter buffer |
| Network routing (client to provider) | 50-200 | Select provider region closest to your user base |
| STT model inference | 100-300 | Use streaming-optimized models |
| Post-processing (punctuation, normalization) | 50-100 | Defer non-critical enrichment to the async pass |
| Network return and client rendering | 50-100 | Apply optimistic UI updates for interim tokens |
| Total (typical range) | 300-950 | |

The 500ms budget for perceived responsiveness

The ITU-T G.114 recommendation, which addresses one-way delay in voice telephony, specifies under 150ms for high-quality call experience and provides a reference point for real-time communication latency tolerance more broadly. For live note-taking, the threshold is more forgiving: Nielsen Norman Group research places the limit for flow-state UX at around 1 second, with 500ms as the point where users begin to notice delay. Beyond 500ms for partial transcripts, users stop reading the live text and wait for the meeting to end, which defeats the purpose of real-time transcription entirely.

How latency impacts live note-taking UX

The UX cost of latency is not linear. A 200ms delay is barely perceptible. A 700ms delay is clearly noticeable. A 1,200ms delay breaks the conversational loop, because the transcript of what was said five exchanges ago arrives as the current speaker starts a new thought.

Typing-to-display latency and meeting participation

The RAIL performance model defines user interactions as requiring feedback within 100ms to feel instantaneous. Live transcription operates in the same perceptual range: partial transcripts arriving under 300ms feel like the system is keeping pace with speech. When captions lag, participants stop reading mid-sentence and redirect attention to the speaker directly, missing the benefit of having a live note-taker. This effect compounds in multilingual meetings, where participants rely on live transcription as a comprehension aid. The multilingual meeting transcription guide covers how latency and non-English accuracy interact in production.

Why users disengage

Users disengage from a live note-taker for two reasons: output arrives too late to be useful, or it changes too frequently to be trusted. Both stem from the same architectural cause. Streaming systems that push interim results too aggressively produce frequent corrections that feel unstable. Systems that wait for high confidence produce output too slowly to be useful live. The practical answer is a hybrid: display interim results with visual differentiation and overwrite with final stable output once the model has sufficient context. The meeting bot speech recognition guide describes this as standard practice in production meeting assistants.

Evaluating live note-taking speed

Consistent measurement is the prerequisite for comparing providers or diagnosing your own pipeline. Without structure, two engineers benchmarking the same provider under different network conditions will report different numbers and draw wrong conclusions.

Measuring time to first transcript token

Time to first transcript token (TTFT) is the elapsed time from speech start to the first partial transcript token arriving at the client. To measure it accurately:

  1. Timestamp when audio capture starts (or the first VAD trigger).
  2. Timestamp when the first WebSocket message containing transcript text arrives at the client.
  3. Calculate the delta and repeat across multiple utterances of varying length.
  4. Report P50 (median), P95 (tail latency early warning), and P99 (worst-case) separately.

Conflating these into a single average hides tail behavior. A P50 of 280ms with a P99 of 1,400ms means the slowest 1% of utterances take nearly 5x longer than the median, a tail worth monitoring even when most sessions feel responsive.
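The four measurement steps reduce to a percentile calculation over logged deltas. A minimal sketch using nearest-rank percentiles (the sample values are invented for illustration):

```javascript
// Nearest-rank percentile over raw TTFT samples in milliseconds.
// (Nearest-rank picks an actual sample; it does not interpolate.)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const ttftSamples = [240, 260, 280, 290, 310, 320, 350, 420, 900, 1400];

console.log(`P50: ${percentile(ttftSamples, 50)}ms`); // 310
console.log(`P95: ${percentile(ttftSamples, 95)}ms`); // 1400
console.log(`P99: ${percentile(ttftSamples, 99)}ms`); // 1400
```

Note how the average of these samples (~477ms) would hide the fact that the slowest sessions take over a second.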

Measuring interim, final, and network delays

Interim transcripts are unfinished hypotheses. Final transcripts are settled results after endpointing. Log both timestamps: when the first byte of interim text arrives, and when the final segment completes. These two numbers tell you whether a provider suits live conversations or belongs in offline workflows.

To isolate provider inference time from network latency, measure round-trip time on your WebSocket connection before sending any audio payload. Test against US-based and EU-based provider endpoints separately, because intercontinental routing adds significant overhead that inflates apparent provider latency. We provide EU-west and US-west clusters so you can select the region closest to your user base to minimize the network contribution to your total budget.

Transcription provider API latency

Once you've isolated network RTT, the remaining delay comes from provider-side processing: VAD, chunking, inference, and post-processing. Send a fixed-length, pre-recorded audio payload over WebSocket and timestamp from send to first response. Our WebSocket API uses a straightforward connection model for real-time streaming. Here's a minimal example from our official live transcription documentation:

const WebSocket = require("ws");

// Open a streaming connection to the live transcription endpoint.
const ws = new WebSocket("wss://api.gladia.io/audio/text/audio-transcription");

ws.on("open", () => {
  // The first frame configures the session; audio frames follow it.
  ws.send(JSON.stringify({
    x_gladia_key: "YOUR_GLADIA_API_KEY",
    language_behaviour: "automatic multiple languages",
    sample_rate: 16000, // 16kHz mono, as recommended later in this guide
  }));
});

ws.on("message", (data) => {
  const response = JSON.parse(data);
  if (response.event === "transcript" && response.transcription) {
    // Each transcript event carries its text, a confidence score,
    // and a type field separating partial from final results.
    const { transcript, confidence, type } = response.transcription;
    console.log(`[${type}] ${transcript} (confidence: ${confidence})`);
  }
});

Watch the real-time React app walkthrough or the no-code playground walkthrough to see latency behavior before writing integration code.

At Gladia, we target ~300ms final latency for real-time workflows with Solaria-1. Benchmark this against your specific audio conditions rather than treating it as a guarantee, because messy real-world audio increases processing time at every stage.

Audio conditions for robust real-time transcription

Provider benchmark numbers are typically collected on clean, single-speaker audio. Your users are on noisy Wi-Fi calls with laptop microphones, background audio, and overlapping colleagues.

Noise, speaker overlap, and diarization

Background noise forces streaming models to process longer audio windows before reaching sufficient confidence to emit partial tokens, which directly increases TTFT. A model that hits 280ms TTFT on a clean recording may exceed 500ms on a typical office call. Test providers on audio that matches your actual call conditions.

Our async benchmark evaluates Solaria-1 against 8 providers across 7 datasets and 74+ hours of audio, including conversational and noisy conditions. Solaria-1 achieves up to 29% lower WER than alternatives on conversational speech under those conditions.

Real-time speaker diarization is architecturally compromised. Streaming systems assign speaker labels as audio arrives, and those decisions are permanent. There's no going back to correct an early misattribution the way a batch system can after seeing the full recording.

The accurate solution is a hybrid: use real-time transcription for the live UX and run a full async diarization pass after the call ends. We provide speaker diarization powered by pyannoteAI's Precision-2, a dedicated diarization model trained on multi-speaker conversational audio, in async workflows only. This is intentional: full-recording context is what makes diarization reliable. The Gladia x pyannoteAI webinar walks through the architecture and accuracy trade-offs in detail.

Mixed-language meetings and chunk size

Language identification in real-time adds latency because the model needs sufficient audio context to detect which language is being spoken. When speakers switch languages mid-sentence, a system not built for code-switching either fails silently or produces garbled output. We handle mid-conversation code-switching across all 100+ supported languages with Solaria-1, including 42 not covered by any other API.

Chunk size controls the latency-accuracy trade-off directly. Streaming buffer sizes between 100ms and 250ms provide the right balance: low enough latency for responsive partial transcripts, large enough context for stable word boundaries. Start at 200ms and tune downward if your P95 TTFT stays inside your budget. The code-switching in ASR guide covers how language detection and chunk size interact at the model level.
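Chunk duration maps directly to buffer size for a given audio format. A quick sketch of that mapping for 16kHz mono 16-bit PCM, the format recommended later in this guide:

```javascript
// Bytes per streaming chunk for 16kHz mono 16-bit PCM.
const SAMPLE_RATE = 16000;  // samples per second
const BYTES_PER_SAMPLE = 2; // 16-bit PCM

function chunkBytes(chunkMs) {
  // Integer math first, then divide, to avoid float rounding.
  return (SAMPLE_RATE * BYTES_PER_SAMPLE * chunkMs) / 1000;
}

console.log(chunkBytes(100)); // 3200 bytes
console.log(chunkBytes(200)); // 6400 bytes
```

Halving the chunk size halves the buffering delay but also halves the audio context the model sees per frame, which is exactly the trade-off described above.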

Comparing real-time transcription API latency

Live vs. asynchronous transcription latency

The core trade-off between real-time and async determines which workflow fits each use case.

Table 2: Real-time vs. async trade-offs

| Workflow | Average latency | WER impact | Primary use case |
|---|---|---|---|
| Real-time streaming | ~300ms final transcript | Higher WER vs. async (partial audio context only) | Live captions, real-time notes during call |
| Async (batch) | ~60s per hour of audio | Baseline WER, full context used | Post-meeting summaries, CRM sync, coaching |

Async transcription processes the full recording before producing output, giving the model both past and future context for every word. That context is what drives the WER advantage. The standard architecture for meeting assistants is real-time for the live UX and async for the final notes users rely on. We separate the streaming inference path from the full-context async path with Solaria-1 rather than applying the same model to both, as covered in the end-to-end voice agents webinar.

Steps for consistent latency testing

  1. Fix audio conditions: Use three test sets: clean single-speaker, noisy multi-speaker, and a bilingual call sample.
  2. Control network variables: Test from a server in the same region as your target users, not from your development machine.
  3. Measure P50 and P95 separately: A provider with excellent P50 but bad P95 produces frequent complaints from a minority of high-value sessions.
  4. Include all pipeline stages: Capture timestamps from audio start to client render, not just from send to first API response.

Regional latency and data residency

Network routing from a user in London to a US-only provider cluster adds significant round-trip delay before a single byte of inference happens. For EU users, that overhead can push a comfortable 350ms pipeline into a noticeable 500ms on geography alone. We operate multi-region infrastructure with data residency controls configurable to your geographic footprint, meeting both compliance requirements and latency targets. On Growth and Enterprise plans, we never use audio data for model retraining and no opt-out is required. On the Starter plan, customer data is used for training by default.

Delivering responsive live note-taking

Managing audio input and optimizing perceived latency

Start with the source. Noisy, low-bitrate audio forces the model to work harder before emitting a confident token:

  • Use 16kHz mono PCM as your default format to minimize encoding overhead.
  • Apply client-side VAD to avoid sending silence, which wastes bandwidth and confuses endpointing.
  • Keep chunk size between 100ms and 200ms.

The UX can feel faster than the pipeline is. Three techniques production teams use consistently:

  1. Optimistic interim display: Show partial transcripts as they arrive, styled in lighter text to signal instability. Users read ahead even when words are marked provisional.
  2. Stable token anchoring: Once a word appears in two consecutive interim results, treat it as stable and apply full styling to eliminate flicker.
  3. Typing indicator fallback: If TTFT is elevated, show a typing animation rather than a blank screen. A typing indicator signals work in progress rather than failure.
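The stable-token-anchoring technique in step 2 can be sketched as a prefix comparison between consecutive interim results. This is a simplified word-level version; production systems typically also factor in provider confidence scores:

```javascript
// Count how many leading words match between two consecutive interim
// results; those words can be rendered with full (stable) styling,
// while the remainder keeps provisional styling.
function stableWordCount(previousInterim, currentInterim) {
  const prev = previousInterim.split(" ");
  const curr = currentInterim.split(" ");
  let stable = 0;
  while (
    stable < prev.length &&
    stable < curr.length &&
    prev[stable] === curr[stable]
  ) {
    stable++;
  }
  return stable;
}

console.log(stableWordCount("keep end to end latency",
                            "keep end to end latency under")); // 5
```

Because only the unstable suffix re-renders, the visible flicker is confined to the last few words instead of the whole line.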

Provider selection based on latency and cost profiles

For a product processing 10,000 hours of meeting audio per month, the cost model matters as much as the latency benchmark. Our per-hour pricing runs as low as $0.20/hr for async and $0.25/hr for real-time on Growth plans, with all audio intelligence features included in the base rate: diarization, NER, sentiment analysis, translation, and summarization. No add-on fees apply. Build your cost model at 10x and 100x current volume before committing. Providers that charge a base rate and then stack separate fees for each intelligence feature can surprise finance teams at scale.
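A rough sketch of the volume modeling suggested above, using the quoted Growth-plan rates and an assumed 50/50 split between async and real-time hours (adjust the split to your own traffic):

```javascript
// Growth-plan per-hour rates quoted above.
const ASYNC_RATE = 0.20;    // $/hr
const REALTIME_RATE = 0.25; // $/hr

// Monthly cost for a given volume, with asyncShare of hours going
// through the async pipeline and the rest through real-time.
function monthlyCost(hours, asyncShare = 0.5) {
  return hours * asyncShare * ASYNC_RATE +
         hours * (1 - asyncShare) * REALTIME_RATE;
}

// Model current volume, then 10x and 100x before committing.
for (const hours of [10_000, 100_000, 1_000_000]) {
  console.log(`${hours} hrs/mo: $${monthlyCost(hours).toFixed(2)}`);
}
```

At 10,000 hours per month and a 50/50 split this works out to roughly $2,250/mo; the point of the exercise is that the number scales linearly here, with no per-feature add-ons to model separately.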

"Gladia provides a highly accurate real-time speech-to-text solution for high volumes of support and service calls. Latency is low and accuracy high, even for numericals." - Verified user review of Gladia

Alerting on latency spikes

Production latency degrades as traffic patterns shift and infrastructure capacity fluctuates. Track metrics at multiple pipeline stages separately: TTFT, final transcript delay, and network RTT. Blending them into a single average makes it impossible to attribute a spike to network degradation versus provider inference slowdown. Set P95 alerts at thresholds calibrated to your user expectations rather than averages. Monitor the Gladia status page for incident history and uptime tracking. A well-documented incident history signals operational maturity, which is why our benchmark methodology is published openly and reproducibly.
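Per-stage P95 alerting can be sketched as follows. The thresholds are illustrative; calibrate them to your own measured baselines:

```javascript
// Illustrative P95 alert thresholds per pipeline stage (ms).
const thresholds = { ttftMs: 300, finalDelayMs: 500, networkRttMs: 150 };

// Nearest-rank P95 over raw samples.
function p95(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.ceil(0.95 * sorted.length) - 1];
}

// Evaluate one stage; keeping stages separate is what lets you
// attribute a spike to network vs. provider inference.
function checkStage(name, samples) {
  const value = p95(samples);
  return { name, p95: value, breached: value > thresholds[name] };
}

console.log(checkStage("networkRttMs", [80, 90, 95, 110, 400]));
```

In this sample, four healthy RTT measurements and one 400ms outlier are enough to breach the 150ms P95 threshold, which is precisely the tail behavior an average would smooth away.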

Get started

Start with 10 free hours and have your WebSocket integration in production in less than a day. Test Solaria-1's ~300ms real-time latency alongside the async pipeline for post-meeting accuracy using the same API key. Our pricing page shows current per-hour rates for each plan.

FAQs

What is the acceptable latency threshold for live meeting transcription?

For partial (interim) transcripts to feel responsive, end-to-end latency should stay under 500ms, with under 300ms TTFT as the target for natural-feeling real-time display. Beyond 500ms, users typically stop reading live output and wait for the post-meeting summary.

How do you track P95 latency for a responsive transcription UX?

Log timestamps at audio capture start and at first WebSocket response received, then report P50, P95, and P99 across multiple utterances per test condition. Tracking percentiles separately exposes tail behavior that averages hide.

How do you achieve accurate speaker attribution in a live meeting?

Accurate speaker diarization requires full audio context, which means batch (async) processing after the call ends rather than real-time streaming. Use real-time transcription for the live UX during the meeting, then run an async diarization pass to produce accurate speaker labels for the final notes.

What accuracy trade-off does real-time processing carry compared to async?

Real-time STT runs on partial audio context, which produces higher WER than async transcription on the same recording because the model lacks future context when processing each token. For meeting summaries and CRM entries requiring high accuracy, async is the right workflow.

Key terms glossary

STT (speech-to-text): The process of converting spoken audio into written text using an acoustic and language model. Accuracy is typically measured as word error rate (WER) on a labeled test set.

WER (word error rate): The ratio of insertions, deletions, and substitutions in a transcript to the total number of reference words, expressed as a percentage. Lower is better.

DER (diarization error rate): The proportion of audio time incorrectly attributed to the wrong speaker, measured over the full recording. Used to evaluate speaker separation quality in async batch pipelines.

TTFT (time to first transcript token): The elapsed time from when a speech segment begins to when the first partial transcript token arrives at the client. The primary metric for perceived real-time responsiveness.

WebSocket: A full-duplex communication protocol over a single TCP connection that enables low-overhead, bidirectional streaming between a client and a server, used for real-time audio and transcript exchange in streaming STT pipelines.
