TL;DR: Most meeting note-takers don't need real-time transcription. Streaming ASR delivers audio to your pipeline as it's spoken, but batch (async) models access the full audio context, which means better accuracy, simpler infrastructure, and lower per-hour costs. Real-time is the right choice when your product requires live on-screen feedback during a call. For post-meeting summaries, action item extraction, and compliance recording, async processing is the stronger architectural pick.
Real-time transcription models carry an accuracy penalty relative to async in conditions where future phonetic context affects word resolution, and for a meeting note-taker that gap is the difference between a usable summary and a transcript full of garbled speaker turns.
Yet streaming is a common default for meeting tooling, and teams frequently spend weeks debugging WebSocket connection drops and partial transcript logic before concluding their use case required only an accurate summary delivered five minutes after the call.
Choosing between async and real-time transcription is an architectural decision that dictates your infrastructure complexity, unit economics at scale, and WER in production. This guide gives you the latency benchmarks, accuracy trade-offs, and cost models to make the right call for your AI note-taker. For meeting-note products, the real decision is not only which transcription mode to use, but which audio infrastructure layer will reliably capture, structure, and pass conversation data downstream.
The architectural divide: async vs. real-time transcription
The two approaches differ in one fundamental way: when the model sees the audio.
Asynchronous (batch) transcription means you upload the complete audio file to the API after the recording ends. The model processes the entire utterance with full forward and backward context available, then returns a final transcript via webhook or polling. No partial results, no connection state to manage.
Real-time (streaming) transcription means audio chunks (typically tens to hundreds of milliseconds) arrive at the ASR model as the speaker talks. You get two output types: partial results (low-latency but unstable, useful for on-screen live captions) and final results (stable transcriptions generated after the endpointing mechanism detects the end of an utterance). You run real-time pipelines over WebSocket connections, which require client-side logic to handle connection drops, reconnection backoff, and merge logic between partial and final transcripts.
The trade-off: accuracy vs. latency in production
Why real-time models carry an accuracy penalty in common meeting conditions
Streaming ASR encoders model output as a function of input up to time t, while batch ASR encoders condition on the complete input up to time T. Sliding-window causal attention and right-chunk lookahead techniques give streaming systems limited future context, but in the conditions common to meeting audio (accented speech, code-switching, and overlapping speakers), streaming models tend to produce higher WER than batch processing on equivalent audio, because the encoder cannot access enough future context to resolve ambiguous boundaries.
This gap is most visible in three conditions common to meeting audio: accented speech (where more phonetic context resolves ambiguous sounds), code-switching (where retroactive language classification is not available in real-time, because the model cannot update prior output once an audio chunk has been emitted), and overlapping speakers (where diarization accuracy depends on full utterance context). A production benchmark from Claap's case study shows 1-3% WER achieved using Gladia's async pipeline, with one hour of video transcribed in under 60 seconds. In practice, transcript quality sets the ceiling for everything downstream, including summaries, action item extraction, CRM enrichment, and any other workflow built on top of the conversation data.
Latency benchmarks for natural conversation
Research from NN Group establishes three cognitive thresholds for system response time: 0.1 seconds feels instantaneous, 1.0 second keeps the user's flow uninterrupted, and 10 seconds is the boundary for maintaining attention. For conversational applications, a study on conversational turn-taking found an average 239ms gap between speakers in English. That makes sub-second latency a more relevant benchmark for real-time conversational UX than generic web response-time guidance alone.
| Use case | Target latency | Why |
| --- | --- | --- |
| Natural conversation (voice agents) | Under 300ms final transcript | Matches human turn-taking baseline |
| Live captions, UI feedback | Under 100ms partial transcript | Perceived as instantaneous |
Use case mapping for meeting note-takers
Match your architecture decision to the conditions below. If your trigger is on the left, build async. If it's on the right, build real-time.
| Choose async when... | Choose real-time when... |
| --- | --- |
| Users receive the transcript after the call ends and latency above 300ms is acceptable | Users need transcription output visible during the call |
| You're generating post-meeting summaries, action item extraction, or diarized speaker attribution | You're building live agent assist, interrupt-based voice agent pipelines, or LLM inference that depends on sub-300ms transcript delivery |
| You're processing compliance recordings or call archives, where full-context accuracy matters more than speed | You're delivering live captions for accessibility or real-time UI feedback |
| You're running batch analytics on recorded calls and can tolerate processing time in exchange for higher transcription accuracy | You need partial transcripts under 100ms for perceived-instantaneous UI responsiveness |
When to build an asynchronous pipeline
Build async when your user doesn't need to see transcription output until after the call ends. This covers the majority of meeting assistant use cases: post-meeting summaries, compliance recording, searchable archives, and batch analytics pipelines.
For recorded video transcription, word-level timestamp precision depends on accurate segment boundaries, which requires full-context processing: the model needs to resolve phoneme boundaries across the complete audio before committing to a final transcript. Segment-level accuracy of this kind is not reliably achievable with streaming, where the model commits to transcript output chunk-by-chunk before the full audio context is available.
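As a sketch of what consuming those word-level timestamps can look like downstream, the helper below flattens per-utterance word timings into one list. The result shape (utterances containing words with `start`/`end` seconds) is an assumed structure for illustration, not a confirmed response schema:

```python
def words_with_timestamps(result: dict) -> list[tuple[str, float, float]]:
    """Flatten word-level timings from a transcript result into
    (word, start_seconds, end_seconds) tuples. The input shape is a
    hypothetical structure, not a documented API schema."""
    out = []
    for utterance in result.get("utterances", []):
        for w in utterance.get("words", []):
            out.append((w["word"], w["start"], w["end"]))
    return out

# Hypothetical sample payload for illustration
sample = {"utterances": [{"words": [
    {"word": "ship", "start": 0.42, "end": 0.71},
    {"word": "Friday", "start": 0.78, "end": 1.24},
]}]}
print(words_with_timestamps(sample))
```

Word-level timings like these are what make transcript-to-video highlight linking and precise caption alignment possible.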
The audio-to-LLM integration docs show how to pipe an async transcript directly into a summarization or action item extraction workflow in a single API call.
When to require real-time streaming
Real-time is the right choice when your product's core value depends on delivering transcription output while the call is in progress:
- Live agent assist: A contact center agent needs a suggested response before the caller finishes speaking. Sub-300ms final latency is the functional requirement.
- Voice agent pipelines: Applications built on LiveKit, Pipecat, or Vapi need streaming transcription as the first stage in the LLM inference pipeline.
- Live captions for accessibility: Delivering captions to meeting participants in real time requires streaming audio processing.
- Live collaborative features: Some meeting tools surface real-time sentiment or speaker identification to other participants as an in-call feature.
The end-to-end voice agent webinar walks through combining real-time STT with downstream TTS for sub-300ms round trips, and the Discord bot tutorial shows a practical WebSocket implementation for real-time voice applications. For most meeting assistant and post-call workflows, async transcription remains the primary and more reliable architecture, with real-time used only where immediate feedback is required.
Integration patterns: REST vs. WebSockets
The two architectures require fundamentally different client implementations.
Async (REST): A single POST request uploads the audio file or URL. The API queues and processes it, then returns results via webhook or polling. The client is stateless, which eliminates the failure surfaces specific to WebSocket lifecycle management: no connection drops to detect, no reconnection backoff to implement, no partial transcript state to reconcile. Async pipelines do have their own failure modes: webhook delivery can fail and polling requests can time out. Standard HTTP retry patterns handle both, which keeps the error-handling surface well-defined and bounded compared to a persistent connection. A single-vendor async pipeline also reduces the need to stitch together separate vendors for transcription, enrichment, and downstream workflow logic, which is where many meeting-note stacks become brittle.
```python
import requests

response = requests.post(
    "https://api.gladia.io/v2/pre-recorded/",
    headers={"x-gladia-key": "YOUR_API_KEY"},
    json={
        "audio_url": "https://your-storage.com/meeting-recording.mp3",
        "diarization": True,
        "summarization": True,
        "sentiment_analysis": True,
    },
)
transcription_id = response.json()["id"]
# Poll or receive via webhook - all features included at base rate
```
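A hedged sketch of the polling side, with a bounded timeout and a fixed retry interval. The GET endpoint path and the `status` values are assumptions modeled on common async STT APIs; confirm them against the API reference before relying on them:

```python
import time

import requests

def poll_transcript(transcription_id: str, api_key: str,
                    interval_s: float = 2.0, timeout_s: float = 300.0) -> dict:
    """Poll the async job until it reaches a terminal status or times out.
    Endpoint path and status names are illustrative assumptions."""
    url = f"https://api.gladia.io/v2/pre-recorded/{transcription_id}"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = requests.get(url, headers={"x-gladia-key": api_key}, timeout=30)
        resp.raise_for_status()
        job = resp.json()
        if job.get("status") in ("done", "error"):
            return job
        time.sleep(interval_s)  # fixed interval; add backoff/jitter for production
    raise TimeoutError(f"transcription {transcription_id} not finished in {timeout_s}s")
```

In production, prefer the webhook path and keep polling as a fallback; either way the error handling stays ordinary HTTP retries rather than connection-state management.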
Real-time (WebSocket): You open a persistent connection, stream audio chunks, and handle partial and final transcript messages on separate event types. Reconnection logic, connection lifecycle management, and merge logic between partial and final results all live in your client code.
```python
import asyncio
import json

import websockets

async def stream_audio():
    uri = "wss://api.gladia.io/audio/text/audio-transcription"
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({
            "x_gladia_key": "YOUR_API_KEY",
            "encoding": "WAV/PCM",
            "sample_rate": 16000,
            "language_behaviour": "automatic single language",
        }))
        # Stream audio chunks and handle partial/final transcript events

asyncio.run(stream_audio())
```
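The merge logic between partial and final results can be sketched as a small client-side buffer. The message field names (`type`, `transcript`) are assumptions for illustration, since the exact event shape varies by provider:

```python
class TranscriptBuffer:
    """Reconciles unstable partial results with committed finals for display."""

    def __init__(self):
        self.committed = []  # finalized utterances, never rewritten
        self.pending = ""    # latest partial, overwritten on every update

    def on_message(self, msg: dict) -> str:
        if msg.get("type") == "final":
            self.committed.append(msg["transcript"])
            self.pending = ""  # the final supersedes any pending partial
        else:
            # Partials revise themselves: replace, never append
            self.pending = msg["transcript"]
        return self.render()

    def render(self) -> str:
        return " ".join(self.committed + ([self.pending] if self.pending else []))

buf = TranscriptBuffer()
buf.on_message({"type": "partial", "transcript": "let's move"})
buf.on_message({"type": "partial", "transcript": "let's move the launch"})
print(buf.on_message({"type": "final", "transcript": "Let's move the launch to Friday."}))
```

This replace-then-commit pattern is the core of most live-caption UIs; getting it wrong produces duplicated or flickering text on screen.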
Data privacy, DPAs, and model retraining
For products handling meeting audio, two questions to resolve with any STT vendor before the contract stage:
- Does the vendor retrain their models on customer audio by default?
- Where is audio stored during processing, and for how long?
Data usage policies vary by provider and plan, so it is important to confirm how training, retention, and opt-out mechanisms are handled before processing sensitive audio.
On Gladia's paid plans, customer audio is not used for model training by default. Enterprise plans add zero data retention as a standard term.
Processing runs in EU-west and US-west regions, and on-premises and air-gapped deployments are available for organizations with strict data residency requirements. Gladia is GDPR-compliant, SOC 2 Type 2 certified, HIPAA-eligible, and ISO 27001 certified, with a Data Processing Agreement available for review.
Cost modeling for transcription at scale
The total cost of transcription in production depends on which features you enable, not just the base transcription rate. Per-feature add-on pricing makes total cost harder to model at scale when multiple features are enabled.
Worked cost comparison at 1,000 and 10,000 hours per month with diarization enabled, based on Gladia's published pricing and competitor pricing at the time of writing:
| Volume | Gladia (Starter async, bundled pricing) | AssemblyAI (base + diarization add-on) | Deepgram (base + speaker diarization) |
| --- | --- | --- | --- |
| 1,000 hours/month | $610 | $150 + $20 = $170 | ~$582 |
| 10,000 hours/month | $6,100 | $1,500 + $200 = $1,700 | ~$5,820 |
AssemblyAI's published pricing starts at $0.15/hour base with speaker diarization at $0.02/hour extra, so the cost model rises as additional features are enabled beyond transcription alone.
Gladia uses usage-based hourly pricing with all audio intelligence features included in the base rate. The Starter async tier is priced at $0.61/hour, and the Growth async tier starts as low as $0.20/hour for teams at higher volume. Both plans use bundled feature and language pricing; see the pricing page for plan-level details. That makes total cost easier to model than pricing structures where individual features are metered separately.
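The bundled-versus-metered difference reduces to a one-line cost model. The rates below are the published figures quoted above, treated as snapshot values rather than current pricing:

```python
def monthly_cost(hours: float, base_rate: float, addon_rates: tuple = ()) -> float:
    """Monthly cost = hours * (base hourly rate + sum of per-feature add-on rates).
    Bundled pricing is the degenerate case with no add-ons."""
    return hours * (base_rate + sum(addon_rates))

# Bundled: one rate covers transcription plus audio intelligence features
bundled = monthly_cost(1000, 0.61)
# Metered: base transcription rate plus a per-hour diarization add-on
metered = monthly_cost(1000, 0.15, (0.02,))
print(bundled, metered)
```

The practical point is not which single number is lower, but that metered pricing makes the total a moving target as you enable features, while bundled pricing keeps it a straight multiplication.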
Solaria-1 is the relevant reference point for this comparison, because Gladia's current platform is built around a managed speech layer rather than the older Whisper-based stack. The comparison is not only API cost versus API cost, but managed platform cost versus infrastructure and engineering time combined, especially once diarization, hallucination mitigation, scaling, and maintenance are included. For teams evaluating production performance, Gladia's latest benchmark shows where the platform leads, including 3x better DER and 29% lower WER on conversational speech than alternatives. The Starter plan includes 10 free monthly hours, which is enough to run an evaluation on your own audio samples before committing to a cost model.
How this maps to Gladia's implementation
Solaria-1 runs both async and real-time pipelines from the same model, with broad multilingual coverage across 100+ languages, including Tagalog, Bengali, Punjabi, Tamil, Urdu, and Persian. It also detects mid-utterance language transitions and tags each segment with its identified language, which matters for BPO operations, multilingual support teams, and global meeting assistants processing both recorded and live audio.
The Aircall integration reduced transcription time by 95% after moving from a self-hosted solution, freeing engineering capacity for product features rather than infrastructure maintenance.
If your pipeline processes recorded audio (meetings, support calls, uploaded media), async transcription is likely the more accurate and operationally straightforward path. When evaluating vendors, check whether diarization is powered by a named model (such as pyannoteAI Precision-2), confirm language coverage against your actual speaker population, and verify privacy defaults around audio retention and retraining. Testing on your own audio samples, particularly those with accented speech or mid-conversation language switches, will surface accuracy gaps that benchmark numbers alone may not capture.
Data privacy defaults by plan
On paid plans, audio submitted through the API is not used to train Gladia's models by default. No opt-out configuration is required, and no enterprise contract clause is needed to activate this protection. Enterprise plans add zero data retention as a standard term, and a Data Processing Agreement is available for review before contract signature.
Gladia is SOC 2 Type 2 certified, GDPR-compliant, HIPAA-eligible, and ISO 27001 certified.
Evaluating on your own audio
Benchmark numbers reflect controlled dataset conditions. The most reliable signal for your specific use case is running evaluation against your own audio, particularly samples that include accented speech, code-switching, or overlapping speakers. Gladia includes 10 free hours per month, which is sufficient to run a representative evaluation across multiple languages and speaker conditions before committing to a paid tier.
FAQs
Is real-time transcription less accurate than async?
Yes. Streaming ASR models process audio without access to future context, so the model cannot resolve ambiguous phoneme boundaries or correct speaker attribution retroactively. The accuracy gap widens on accented speech, code-switching, and overlapping speakers.
What is the typical latency for real-time transcription?
For natural conversation, final transcripts need to arrive under 300ms to match human turn-taking latency. Gladia Solaria-1 supports low-latency real-time transcription suitable for conversational applications. In practice, real-time systems should target sub-300ms responsiveness for usable conversational UX, with partial transcripts available for immediate rendering and final transcripts following shortly after.
Can I run diarization in real-time?
No. Gladia does not currently offer full diarization in real time. Its async diarization is powered by pyannoteAI’s Precision-2 model, which depends on full audio context and runs as part of the async transcription pipeline. For real-time workflows, speaker handling is more limited and should not be described as full diarization or full speaker identity assignment.
What are the main benefits of async transcription for meeting notes?
Async gives the model the full audio context, which produces higher accuracy transcripts, simpler client-side integration (stateless REST calls with no WebSocket lifecycle to manage), and additional audio intelligence features configurable in the same API workflow.
Key terminology
Word Error Rate (WER): The standard ASR accuracy metric, calculated as substitutions plus deletions plus insertions divided by total reference words. A WER of 6% means roughly 6 words per 100 are transcribed incorrectly. Lower is better.
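As a minimal reference implementation of the metric (standard Levenshtein distance over word tokens; this is a textbook sketch, not any vendor's scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("starts" -> "start") + one insertion ("new") over 5 words
print(wer("the meeting starts at noon", "the meeting start at new noon"))  # 0.4
```

Note that WER treats every word equally: a garbled filler word and a garbled client name cost the same, which is one reason evaluation on your own audio matters more than a single headline number.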
Diarization: The process of segmenting an audio stream by speaker identity, producing labeled output ("Speaker 1," "Speaker 2"). Gladia’s diarization is powered by pyannoteAI's Precision-2 model and runs as part of the async transcription pipeline, requiring full audio context.
Code-switching: Mid-conversation language changes, where a speaker switches languages within or between utterances. Code-switching is a known challenge for ASR systems that rely on a fixed declared language, because mid-utterance language changes fall outside the model's expected phoneme distribution. Solaria-1 detects mid-utterance language transitions across 100+ languages and tags each segment with its identified language.
Endpointing: The mechanism a real-time ASR system uses to detect when a speaker finishes an utterance, triggering generation of a final stable transcript from accumulated partial results. Endpointing configuration directly affects perceived latency and the frequency of cut-off transcripts in voice agent pipelines.
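A toy illustration of the simplest form of endpointing: energy thresholding over fixed frames. Production systems typically use model-based voice activity detection rather than a raw energy threshold, so treat this purely as a sketch of the concept:

```python
def endpoint(frame_energies, silence_threshold=0.01, min_silence_frames=25):
    """Return the index of the first frame of the trailing silence run once
    `min_silence_frames` consecutive frames after speech fall below the
    threshold, or None if no endpoint is detected."""
    silent_run = 0
    seen_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            seen_speech = True  # speech resets the silence counter
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i - min_silence_frames + 1  # start of the silence run
    return None

# 10 frames of speech followed by sustained silence -> endpoint at frame 10
print(endpoint([0.5] * 10 + [0.0] * 30))  # 10
```

The `min_silence_frames` setting is the latency/accuracy dial: too low and mid-sentence pauses cut transcripts off; too high and final results arrive noticeably late in a voice agent pipeline.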