How to measure latency in speech-to-text (TTFB, Partials, Finals, RTF): A deep dive

Published on Sep 30, 2025

Latency can make or break a voice experience. Whether you’re building an agent that must stop speaking the moment a customer interrupts, or you’re captioning live content, you need a clear, reproducible way to measure how fast your STT really is, from first partial word to final transcript. 

At Gladia, we benchmark our models with rigorous, language-aware procedures that separate “time to first partial token” from “time to final result.” That discipline is how we consistently deliver sub-300 ms partial and ~700 ms final latency on 3-second utterances without sacrificing accuracy. 

Below is a practical guide you can adapt for your own stack. 

TL;DR 

  • Measure at multiple milestones: audio capture start → first partial hypothesis (TTFB/TTFT) → last partial update → final transcript. Use P50/P95/P99 per metric.

  • Benchmark both streaming and file modes; add Real-Time Factor (RTF) to quantify throughput vs. duration. 
  • Control the usual suspects: frame/chunk size, endpointing (VAD) thresholds, network jitter, and stabilization settings; each trades latency vs. stability/accuracy.
  • Report apples-to-apples: same utterance lengths, codecs, sampling rates, and silence budgets. 
  • Ship SLOs, not anecdotes: e.g., P95 TTFB ≤ 300 ms and P95 Final ≤ 800 ms for 3-s utterances, plus RTF < 1.0 in production. 

Why latency is tricky (and how to define it precisely) 

Latency is a sequence of milestones, not a single scalar. 

Audio capture starts → first partial token (TTFB) → partial update cadence → endpointing decision → final transcript 

Different knobs affect different milestones (e.g., frame size shifts TTFB; endpointing shifts final latency), which is why you should always report P50/P95/P99 per milestone instead of one blended number (a minimal percentile sketch follows the list below). Streaming STT APIs are designed around this timeline: they accept audio in frames and emit partial vs. final results by design.

  • P50 (median): What a “typical” user sees. Half of requests are faster, half slower.
  • P95: Tail latency early warning. 5% of requests are worse than this.
  • P99: Critical tail. The worst 1% of requests—often where high-value traffic lives.
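
As a reference point, here is a minimal sketch of computing those percentiles per milestone from collected latency samples. The milestone names and numbers below are illustrative, not real benchmark data:

# Minimal percentile summary for latency milestones (Python standard library).
from statistics import quantiles

def summarize(samples_ms):
    """Return P50/P95/P99 for a list of latency samples in milliseconds."""
    pts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": pts[49], "p95": pts[94], "p99": pts[98]}

# Illustrative data: latencies per milestone, one entry per benchmarked utterance.
milestones = {
    "ttfb_ms":  [212, 248, 233, 305, 221, 199, 412, 240],
    "final_ms": [655, 702, 688, 850, 671, 640, 990, 710],
}

for name, samples in milestones.items():
    print(name, summarize(samples))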

When measuring latency for your real-time voice application, use these clear-cut, vendor-agnostic definitions:

Time To First Byte (TTFB)

Time To First Byte (TTFB) corresponds to the time from speech start to the first partial transcript arriving. It directly affects a conversational AI agent's perceived snappiness. The term TTFB is borrowed from web performance: client request → first byte of the server response.

At Gladia, for a 3-second English utterance on a stable network, we publish TTFB and Final latencies separately (median and P95), rather than one blended number. This is because customers feel “instant feedback” differently from “utterance completed.” 
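
Here is a minimal sketch of timing TTFB against a streaming WebSocket endpoint. The URL, the PCM format, and the {"type": "partial"} event shape are placeholder assumptions, not any particular provider's protocol; adapt them to your stack.

# Sketch: time-to-first-partial (TTFB) over a streaming STT WebSocket.
# The endpoint URL, PCM format, and the {"type": "partial"} event shape are
# hypothetical placeholders; substitute your provider's real protocol.
import asyncio, json, time
import websockets  # pip install websockets

SAMPLE_RATE = 16000                     # 16 kHz, 16-bit mono PCM assumed
BYTES_PER_MS = SAMPLE_RATE * 2 // 1000  # bytes of audio per millisecond

async def measure_ttfb(url: str, pcm_audio: bytes, chunk_ms: int = 100) -> float:
    """Milliseconds from stream start to the first partial transcript.
    Uses stream start as a proxy for speech start, so trim leading silence
    from test clips (or subtract it) for a fair number."""
    async with websockets.connect(url) as ws:
        t_start = time.monotonic()

        async def send_audio():
            step = chunk_ms * BYTES_PER_MS
            for i in range(0, len(pcm_audio), step):
                await ws.send(pcm_audio[i:i + step])
                await asyncio.sleep(chunk_ms / 1000)  # pace at real time

        sender = asyncio.create_task(send_audio())
        try:
            async for message in ws:
                event = json.loads(message)
                if event.get("type") == "partial":    # hypothetical event type
                    return (time.monotonic() - t_start) * 1000
            raise RuntimeError("stream closed before any partial arrived")
        finally:
            sender.cancel()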

Partial Update Cadence 

Partial Update Cadence refers to how frequently partials arrive (e.g., tokens/sec). Higher cadence improves interactivity (e.g. for barge-in, live captions). 
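
A small sketch of deriving cadence from the arrival times of partial events, assuming you log each partial as an arrival timestamp plus the transcript text so far:

# Cadence: how often partial updates arrive and how fast tokens accumulate.
def cadence(partial_events):
    """partial_events: list of (arrival_time_s, transcript_so_far) tuples."""
    if len(partial_events) < 2:
        return None
    t0, t1 = partial_events[0][0], partial_events[-1][0]
    if t1 <= t0:
        return None
    tokens = len(partial_events[-1][1].split())  # crude whitespace token count
    return {"updates_per_s": (len(partial_events) - 1) / (t1 - t0),
            "tokens_per_s": tokens / (t1 - t0)}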

Final Transcript Latency 

Final Transcript Latency measures the time from end-of-speech to the final, stable transcript for the utterance. This is what becomes ground truth for downstream tasks such as RAG, analytics, NLU, etc. 

Endpointing Latency 

Endpointing Latency refers to the time between the actual end-of-speech and when the system decides the user has stopped talking (VAD/endpoint detection). Tunable silence thresholds heavily influence this. 
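
Both Final Transcript Latency and Endpointing Latency need a reference end-of-speech time; for pre-recorded benchmark clips that usually comes from annotation or forced alignment. A minimal sketch, assuming those timestamps share one monotonic clock:

# Final-transcript latency and endpointing latency for one utterance.
# Assumes timestamps in seconds on the same monotonic clock as the stream:
#   t_speech_end  - actual end of speech in the test clip (annotation/alignment)
#   t_endpoint    - when the system decided the user stopped talking
#   t_final       - when the final transcript event arrived
def utterance_latencies(t_speech_end, t_endpoint, t_final):
    return {"endpointing_ms": (t_endpoint - t_speech_end) * 1000,
            "final_ms": (t_final - t_speech_end) * 1000}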

Real-Time Factor (RTF) 

RTF = processing time ÷ audio duration 

Use it to compare throughput and capacity planning; RTF < 1 means you can keep up with live audio. Track mean and tail (P95/P99). 
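
For example, in file mode you might compute RTF per file and keep the tails alongside the mean (the numbers below are illustrative):

# Real-Time Factor: processing time divided by audio duration.
from statistics import mean, quantiles

def rtf(processing_s: float, audio_s: float) -> float:
    return processing_s / audio_s

# Illustrative per-file measurements: (processing seconds, audio seconds).
runs = [(1.9, 3.0), (2.1, 3.0), (14.0, 30.0), (4.5, 3.0)]  # last one is a slow outlier
rtfs = [rtf(p, a) for p, a in runs]
cuts = quantiles(rtfs, n=100, method="inclusive")
print(f"mean={mean(rtfs):.2f}  P95={cuts[94]:.2f}  P99={cuts[98]:.2f}")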

The critical knobs that move latency (and how to measure their impact) 

Frame / chunk size 

You stream audio in frames. Larger frames are efficient but add delay. Google recommends ~100 ms as a practical trade-off. For best results, measure with 20, 50, 80, and 100 ms frames to find the knee of your latency curve. 
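
A sketch of how such a sweep could be driven, reusing the hypothetical measure_ttfb helper from the TTFB section above (chunk sizes and repeat count are just examples):

# Sweep frame/chunk sizes and compare TTFB percentiles to locate the knee.
from statistics import median, quantiles

async def sweep(url, pcm_audio, chunk_sizes_ms=(20, 50, 80, 100), repeats=20):
    for chunk_ms in chunk_sizes_ms:
        samples = [await measure_ttfb(url, pcm_audio, chunk_ms=chunk_ms)
                   for _ in range(repeats)]
        p95 = quantiles(samples, n=100, method="inclusive")[94]
        print(f"{chunk_ms:>4} ms frames: P50={median(samples):.0f} ms  P95={p95:.0f} ms")

# e.g. asyncio.run(sweep("wss://example.invalid/v2/live", pcm_bytes))  # placeholder URL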

Endpointing (VAD) thresholds 

Silence duration and thresholds decide when an utterance “ends.” Defaults vary widely across ecosystems: 10 ms (configurable) according to some of our competitors’ docs vs. ~500 ms default in popular real-time agent stacks. Your choice affects final latency and turn-taking quality. 
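
To see why the threshold is a hard floor on endpointing latency, consider a toy energy-based endpointer (illustrative only, not any vendor's implementation; production systems use model-based VADs): the detector cannot decide until silence_ms of audio has elapsed after the last speech frame.

# Toy energy-based endpointer (illustrative only; production VADs are model-based).
import struct

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

def endpoint(frames, frame_ms=20, silence_ms=500, energy_threshold=300):
    """frames: iterable of PCM chunks, each frame_ms long.
    Returns the frame index at which end-of-speech is declared, or None."""
    needed = silence_ms // frame_ms
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) < energy_threshold:
            quiet += 1
            if quiet >= needed:   # decision arrives >= silence_ms after speech stops
                return i
        else:
            quiet = 0
    return None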

Network transport & jitter 

WebRTC / gRPC / WebSocket pipelines have jitter buffers that smooth variable packet arrival—great for robustness, but buffers add delay. Test on “good” vs. “lossy” network profiles to quantify impact. 

Modern algorithms explicitly optimize a cost trade-off between extra latency and packet loss. When benchmarking, profile with a “good” target (low RTT/jitter) and a “stress” target (e.g., 80–150 ms RTT with jitter) so you can quantify the latency added by the buffer’s target delay.
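
A lightweight way to approximate a stress profile without a network emulator is to perturb the sender's pacing; the sketch below (reusing BYTES_PER_MS from the TTFB sketch) injects a crude Gaussian delay per chunk. For production-grade shaping, tools such as Linux tc/netem give you proper RTT, jitter, and loss control.

# Crude upstream-jitter injection: delay each outgoing chunk by a random amount
# around a target extra delay, then compare TTFB/Final against the clean run.
# Reuses BYTES_PER_MS from the TTFB sketch; only approximates one direction.
import asyncio, random

async def send_with_jitter(ws, pcm_audio, chunk_ms=100,
                           extra_delay_ms=100, jitter_ms=40):
    step = chunk_ms * BYTES_PER_MS
    for i in range(0, len(pcm_audio), step):
        delay_ms = chunk_ms + max(0.0, random.gauss(extra_delay_ms, jitter_ms))
        await asyncio.sleep(delay_ms / 1000)
        await ws.send(pcm_audio[i:i + step])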

Interpreting results (with realistic targets) 

Snappiness (TTFB) 

Aim for ≤ 100 ms P95 in interactive agents. This goal aligns with widely used UX heuristics (RAIL/MDN): interactions acknowledged within ~100 ms feel instantaneous. Beyond that, users would benefit from visible feedback.

Completion (Final) 

Final latency ≤ 700–800 ms P95 after end-of-speech is a good bar for real-time dialogs; tune silence budgets to avoid cutting users off.

Conversation health 

For telephony-style voice experiences, budget end-to-end one-way conversational delay using ITU-T guidance: keep it < 150 ms for “best quality,” and avoid exceeding 400 ms for general network planning. These bounds are helpful when you apportion delay across STT + TTS + network so that conversational turn-taking remains natural. 
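
For instance, a hypothetical apportionment against the 400 ms planning bound might look like this (the component numbers are illustrative, not a recommendation):

# Illustrative one-way delay budget (ms) against the ITU-T 400 ms planning bound.
budget_ms = {"network_and_jitter_buffer": 120, "stt_first_partial": 150,
             "agent_logic": 50, "tts_first_audio": 80}
total = sum(budget_ms.values())
print(f"one-way total: {total} ms of 400 ms planning limit")
assert total <= 400, f"over budget by {total - 400} ms"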

Common pitfalls (and how to avoid them) 

Conflating TTFB with Final 

Users perceive these differently—agents feel responsive on fast TTFB even if final arrives later.

To account for this, publish percentiles (P50/P95/P99) for both partial (TTFB) and final latency. Keep your frame size, endpointing threshold, and network profile constant across runs so these numbers are apples-to-apples.

Hidden buffering 

Frame sizes, client audio encoders, and jitter buffers quietly add delay. Document your exact frame/chunk size (e.g., 100 ms guidance from Google) and transport. 

Endpointing defaults that don’t match your UX 

Agent stacks often default to ~500 ms silence; speech captioning may want shorter; dictation may want longer. Tune per use case and publish your thresholds. 

Network variability not modeled 

Always run tests under shaped conditions (RTT, jitter, loss) so bench numbers survive the real world. WebRTC jitter research shows buffers trade dropouts for delay. 

What “good” looks like (example SLOs you can adapt) 

  • Interactive agents (barge-in): P95 TTFB ≤ 300 ms and P95 Final ≤ 800 ms on 3-s utterances; endpointing silence ~300–600 ms depending on domain (a minimal SLO gate sketch follows this list). 
  • Live captions: prioritize cadence and low TTFB; allow longer endpointing to avoid premature finals.
  • Dictation: tolerate longer endpointing for fewer fragmented finals; optimize for accuracy over speed. Adjust silence duration accordingly.
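
Once targets are agreed, a minimal sketch of gating a benchmark run on them; the thresholds below mirror the interactive-agent example above and the TL;DR, so adapt them per use case:

# SLO gate: fail the benchmark run if tail latencies exceed the agreed targets.
SLOS = {"ttfb_p95_ms": 300, "final_p95_ms": 800, "rtf_p95": 1.0}

def check_slos(measured: dict) -> list:
    """measured: e.g. {"ttfb_p95_ms": 268, "final_p95_ms": 710, "rtf_p95": 0.4}"""
    return [f"{name}: {measured[name]} > {limit}"
            for name, limit in SLOS.items()
            if measured.get(name, float("inf")) > limit]

violations = check_slos({"ttfb_p95_ms": 268, "final_p95_ms": 910, "rtf_p95": 0.4})
if violations:
    raise SystemExit("SLO violations: " + "; ".join(violations))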

Final thoughts

When we built Solaria, we weren’t optimizing for leaderboard scores. We were optimizing for reality. We followed the best practices outlined in this article when benchmarking. The result: Solaria consistently outperforms the competition on both TTFB (~270 ms) and latency to final (~698 ms).

When it comes to STT, you need to benchmark early, extensively, and often. Most of our customers benchmark on a quarterly basis. In addition to latency, they benchmark accuracy (WER, WAR, entity accuracy, and more) to help them catch regressions early and ensure consistent performance as their audio environments evolve. To go further, read our comprehensive guide on benchmarking STTs for real-world performance.

Latency FAQ 

What’s the difference between TTFB and Final? 

TTFB is your first partial—perceived responsiveness. Final is the stable transcript on utterance completion—functional correctness. Publish both. 

How do partial-result settings change latency? 

Stabilization reduces flicker and can reduce perceived latency but may reduce interim accuracy. Test both modes and report. 

What silence threshold should I use? 

There’s no universal value. Some of our competitors’ docs show very short defaults that are configurable; real-time agent stacks often default to ~500 ms. Choose per UX (captions vs. agent) and publish it. 

Should I care about RTF if I already publish TTFB/Final? 

Yes. RTF predicts whether your system will keep up under concurrency and informs GPU/CPU capacity planning. Track mean and tail. 

How do network conditions factor in?

Jitter buffers smooth the ride but add delay. Always benchmark under controlled RTT/jitter/loss profiles so results generalize.
