A breakdown of delay sources in an STT–LLM–TTS pipeline, and concrete levers to get under one second.
Published on April 28, 2026 | AI Voicebots & Voice Agents
Latency is the number-one enemy of voice bot experience. A response delay of more than one second is enough to break the illusion of natural conversation — the person on the other end hesitates, loses confidence, or simply hangs up. Yet latency is widely misunderstood: it has multiple faces, multiple sources, and multiple levers. Here's how to break it down and attack it methodically.
The golden rule: 1 second
For a voice conversation to feel natural, the voice bot must start responding within one second of the user finishing their sentence. Beyond that, the exchange feels artificial. Below 500 ms, the agent feels sharp and reactive.
But don't confuse measured latency with perceived latency. An agent that responds in 900 ms but lets the user interrupt naturally will often feel more fluid than an agent that responds in 600 ms but cuts off the user or doesn't let them finish their sentences.
Anatomy of the chain: where time goes
Every pipeline-based voice bot (STT → LLM → TTS) accumulates latency at each stage. Here are the target figures for a well-optimized streaming pipeline:
| Stage | Target latency | What determines it |
|---|---|---|
| Audio transport (WebRTC) | < 50 ms | Network topology, server proximity |
| STT — first partial token | 100–200 ms | Streaming vs. batch, provider choice |
| LLM — time-to-first-token | 200–400 ms | Model size, infrastructure |
| TTS — first audio | 100–300 ms | Streaming synthesis |
| Total target | < 1 s | End-to-end streaming pipeline |
A sequential pipeline — where each stage waits for the previous one to finish — typically produces 2 to 4 seconds of delay. Unusable in production. The streaming pipeline, which overlaps stages (STT sends partial transcripts to the LLM while the user is still speaking, the LLM sends first tokens to TTS as soon as they're available), is what makes natural conversation possible.
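To make the overlap concrete, here is a minimal asyncio sketch of a streaming pipeline. Every stage is a simulated stand-in (the sleeps and strings are placeholders, not a real provider API); the point is the shape: stages run concurrently and hand off partial results through queues instead of waiting for each other to finish.

```python
import asyncio

async def stt_stage(out_q: asyncio.Queue) -> None:
    """Simulated streaming STT: emits partial transcripts as they arrive."""
    for partial in ["book a", "book a table", "book a table for two"]:
        await asyncio.sleep(0.15)          # stand-in for real audio/STT latency
        await out_q.put(partial)
    await out_q.put(None)                  # end-of-utterance sentinel

async def llm_stage(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    """Receives partials live; streams response tokens out as they are generated."""
    transcript = ""
    while (partial := await in_q.get()) is not None:
        transcript = partial               # keep the freshest partial (a real agent could prefetch here)
    for token in f"Sure, booking: {transcript}.".split():
        await asyncio.sleep(0.05)          # stand-in for per-token LLM latency
        await out_q.put(token)
    await out_q.put(None)

async def tts_stage(in_q: asyncio.Queue) -> None:
    """Synthesizes chunk-by-chunk as tokens arrive, not after the full reply."""
    while (token := await in_q.get()) is not None:
        print(f"TTS >> {token}")           # stand-in for emitting an audio chunk

async def main() -> None:
    transcripts, tokens = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        stt_stage(transcripts),
        llm_stage(transcripts, tokens),
        tts_stage(tokens),
    )

asyncio.run(main())
```

In this toy version the first audio chunk goes out roughly one token after generation starts, instead of after the full reply is written.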
Architecture is the first lever
STT–LLM–TTS Pipeline
This is the current industry standard: three specialized models in sequence.
1. STT (Speech-to-Text): transcribes audio to text
2. LLM: reasons and generates a text response
3. TTS (Text-to-Speech): synthesizes the response into audio
Advantages: modular, debuggable, controllable at each step. Latency can be compressed with good streaming. Downside: the audio→text→audio conversion loses prosodic information (tone, hesitation, emotion).
Realtime speech-to-speech
Some multimodal models (like Gemini Native Audio or GPT-4o Realtime) receive raw audio and return raw audio, without going through intermediate text. Structural latency advantage: fewer hops, fewer conversions. They also pick up the user's tone and emotion.
Downside: less control over each component, harder to optimize cost, and debugging is less transparent.
Half-cascade
An interesting compromise: use a realtime model for input (understanding audio with its prosodic nuances) but route output through a dedicated TTS to maintain voice control. Useful when brand voice identity matters.
Turn detection: the overlooked trigger
Turn detection — the decision about when the user has finished speaking — is the most underestimated link in the latency chain. It's what triggers the entire response pipeline.
A silence timeout set to 800 ms adds 800 ms to every response, before the STT, LLM, or TTS have even started. Over a 10-minute conversation, that's tens of seconds of cumulative delay.
Three main approaches:
VAD-only (Voice Activity Detection): detects audio silence. Simple, but adds latency proportional to the configured threshold. Sensitive to background noise.
STT endpointing: the STT provider emits an end-of-utterance signal based on its own model. Faster than waiting for silence, more informed than VAD alone.
Model-based semantic detection: a classification model reads the partial transcript in real time and predicts whether the user has finished their thought — regardless of silence. Can trigger the response before the silence even ends. This is the most accurate approach for complex cases ("I'd like to make an appointment for... uh... next Thursday").
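As a sketch of the third approach: in production the completeness check would be a small classification model scoring the partial transcript; the heuristic below is a toy stand-in so the example stays self-contained, and the thresholds are illustrative assumptions.

```python
# Toy stand-in for a semantic end-of-turn classifier. A real system would
# run a small model over the partial transcript; the filler list and the
# millisecond thresholds here are illustrative assumptions.

TRAILING_FILLERS = ("uh", "um", "for", "next", "and", "so", "to")

def is_complete(partial_transcript: str) -> bool:
    """Predict whether the user has finished their thought."""
    words = partial_transcript.lower().rstrip(".?!").split()
    return bool(words) and words[-1] not in TRAILING_FILLERS

def should_respond(partial_transcript: str, silence_ms: int) -> bool:
    # Semantically complete utterances trigger the pipeline after a short
    # pause; incomplete ones get a longer grace period to finish.
    threshold_ms = 200 if is_complete(partial_transcript) else 1000
    return silence_ms >= threshold_ms

print(should_respond("I'd like to make an appointment for", 300))     # False: wait
print(should_respond("I'd like an appointment next Thursday", 300))   # True: respond
```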
Interruptions and perceived latency
Interruption handling is the most impactful lever on perceived latency — without touching measured latency.
If a user can naturally interrupt the agent (barge-in), they immediately regain control of the conversation, making the exchange feel more fluid even if the absolute latency stays the same.
The problem: naive barge-in detection relies on VAD, which also triggers on backchannels ("mm-hmm", "yeah"), sighs, background noise, and keyboard sounds. If every noise interrupts the agent, the conversation becomes choppy.
Modern approaches use a dedicated audio model that analyzes acoustic characteristics (waveform shape, onset, prosody) to distinguish a real interruption from an incidental sound — with inference in under 30 ms.
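A heavily simplified illustration of that filtering logic. A real system runs a dedicated audio model on the waveform; the features, word list, and thresholds below are illustrative assumptions, not measured values.

```python
from dataclasses import dataclass

@dataclass
class SpeechEvent:
    duration_ms: int      # how long the user has been speaking
    rms_energy: float     # average signal energy, 0..1 (assumed normalization)
    transcript: str       # partial STT transcript, may be empty

BACKCHANNELS = {"mm-hmm", "mhm", "yeah", "ok", "okay", "right", "uh-huh"}

def is_real_interruption(event: SpeechEvent) -> bool:
    if event.duration_ms < 250:                        # too short: likely a noise burst
        return False
    if event.transcript.lower().strip(".,! ") in BACKCHANNELS:
        return False                                   # acknowledgment: keep talking
    return event.rms_energy > 0.1                      # sustained, energetic speech

# A backchannel should not stop the agent; a real request should.
print(is_real_interruption(SpeechEvent(300, 0.4, "mm-hmm")))          # False
print(is_real_interruption(SpeechEvent(600, 0.5, "wait, actually")))  # True
```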
Practical levers: where to start
1. Monitor before optimizing
Without measurement, any optimization is blind. Key metrics to track:
- `e2e_latency`: total end-to-end latency per conversation turn
- LLM time-to-first-token (TTFT): time before the first response token
- TTS time-to-first-byte (TTFB): time before the first synthesized audio
Identify the dominant bottleneck before trying to optimize.
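A minimal per-turn timing harness, as one way to collect these metrics. The metric names mirror the list above; the `time.sleep` calls are placeholders for the real stage awaits.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class TurnMetrics:
    """Accumulates per-stage wall-clock timings for one conversation turn."""

    def __init__(self) -> None:
        self.timings_ms: dict[str, float] = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] += (time.perf_counter() - start) * 1000

metrics = TurnMetrics()
with metrics.stage("e2e_latency"):
    with metrics.stage("llm_ttft"):
        time.sleep(0.25)          # stand-in for waiting on the first LLM token
    with metrics.stage("tts_ttfb"):
        time.sleep(0.15)          # stand-in for waiting on the first audio byte

# The dominant stage is the one to optimize first.
print(max(metrics.timings_ms.items(), key=lambda kv: kv[1]))
```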
2. Co-locate agent and models
This is the highest-impact lever. An agent hosted in Europe calling models in the US can easily add 200 to 400 ms of unnecessary network latency at every stage.
3. Choose faster models for the dominant stage
Testing a lighter or more recent model on the slow link (often the LLM) usually offers the best effort-to-impact ratio.
4. Clean up tool calling
If the agent calls external tools (business APIs, database queries), these calls block response generation and can add significant, variable latency. Best practices (a sketch of the acknowledgment pattern follows the list):
- Limit the number of tool executions per turn
- Play a "thinking" sound during tool execution
- Verbally acknowledge the user ("Let me check that for you...")
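Here is one way to implement the acknowledgment pattern: start the slow tool call and the filler phrase concurrently, so the user hears something immediately. `say` and `call_crm_api` are hypothetical stand-ins for your TTS output and business API.

```python
import asyncio

async def say(text: str) -> None:
    print(f"agent >> {text}")              # stand-in for streaming TTS output

async def call_crm_api() -> str:
    await asyncio.sleep(1.2)               # stand-in for a slow business API
    return "next slot: Thursday 10:00"

async def handle_tool_turn() -> None:
    # Start the slow call and the acknowledgment concurrently: the user hears
    # something within milliseconds instead of sitting through dead air.
    tool_task = asyncio.create_task(call_crm_api())
    await say("Let me check that for you...")
    result = await tool_task
    await say(f"Found it: {result}.")

asyncio.run(handle_tool_turn())
```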
5. Control context size
A very long system prompt or unpruned conversation history increases input tokens the LLM must process. The impact is generally small, but it's a free lever: keep the prompt concise and summarize long history.
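A simple pruning sketch, assuming a chat-style message list: keep the system prompt and the last N turns, and fold older turns into a one-line summary. `summarize` is a hypothetical stand-in for a cheap summarization call.

```python
MAX_TURNS = 6

def summarize(turns: list[dict]) -> str:
    # Stand-in: a real implementation would call a small, cheap model here.
    return f"(summary of {len(turns)} earlier turns)"

def prune_history(messages: list[dict]) -> list[dict]:
    """Keep system prompt + last MAX_TURNS turns; summarize the rest."""
    system, turns = messages[:1], messages[1:]
    if len(turns) <= MAX_TURNS:
        return messages
    old, recent = turns[:-MAX_TURNS], turns[-MAX_TURNS:]
    return system + [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "system", "content": "You are a booking agent."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(10)]
print(len(prune_history(history)))  # 8: system prompt + summary + last 6 turns
```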
6. Prewarm the VAD
In production, pre-loading the VAD model before sessions arrive avoids a startup delay of several seconds. Non-negotiable for any serious deployment.
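As an example, assuming the open-source Silero VAD loaded via `torch.hub` (the chunk size and call signature can vary between versions, so check yours): load and warm the model once at process startup, not inside the session handler.

```python
import torch

# Prewarm sketch, assuming Silero VAD via torch.hub; verify the expected
# chunk size and call signature for your installed version.
vad_model, _utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad", model="silero_vad"
)

# One dummy inference at startup forces weight loading before real traffic.
_ = vad_model(torch.zeros(1, 512), 16000)   # 512-sample chunk at 16 kHz

def handle_session(audio_chunk: torch.Tensor) -> float:
    # Sessions reuse the already-warm model: no multi-second first-call stall.
    return vad_model(audio_chunk, 16000).item()
```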
In summary
Voice bot latency isn't a single parameter — it's the sum of several delays accumulated across a complex chain. The right approach:
1. Measure: identify the dominant bottleneck with per-stage metrics
2. Co-locate: put agent and models in the same region
3. Stream: ensure every stage (STT, LLM, TTS) streams its partial outputs
4. Tune turn detection: don't let an arbitrary silence timeout penalize every response
5. Handle interruptions properly: perceived latency matters as much as measured latency
There's no silver bullet. But by methodically attacking the dominant bottleneck, it's entirely possible to achieve end-to-end latency under 1 second — and build a voice bot that sounds truly natural.