Pipeline vs. Realtime: Architectural Differences and Latency Impact for Voice Bots

STT–LLM–TTS pipeline or realtime speech-to-speech? Architectural differences, latency impact, landscape of realtime and TTS models in 2026, and a guide to choosing the right approach.

Published on April 30, 2026 | AI Voicebots & Voice Agents

When building a voice bot, the first structural decision is often the least visible: which architecture to adopt? Two major families compete today — the STT–LLM–TTS pipeline and realtime speech-to-speech. This choice determines latency, voice quality, cost, debuggability, and regulatory compliance. There's no universal answer — but there is a rigorous way to find yours.

This article builds on our earlier analysis of voice bot latency and goes further into architectural choices and the landscape of available models in 2026.

The STT–LLM–TTS Pipeline: the reference architecture

How it works

The pipeline chains three specialized models in sequence:

1. STT (Speech-to-Text): transcribes the user's audio into text
2. LLM (Large Language Model): reasons over the text and generates a response
3. TTS (Text-to-Speech): synthesizes the response into audio

Upstream, a VAD (Voice Activity Detection) detects when the user is speaking; downstream, audio transport (WebRTC) delivers the sound to the listener.

Sequential vs. streaming: a 1.5-second difference

In its naive form, each stage waits for the previous one to complete — result: 1.5 to 2 seconds of minimum delay. Unusable in production.
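To see where those seconds come from, here is a minimal sketch of the naive form in Python. The three stage functions are stubs that only simulate typical per-stage delays; a real deployment would call provider SDKs instead:

```python
import time

# Stubs that simulate per-stage latency; swap in real STT/LLM/TTS calls.
def stt_transcribe(audio: bytes) -> str:
    time.sleep(0.4)                    # wait for the final transcript
    return "what are your opening hours"

def llm_complete(prompt: str) -> str:
    time.sleep(0.8)                    # wait for the full completion
    return "We are open from 9 am to 6 pm, Monday to Friday."

def tts_synthesize(text: str) -> bytes:
    time.sleep(0.4)                    # wait for the full audio file
    return b"\x00" * 16000

def handle_turn(user_audio: bytes) -> bytes:
    # Each stage blocks on the previous one, so the delays simply add up.
    text = stt_transcribe(user_audio)
    reply = llm_complete(text)
    return tts_synthesize(reply)

start = time.perf_counter()
handle_turn(b"")
print(f"first audio after {time.perf_counter() - start:.2f} s")  # ~1.6 s
```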

In its streaming form, stages overlap: STT sends partial transcripts to the LLM while the user is still speaking; the LLM sends its first tokens to TTS before it has finished generating; TTS starts synthesizing from the first words it receives. With this overlap, latency drops to 400–800 ms — enough for natural conversation.
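The same idea as a minimal sketch: stub async generators stand in for the LLM token stream and the streaming TTS, and the first audio frame is ready long before the full reply has been generated. The timings are illustrative, not measurements:

```python
import asyncio
import time

async def llm_stream(prompt: str):
    # Stub LLM: yields tokens one by one instead of a finished reply.
    for token in "We are open from 9 am to 6 pm.".split():
        await asyncio.sleep(0.05)      # inter-token delay
        yield token + " "

async def tts_stream(text_chunks):
    # Stub TTS: synthesizes each small text chunk as soon as it arrives.
    async for chunk in text_chunks:
        await asyncio.sleep(0.03)      # per-chunk synthesis delay
        yield b"\x00" * 640            # one audio frame for that chunk

async def main():
    start = time.perf_counter()
    async for frame in tts_stream(llm_stream("opening hours?")):
        # First frame after ~0.08 s, while the LLM is still generating.
        print(f"first audio after {time.perf_counter() - start:.2f} s")
        break

asyncio.run(main())
```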

What gets lost in translation

The audio→text→audio conversion doesn't carry everything. Tone, hesitations, emotion, speech rhythm: everything that doesn't transcribe disappears at the STT stage. The LLM sees only words, and the TTS reads those words back with generated prosody, not the prosody the user actually produced.

Pipeline strengths

  • Modular: each component is independently swappable
  • Debuggable: text is visible at every stage, errors are traceable
  • Flexible: mature tool calling, customizable turn detection, free provider choice
  • Compliant: PII redaction, audit logging, geo-controlled hosting
  • Telephony-compatible: works with 8 kHz telephone network codecs

Realtime speech-to-speech: the unified model

How it works

A single multimodal model receives raw audio from the user and directly produces audio in return. No intermediate transcription, no conversion. The model hears, reasons, and speaks.

The structural latency advantage

Where the pipeline chains three model calls (plus serialization overhead), realtime makes just one. Target latency: 200–400 ms — structurally lower than even a well-optimized pipeline.
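In code, a realtime session boils down to one bidirectional socket carrying audio both ways, with no text in between. The sketch below shows only that shape: the endpoint URL, the message framing, and the `play` helper are placeholders, since every provider defines its own realtime protocol:

```python
import asyncio
import websockets  # pip install websockets

REALTIME_URL = "wss://example.com/v1/realtime"  # hypothetical endpoint

def play(frame: bytes) -> None:
    """Stub: hand the frame to your audio output device."""

async def realtime_session(mic_chunks):
    # One socket, one model call: audio up, audio down.
    async with websockets.connect(REALTIME_URL) as ws:
        async def uplink():
            async for chunk in mic_chunks:   # raw PCM frames from the mic
                await ws.send(chunk)
        send_task = asyncio.create_task(uplink())
        try:
            async for message in ws:         # audio frames coming back
                play(message)
        finally:
            send_task.cancel()
```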

What it captures that a pipeline misses

The model doesn't just hear words — it hears how they're said. Tone, hesitation, emotion, rhythm: all of this informs the response, including its prosody. The exchange sounds more natural, more human.

Realtime limitations

  • Black box: hard to know why the model responded a certain way
  • Limited turn detection: little control over end-of-turn parameters
  • Less mature tool calling: varies by model and provider
  • Variable cost: billed per second of audio, hard to optimize
  • Telephony challenges: models trained on high-quality audio (16–48 kHz) handle 8 kHz telephone networks poorly

Current realtime model landscape

The speech-to-speech market is evolving rapidly. Here are the major players available in April 2026.

GPT-Realtime-1.5 — OpenAI (February 2026)

The highest-performing model for conversational dynamics with 95.7% on Full Duplex Bench and an average latency of ~320 ms end-to-end. Compared to the previous version: +10.23% alphanumeric transcription accuracy, +7% instruction following, +5% Big Bench Audio reasoning. Strength: mature tool calling and the best natural conversation score on the market.

Gemini 2.5 Flash Native Audio — Google

Production latency of ~400 ms, support for more than 70 languages with live voice translation. Score of 71.5% on ComplexFuncBench for multi-step function calls, 90% instruction adherence. Good balance of latency and capabilities.

Gemini 3.1 Flash Live — Google (March 2026)

Launched March 25, 2026, this native audio model is designed for real-time dialogues. It supports more than 90 languages, maintains conversational context twice as long as its predecessor, and better filters background noise (traffic, television). It significantly improves adherence to complex system instructions, even in unpredictable conversations. Accessible via the Live API for developers.

Qwen3-Omni — Alibaba Cloud

Two variants: the standard 30B model and Qwen3.5 Omni Flash Realtime reaching 0.79s to first audio. The Thinker-Talker MoE (Mixture of Experts) architecture achieves a theoretical latency of 234 ms in streaming. Supports 119 written languages, 19 for voice comprehension, 10 for generation. A strong alternative for non-English deployments.

Hume EVI 3 — Hume AI (May 2025)

A radically different approach: EVI 3 is a third-generation speech-language model capable of instantly generating any voice and personality via a prompt, without being limited to predefined speakers. It unifies transcription, comprehension, and voice generation in a single model, and expresses 30 distinct emotions and vocal styles (from "exhilarated" to "whispering"). Its TTS engine Octave interprets emotional cues in a script and delivers them as a human actor would. Particularly suited to high emotional-value conversational experiences.

Step-Audio R1.1

Leader in audio reasoning quality with 97.0% on Big Bench Audio. A model to watch for use cases requiring complex reasoning.

Grok Voice Agent — xAI

Competitive latency of 0.78s to first audio. Natural integration with the xAI ecosystem.

Mistral Voxtral Realtime (February 2026)

Specifically designed for real-time French transcription. A notable choice for French-priority deployments.

Inworld Realtime API

Unlike the others, Inworld uses an optimized STT–LLM–TTS pipeline (500–800 ms) rather than true native speech-to-speech. In return, it offers the best voice quality on the market: its TTS 1.5 Max component leads the Artificial Analysis Speech Arena with an Elo score of 1,236 (March 2026), and it lets you route requests through hundreds of different LLMs.

Dedicated TTS model landscape

For pipeline architectures, TTS model choice is decisive, both for perceived quality and for the time to first byte of audio (TTFB).
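TTFB is also easy to measure yourself: time the gap between sending the text and receiving the first audio chunk. In this sketch, `tts_stream` is a stub standing in for any provider's streaming endpoint, with made-up delays:

```python
import time

def tts_stream(text: str):
    # Stub streaming TTS: a spin-up delay, then 20 ms frames (16 kHz mono).
    time.sleep(0.08)
    for _ in range(10):
        yield b"\x00" * 640
        time.sleep(0.02)

start = time.perf_counter()
for frame in tts_stream("Hello, how can I help?"):
    ttfb_ms = (time.perf_counter() - start) * 1000
    print(f"TTFB: {ttfb_ms:.0f} ms")   # compare against vendor figures
    break
```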

ElevenLabs Turbo v2.5

The market reference for voice quality. TTFB around 75 ms, support for 32 languages, voice cloning, expressive voices. Ideal when audio quality is paramount.

Cartesia Sonic

One of the fastest on the market with a TTFB of ~50 ms. Good overall quality, well-suited for latency-demanding real-time deployments.

OpenAI TTS / gpt-4o-mini-tts

TTFB around 100 ms. The `gpt-4o-mini-tts` variant allows vocal style control via natural language instructions. Easy to integrate for teams already in the OpenAI ecosystem.

Google Neural2 / Chirp HD

Support for more than 40 languages, TTFB ~120 ms. Excellent multilingual coverage, particularly for non-European languages.

Azure Neural TTS

The widest catalog: more than 400 voices in 140 languages. Ideal for international deployments or sectors requiring Microsoft certifications (HIPAA, SOC 2, ISO 27001).

Deepgram Aura

Optimized for telephony with a TTFB of ~50 ms. Designed for calls, less suited to high-fidelity web use cases.

Play.ht

Instant voice cloning, regional accents, good multilingual coverage. TTFB around 150 ms.

Gemini 3.1 Flash TTS — Google (April 2026)

Announced April 14, 2026, this model brings granular control over vocal style via audio tags: mid-sentence adjustment of tone, tempo, and emotion via natural language. Support for more than 70 languages. All generated audio is watermarked with SynthID to prevent disinformation. High-potential model for expressive applications.

Hume Octave

Hume's TTS engine works like a speech-language model: it interprets narrative turns, emotional cues, and character traits in text, then delivers them realistically. More than a TTS, it's an AI "voice actor". Particularly suited to experiences requiring strong emotional charge.

In-depth comparison: Pipeline vs. Realtime

Criterion | STT–LLM–TTS Pipeline | Realtime S2S
--- | --- | ---
Latency | 400–800 ms (streaming) | 200–400 ms
Voice quality | Excellent with best TTS models | Natural, prosodically aware
Prosody / emotion | Limited (audio→text conversion) | Perceived and conveyed
Tool calling | Mature, reliable, structured | Varies by model
Turn detection | Full control, customizable | Model's black box
Debuggability | Text visible at each stage | Audio in / audio out, opaque
Modularity | Total: each component swappable | Locked to the model
Telephony (8 kHz) | Compatible (telephony-optimized STT) | Difficult (trained on 16–48 kHz)
Compliance / GDPR | Granular, PII redaction possible | Centralized, variable residency
Multilingual | Best STT/TTS per language | Depends on the model

The half-cascade: the best of both worlds

You don't have to choose one or the other. Two hybrid configurations are worth attention:

Realtime + dedicated TTS (the "half-cascade"): the realtime model handles input — hears audio, picks up tone, reasons — but outputs text rather than audio. This text is sent to a dedicated TTS (ElevenLabs, Cartesia, Gemini 3.1 Flash TTS...). You keep emotional perception on input and voice control on output.

Realtime + parallel STT: an STT model runs alongside the realtime model to produce a faithful transcript. Useful in regulated sectors that require an auditable transcript.
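To make the first configuration concrete, here is a minimal sketch of the half-cascade, assuming a realtime API that can be configured for text-only output (several expose such a mode); every stream below is a stub standing in for a provider SDK:

```python
import asyncio

async def mic():
    yield b"\x00" * 320                  # one fake microphone frame

async def realtime_text_out(mic_chunks):
    # Stub realtime session configured to emit TEXT instead of audio.
    async for _ in mic_chunks:
        pass                             # the model "listens" to the audio
    yield "We are open from 9 am to 6 pm."

async def dedicated_tts(text_chunks):
    # Stub dedicated TTS: full voice control stays on the output side.
    async for text in text_chunks:
        yield b"\x00" * 16000            # synthesized audio for that chunk

async def main():
    async for frame in dedicated_tts(realtime_text_out(mic())):
        print(f"got {len(frame)} bytes from the dedicated TTS")

asyncio.run(main())
```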

How to choose?

  • Emotional naturalness priority (personal assistant, mental health, coaching) → Realtime or Hume EVI 3
  • Telephony, call center, IVR → Pipeline with telephony-optimized STT
  • Strict compliance (healthcare, finance, public sector) → Pipeline with localized hosting
  • Complex tool calling (calendar, CRM, appointment booking) → Pipeline
  • Critical multilingual → Pipeline with language-specific STT/TTS
  • Rapid prototyping → Realtime (fewer components to assemble)
  • Brand voice control → Pipeline or half-cascade with dedicated TTS

Conclusion

The choice between pipeline and realtime is not a purely technical choice — it's a product choice. It depends on your sector, your users, your compliance requirements, and how much weight you give to voice naturalness.

The model market is evolving at a remarkable pace: GPT-Realtime-1.5, Gemini 3.1 Flash Live, Hume EVI 3, Qwen3-Omni — every quarter brings new players and new benchmarks. Choosing today without anticipating market evolution risks locking yourself into an architecture that no longer delivers on its promises tomorrow.

Versatik guides you through this architectural choice: use case analysis, model recommendations, production deployment, and continuous optimization. Our teams work daily with the latest realtime models and STT–LLM–TTS pipelines, knowing their real-world advantages — beyond the benchmarks.

30 seconds to book 30 minutes

Unsure whether pipeline or realtime is right for your voice bot? Our team can help you make the right call in 30 minutes.

Book a meeting →
