Pipeline vs. Realtime: Architectural Differences and Latency Impact for Voice Bots

STT–LLM–TTS pipeline or realtime speech-to-speech? Architectural differences, latency impact, landscape of realtime and TTS models in 2026, and a guide to choosing the right approach.

Published on April 30, 2026 | AI Voicebots & Voice Agents

When building a voice bot, the first structural decision is often the least visible: which architecture to adopt? Two major families compete today — the STT–LLM–TTS pipeline and realtime speech-to-speech. This choice determines latency, voice quality, cost, debuggability, and regulatory compliance. There's no universal answer — but there is a rigorous way to find yours.

This article builds on our earlier analysis of voice bot latency and goes further into architectural choices and the landscape of available models in 2026.

The STT–LLM–TTS Pipeline: the reference architecture

How it works

The pipeline chains three specialized models in sequence:

1. STT (Speech-to-Text): transcribes the user's audio into text
2. LLM (Large Language Model): reasons over the text and generates a response
3. TTS (Text-to-Speech): synthesizes the response into audio

Upstream, a VAD (Voice Activity Detection) detects when the user is speaking; downstream, audio transport (WebRTC) delivers the sound to the listener.

Sequential vs. streaming: a 1.5-second difference

In its naive form, each stage waits for the previous one to complete — result: 1.5 to 2 seconds of minimum delay. Unusable in production.
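To see where those seconds come from, here is a minimal sketch of the naive form in Python. The three stage functions are stubs that only simulate typical per-stage delays; a real deployment would call provider SDKs instead:

```python
import time

# Stubs that simulate per-stage latency; swap in real STT/LLM/TTS calls.
def stt_transcribe(audio: bytes) -> str:
    time.sleep(0.4)                    # wait for the final transcript
    return "what are your opening hours"

def llm_complete(prompt: str) -> str:
    time.sleep(0.8)                    # wait for the full completion
    return "We are open from 9 am to 6 pm, Monday to Friday."

def tts_synthesize(text: str) -> bytes:
    time.sleep(0.4)                    # wait for the full audio file
    return b"\x00" * 16000

def handle_turn(user_audio: bytes) -> bytes:
    # Each stage blocks on the previous one, so the delays simply add up.
    text = stt_transcribe(user_audio)
    reply = llm_complete(text)
    return tts_synthesize(reply)

start = time.perf_counter()
handle_turn(b"")
print(f"first audio after {time.perf_counter() - start:.2f} s")  # ~1.6 s
```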

In its streaming form, stages overlap: STT sends partial transcripts to the LLM while the user is still speaking; the LLM sends its first tokens to TTS before it has finished generating; TTS starts synthesizing from the first words it receives. With this overlap, latency drops to 400–800 ms — enough for natural conversation.
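The same idea as a minimal sketch: stub async generators stand in for the LLM token stream and the streaming TTS, and the first audio frame is ready long before the full reply has been generated. The timings are illustrative, not measurements:

```python
import asyncio
import time

async def llm_stream(prompt: str):
    # Stub LLM: yields tokens one by one instead of a finished reply.
    for token in "We are open from 9 am to 6 pm.".split():
        await asyncio.sleep(0.05)      # inter-token delay
        yield token + " "

async def tts_stream(text_chunks):
    # Stub TTS: synthesizes each small text chunk as soon as it arrives.
    async for chunk in text_chunks:
        await asyncio.sleep(0.03)      # per-chunk synthesis delay
        yield b"\x00" * 640            # one audio frame for that chunk

async def main():
    start = time.perf_counter()
    async for frame in tts_stream(llm_stream("opening hours?")):
        # First frame after ~0.08 s, while the LLM is still generating.
        print(f"first audio after {time.perf_counter() - start:.2f} s")
        break

asyncio.run(main())
```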

What gets lost in translation

The audio→text→audio conversion doesn't carry everything. Tone, hesitations, emotion, speech rhythm: everything that doesn't transcribe disappears at the STT stage. The LLM sees only words, and the TTS reads those words back with generated prosody, not the prosody the user actually produced.

Pipeline strengths

  • Modular: each component is independently swappable
  • Debuggable: text is visible at every stage, errors are traceable
  • Flexible: mature tool calling, customizable turn detection, free provider choice
  • Compliant: PII redaction, audit logging, geo-controlled hosting
  • Telephony-compatible: works with 8 kHz telephone network codecs

Realtime speech-to-speech: the unified model

How it works

A single multimodal model receives raw audio from the user and directly produces audio in return. No intermediate transcription, no conversion. The model hears, reasons, and speaks.

The structural latency advantage

Where the pipeline chains three model calls (plus serialization overhead), realtime makes just one. Target latency: 200–400 ms — structurally lower than even a well-optimized pipeline.
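In code, a realtime session boils down to one bidirectional socket carrying audio both ways, with no text in between. The sketch below shows only that shape: the endpoint URL, the message framing, and the `play` helper are placeholders, since every provider defines its own realtime protocol:

```python
import asyncio
import websockets  # pip install websockets

REALTIME_URL = "wss://example.com/v1/realtime"  # hypothetical endpoint

def play(frame: bytes) -> None:
    """Stub: hand the frame to your audio output device."""

async def realtime_session(mic_chunks):
    # One socket, one model call: audio up, audio down.
    async with websockets.connect(REALTIME_URL) as ws:
        async def uplink():
            async for chunk in mic_chunks:   # raw PCM frames from the mic
                await ws.send(chunk)
        send_task = asyncio.create_task(uplink())
        try:
            async for message in ws:         # audio frames coming back
                play(message)
        finally:
            send_task.cancel()
```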

What it captures that a pipeline misses

The model doesn't just hear words — it hears how they're said. Tone, hesitation, emotion, rhythm: all of this informs the response, including its prosody. The exchange sounds more natural, more human.

Realtime limitations

  • Black box: hard to know why the model responded a certain way
  • Limited turn detection: little control over end-of-turn parameters
  • Less mature tool calling: varies by model and provider
  • Variable cost: billed per second of audio, hard to optimize
  • Telephony challenges: models trained on high-quality audio (16–48 kHz) handle 8 kHz telephone networks poorly

Current realtime model landscape

The speech-to-speech market is evolving rapidly. Here are the major players available in April 2026.

GPT-Realtime-1.5 — OpenAI (February 2026)

The highest-performing model for conversational dynamics with 95.7% on Full Duplex Bench and an average latency of ~320 ms end-to-end. Compared to the previous version: +10.23% alphanumeric transcription accuracy, +7% instruction following, +5% Big Bench Audio reasoning. Strength: mature tool calling and the best natural conversation score on the market.

Gemini 2.5 Flash Native Audio — Google

Production latency of ~400 ms, support for more than 70 languages with live voice translation. Score of 71.5% on ComplexFuncBench for multi-step function calls, 90% instruction adherence. Good balance of latency and capabilities.

Gemini 3.1 Flash Live — Google (March 2026)

Launched March 25, 2026, this native audio model is designed for real-time dialogues. It supports more than 90 languages, maintains conversational context twice as long as its predecessor, and better filters background noise (traffic, television). It significantly improves adherence to complex system instructions, even in unpredictable conversations. Accessible via the Live API for developers.

Qwen3-Omni — Alibaba Cloud

Two variants: the standard 30B model and Qwen3.5 Omni Flash Realtime reaching 0.79s to first audio. The Thinker-Talker MoE (Mixture of Experts) architecture achieves a theoretical latency of 234 ms in streaming. Supports 119 written languages, 19 for voice comprehension, 10 for generation. A strong alternative for non-English deployments.

Hume EVI 3 — Hume AI (May 2025)

A radically different approach: EVI 3 is a third-generation speech-language model capable of instantly generating any voice and personality via a prompt, without being limited to predefined speakers. It unifies transcription, comprehension, and voice generation in a single model, and expresses 30 distinct emotions and vocal styles (from "exhilarated" to "whispering"). Its TTS engine Octave interprets emotional cues in a script and delivers them as a human actor would. Particularly suited to high emotional-value conversational experiences.

Step-Audio R1.1

Leader in audio reasoning quality with 97.0% on Big Bench Audio. A model to watch for use cases requiring complex reasoning.

Grok Voice Agent — xAI

Competitive latency of 0.78s to first audio. Natural integration with the xAI ecosystem.

Mistral Voxtral Realtime (February 2026)

Specifically designed for real-time French transcription. A notable choice for French-priority deployments.

Inworld Realtime API

Unlike the others, Inworld uses an optimized STT–LLM–TTS pipeline (500–800 ms) rather than true native speech-to-speech. In return, it offers the best voice quality on the market: its TTS 1.5 Max component leads the Artificial Analysis Speech Arena with an Elo score of 1,236 (March 2026), and it lets you route requests through hundreds of different LLMs.

Dedicated TTS model landscape

For pipeline architectures, TTS model choice is decisive, both for perceived quality and for the time to first byte of audio (TTFB).
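TTFB is also easy to measure yourself: time the gap between sending the text and receiving the first audio chunk. In this sketch, `tts_stream` is a stub standing in for any provider's streaming endpoint, with made-up delays:

```python
import time

def tts_stream(text: str):
    # Stub streaming TTS: a spin-up delay, then 20 ms frames (16 kHz mono).
    time.sleep(0.08)
    for _ in range(10):
        yield b"\x00" * 640
        time.sleep(0.02)

start = time.perf_counter()
for frame in tts_stream("Hello, how can I help?"):
    ttfb_ms = (time.perf_counter() - start) * 1000
    print(f"TTFB: {ttfb_ms:.0f} ms")   # compare against vendor figures
    break
```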

ElevenLabs Turbo v2.5

The market reference for voice quality. TTFB around 75 ms, support for 32 languages, voice cloning, expressive voices. Ideal when audio quality is paramount.

Cartesia Sonic

One of the fastest on the market with a TTFB of ~50 ms. Good overall quality, well-suited for latency-demanding real-time deployments.

OpenAI TTS / gpt-4o-mini-tts

TTFB around 100 ms. The `gpt-4o-mini-tts` variant allows vocal style control via natural language instructions. Easy to integrate for teams already in the OpenAI ecosystem.

Google Neural2 / Chirp HD

Support for more than 40 languages, TTFB ~120 ms. Excellent multilingual coverage, particularly for non-European languages.

Azure Neural TTS

The widest catalog: more than 400 voices in 140 languages. Ideal for international deployments or sectors requiring Microsoft certifications (HIPAA, SOC 2, ISO 27001).

Deepgram Aura

Optimized for telephony with a TTFB of ~50 ms. Designed for calls, less suited to high-fidelity web use cases.

Play.ht

Instant voice cloning, regional accents, good multilingual coverage. TTFB around 150 ms.

Gemini 3.1 Flash TTS — Google (April 2026)

Announced April 14, 2026, this model brings granular control over vocal style via audio tags: mid-sentence adjustment of tone, tempo, and emotion via natural language. Support for more than 70 languages. All generated audio is watermarked with SynthID to prevent disinformation. High-potential model for expressive applications.

Hume Octave

Hume's TTS engine works like a speech-language model: it interprets narrative turns, emotional cues, and character traits in text, then delivers them realistically. More than a TTS, it's an AI "voice actor". Particularly suited to experiences requiring strong emotional charge.

In-depth comparison: Pipeline vs. Realtime

Criterion | STT–LLM–TTS Pipeline | Realtime S2S
--- | --- | ---
Latency | 400–800 ms (streaming) | 200–400 ms
Voice quality | Excellent with best TTS models | Natural, prosodically aware
Prosody / emotion | Limited (audio→text conversion) | Perceived and conveyed
Tool calling | Mature, reliable, structured | Varies by model
Turn detection | Full control, customizable | Model's black box
Debuggability | Text visible at each stage | Audio in / audio out, opaque
Modularity | Total: each component swappable | Locked to the model
Telephony (8 kHz) | Compatible (telephony-optimized STT) | Difficult (trained on 16–48 kHz)
Compliance / GDPR | Granular, PII redaction possible | Centralized, variable residency
Multilingual | Best STT/TTS per language | Depends on the model

The half-cascade: the best of both worlds

You don't have to choose one or the other. Two hybrid configurations are worth attention:

Realtime + dedicated TTS (the "half-cascade"): the realtime model handles input — hears audio, picks up tone, reasons — but outputs text rather than audio. This text is sent to a dedicated TTS (ElevenLabs, Cartesia, Gemini 3.1 Flash TTS...). You keep emotional perception on input and voice control on output.

Realtime + parallel STT: an STT model runs alongside the realtime model to produce a faithful transcript. Useful in regulated sectors that require an auditable transcript.
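To make the first configuration concrete, here is a minimal sketch of the half-cascade, assuming a realtime API that can be configured for text-only output (several expose such a mode); every stream below is a stub standing in for a provider SDK:

```python
import asyncio

async def mic():
    yield b"\x00" * 320                  # one fake microphone frame

async def realtime_text_out(mic_chunks):
    # Stub realtime session configured to emit TEXT instead of audio.
    async for _ in mic_chunks:
        pass                             # the model "listens" to the audio
    yield "We are open from 9 am to 6 pm."

async def dedicated_tts(text_chunks):
    # Stub dedicated TTS: full voice control stays on the output side.
    async for text in text_chunks:
        yield b"\x00" * 16000            # synthesized audio for that chunk

async def main():
    async for frame in dedicated_tts(realtime_text_out(mic())):
        print(f"got {len(frame)} bytes from the dedicated TTS")

asyncio.run(main())
```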

How to choose?

  • Emotional naturalness priority (personal assistant, mental health, coaching) → Realtime or Hume EVI 3
  • Telephony, call center, IVR → Pipeline with telephony-optimized STT
  • Strict compliance (healthcare, finance, public sector) → Pipeline with localized hosting
  • Complex tool calling (calendar, CRM, appointment booking) → Pipeline
  • Critical multilingual → Pipeline with language-specific STT/TTS
  • Rapid prototyping → Realtime (fewer components to assemble)
  • Brand voice control → Pipeline or half-cascade with dedicated TTS

Conclusion

The choice between pipeline and realtime is not a purely technical choice — it's a product choice. It depends on your sector, your users, your compliance requirements, and how much weight you give to voice naturalness.

The model market is evolving at a remarkable pace: GPT-Realtime-1.5, Gemini 3.1 Flash Live, Hume EVI 3, Qwen3-Omni — every quarter brings new players and new benchmarks. Choosing today without anticipating market evolution risks locking yourself into an architecture that no longer delivers on its promises tomorrow.

Versatik guides you through this architectural choice: use case analysis, model recommendations, production deployment, and continuous optimization. Our teams work daily with the latest realtime models and STT–LLM–TTS pipelines, knowing their real-world advantages — beyond the benchmarks.

30 seconds to book 30 minutes

Unsure whether pipeline or realtime is right for your voice bot? Our team can help you make the right call in 30 minutes.

Book a meeting →
