
Advantages of Speech-to-Speech Technologies for Voicebots and AI Voice Agents

Updated March 11, 2026 | Voicebots

Speech-to-speech (STS) technology represents a significant advancement in the development of voicebots and AI voice agents. Unlike traditional architectures that chain together speech recognition (ASR), language processing (LLM), and speech synthesis (TTS), native STS models directly process voice signals within a single model, preserving emotional and conversational nuances while reducing latency. OpenAI, Google, xAI, Hume AI, and Kyutai have launched genuine end-to-end STS solutions, while others like Deepgram offer optimized voice-to-voice pipelines that remain performant but are architecturally distinct.

Traditional Voicebot Architecture and Its Limitations

Conventional voicebots operate through a sequential multi-step chain:

  • Automatic Speech Recognition (ASR/STT): converting user speech to text (100–500 ms)
  • Language model processing (LLM): understanding intent and generating a text response (350 ms–1 s+)
  • Speech synthesis (TTS): converting the text response to audio (75–200 ms)

This pipeline accumulates a typical total latency of 800 ms to 2 seconds, to which network and processing times must be added. Yet natural human conversation requires responses within a 300 to 500 ms window to feel fluid. This structural constraint creates perceptible pauses and a mechanical feel.
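The accumulation of per-stage latency can be sketched directly from the ranges above. The following snippet is purely illustrative (the stage names and the helper function are ours, not any vendor's API); it sums the best- and worst-case figures quoted for each stage and compares them to the 300–500 ms conversational window:

```python
# Illustrative latency budget for a sequential ASR -> LLM -> TTS pipeline,
# using the per-stage ranges quoted above (values in milliseconds).

STAGES_MS = {
    "asr": (100, 500),    # speech recognition
    "llm": (350, 1000),   # language model processing
    "tts": (75, 200),     # speech synthesis
}

def pipeline_latency_ms(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Best- and worst-case end-to-end latency for a sequential pipeline."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst

best, worst = pipeline_latency_ms(STAGES_MS)
print(f"pipeline: {best}-{worst} ms")   # 525-1700 ms, before network overhead
print(f"feels fluid? {worst <= 500}")   # the natural window is ~300-500 ms
```

Even before adding network round trips, the worst case lands far outside the window where a response feels immediate.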

Beyond latency, each intermediate conversion causes information loss: emotional tone, rhythm, hesitations, emphasis, and individual vocal characteristics are lost as soon as speech is transcribed to text. The LLM reasons on flat text, stripped of all paralinguistic context.

Native Speech-to-Speech Architecture: A Paradigm Shift

Native STS technology eliminates these intermediate steps by using a single model that receives audio directly as input and generates audio directly as output. No intermediate text conversion is required for the main conversational flow.

At the core of this architecture, multimodal neural networks simultaneously analyze:

  • Acoustic patterns: recognition of sounds, words, and linguistic structures
  • Intonation and prosody: pitch, rhythm, rate, stress
  • Emotional markers: frustration, joy, hesitation, urgency
  • Semantic content: meaning and intent of the message

This direct audio β†’ audio transformation preserves aspects of communication typically lost during text conversion: emotional tone, each speaker's unique characteristics, natural speech rhythm, and subtle conversational nuances.

Latency Reduction and Improved Conversational Flow

The most immediately measurable advantage of native STS is latency reduction. By eliminating multiple conversion steps, a single STS model can respond considerably faster than a traditional pipeline.

Current benchmarks show significant results:

  • xAI Grok Voice Agent: reliable latency under 700 ms, with an average time to first audio token of 780 ms on the Big Bench Audio benchmark
  • OpenAI Realtime API: time to first voice (TTFV) of 450–900 ms after end of user speech, first text token in 180–300 ms
  • Hume AI EVI 3: inference latency as low as 300 ms
  • Kyutai Moshi: theoretical latency of 160 ms thanks to its full-duplex architecture

For comparison, an optimized STT+LLM+TTS pipeline typically achieves 500–1,260 ms under the best conditions, and often 800 ms–2 s in real-world conditions. Some heavily optimized pipelines (e.g., PolarGrid) have recently demonstrated end-to-end latency as low as 364 ms, but this requires extreme optimization of each component.

These improvements translate directly into more fluid conversations. Responses arrive at the naturally expected moment, creating an exchange dynamic close to human conversation.

Preservation of Emotional Nuances and Natural Expression

The most distinctive advantage of STS technology is its ability to preserve and reproduce emotional nuances. Traditional systems that convert speech to text inevitably lose paralinguistic characteristics β€” tone, pitch, rhythm, and emphasis β€” which often convey as much meaning as the words themselves.

STS technology maintains the acoustic signal throughout the processing chain. The xAI Grok Voice Agent illustrates this capability: it "understands the expressive range of human speech and can generate correspondingly expressive responses; it can laugh, whisper, and sigh". Similarly, Hume AI EVI 3 analyzes the tone, rhythm, and timbre of the user's voice to detect emotional cues and respond with appropriate emotional expressions.

In a blind test involving 1,720 participants, EVI 3 outperformed OpenAI's GPT-4o across seven measured dimensions, including emotional expression, naturalness, voice quality, response speed, and interruption handling.

This emotional intelligence allows STS voicebots to recognize a frustrated customer and automatically adapt their tone to be more soothing, or detect urgency in a request and speed up processing β€” behaviors impossible with an intermediate text pipeline.

Enhanced Management of Conversational Dynamics

Human conversations are characterized by complex interaction patterns: interruptions, overlapping speech, hesitations, mid-sentence corrections, and meaningful silences. Traditional voicebots struggle to handle these dynamics.

Interruptions and Barge-in

Native STS systems can detect that a user has resumed speaking and immediately interrupt their response to listen. OpenAI's Realtime API and xAI's Grok Voice Agent natively support this "barge-in" behavior, establishing natural turn-taking. Kyutai Moshi goes even further with its full-duplex architecture: it can listen and speak simultaneously on two parallel audio streams, modeling natural speech overlap.
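The turn-taking logic behind barge-in can be reduced to a small state machine: whenever voice activity is detected while the agent is mid-response, playback is cancelled and the agent returns to listening. This is a minimal sketch in our own terms; the event names are illustrative and do not correspond to any particular vendor's API:

```python
from enum import Enum

class TurnState(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"

class BargeInAgent:
    """Minimal barge-in sketch: if the user starts speaking while the agent
    is mid-response, the agent cancels playback and returns to listening."""

    def __init__(self):
        self.state = TurnState.LISTENING
        self.cancelled_responses = 0

    def on_agent_response_start(self):
        self.state = TurnState.SPEAKING

    def on_user_speech_detected(self):
        # VAD fired while the agent was talking -> barge-in: stop and listen.
        if self.state is TurnState.SPEAKING:
            self.cancelled_responses += 1
        self.state = TurnState.LISTENING

agent = BargeInAgent()
agent.on_agent_response_start()
agent.on_user_speech_detected()    # user interrupts mid-response
print(agent.state.value)           # listening
print(agent.cancelled_responses)   # 1
```

Full-duplex systems like Moshi go beyond this model by not enforcing a single active speaker at all; the sketch above captures only the half-duplex barge-in behavior.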

Conversational Context

Native STS models maintain richer conversational context because they have never reduced information to flat text. Variations in tone, energy, and rhythm across the entire conversation inform each response. Google Gemini Live API supports "proactive listening" β€” the model knows when to speak and when to stay silent. Hume EVI 3 integrates real-time data (web search, tools) into the conversation without interrupting the natural flow of dialogue.

Native Speech-to-Speech Players in 2026

OpenAI β€” Realtime API

OpenAI was one of the first to democratize native STS with its Realtime API, launched in preview in October 2024 then GA in August 2025. The GPT-4o model natively processes audio input and output via persistent WebSocket connections, enabling continuous streaming without intermediate text conversion.

  • Architecture: single multimodal model (GPT-4o), WebSocket/WebRTC
  • Latency: TTFV 450–900 ms, median ~1,355 ms in real-world conditions
  • Capabilities: function calling, interruption handling (VAD), preset voices
  • Availability: OpenAI directly + Azure AI Foundry
  • Notable limitation: latency increases on long sessions (60 turns: median 3.4 s)
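A Realtime session is driven by JSON events sent over the WebSocket. The sketch below builds the typical client-side sequence (configure the session, stream a base64-encoded audio chunk, request a response) without opening a connection. The event names follow OpenAI's published Realtime API, but exact field layouts vary by API version, so verify against the current reference before relying on this:

```python
import base64
import json

def session_update(voice: str, instructions: str) -> str:
    # Configure voice and system instructions for the session.
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice, "instructions": instructions},
    })

def append_audio(pcm_chunk: bytes) -> str:
    # Audio is streamed base64-encoded inside JSON events.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def request_response() -> str:
    # Ask the model to generate an audio reply.
    return json.dumps({"type": "response.create"})

# These strings would be sent over the WebSocket in order:
events = [
    session_update("alloy", "You are a concise support agent."),
    append_audio(b"\x00\x01" * 160),   # placeholder PCM bytes, not real audio
    request_response(),
]
print([json.loads(e)["type"] for e in events])
```

In practice the server's own events (speech detection, audio deltas) arrive interleaved on the same connection, which is what enables the VAD-based interruption handling listed above.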

Google β€” Gemini Live API (Native Audio)

Google launched the Gemini Live API with the Gemini 2.5 Flash Native Audio model (preview May 2025, updates in September and December 2025). A single unified model directly processes audio input and generates audio output, eliminating separate STT/TTS conversions.

  • Architecture: unified native audio model, WebSocket
  • Distinctive capabilities: Affective dialog (emotional adaptation), Proactive audio (intelligent listening), function calling with real-time Google Search
  • Languages: 24+ languages, 30+ distinct voices
  • Availability: Google AI Studio, Vertex AI
  • Notable limitation: latency that degrades on long audio sessions, reported by several developers

xAI β€” Grok Voice Agent API

Launched in December 2025, xAI's Grok Voice Agent API quickly became the benchmark leader in speech-to-speech. The same model powering Grok Voice Mode and Tesla cars is now accessible to developers via a WebSocket API.

  • Architecture: integrated speech-to-speech model, full-duplex WebSocket
  • Performance: 92.3% on Big Bench Audio (top score), average TTFT of 780 ms, reliable latency < 700 ms
  • Price: $0.05/minute ($3/hour), symmetric input/output pricing
  • Capabilities: function calling (web search, RAG, custom JSON tools), SIP telephony support (Twilio, Vonage), 100+ languages, 5 voices
  • Key advantage: very competitive performance/price ratio with integrated telephony support

Hume AI β€” EVI 3 (Empathic Voice Interface)

Hume AI launched EVI 3 in May 2025, a speech-language model specialized in emotional intelligence. EVI 3 integrates transcription, language understanding, and speech synthesis in a unified system.

  • Architecture: native speech-language model, interoperable with external LLMs (Claude, Gemini, DeepSeek, Llama)
  • Latency: ~300 ms inference
  • Emotional intelligence: analyzes tone, rhythm, and timbre to adapt responses emotionally
  • Customization: 100,000+ custom voices, generate a new voice in < 1 second from a text prompt, 30+ emotional styles
  • Voice cloning: from 30 seconds of audio
  • Price: from $0.02/min at scale

Kyutai Labs β€” Moshi

Kyutai Labs developed Moshi, presented as the "first real-time full-duplex spoken language model". Unlike other solutions, Moshi natively handles simultaneous speech: it can listen and speak at the same time on two parallel audio streams.

  • Architecture: end-to-end speech-to-speech with "Inner Monologue" (predicting text tokens aligned before audio tokens)
  • Latency: 160 ms theoretical
  • Open source: available on GitHub under open-source license
  • Languages: bilingual French/English
  • Availability: self-hosted, Scaleway, third-party APIs
  • Limitation: research/demonstration project, no turnkey enterprise offering

Sesame AI β€” CSM (Conversational Speech Model)

Sesame AI caused a sensation in early 2025 with its CSM model, open-source under the Apache 2.0 license. CSM does not position itself as a complete conversational STS model but as an ultra-realistic contextual speech generation model β€” listeners struggle to distinguish the generated voice from a human voice.

  • Architecture: multimodal model based on Llama 3.2, operates on RVQ audio tokens
  • Specificity: contextual awareness β€” adapts tone, rhythm, and expressiveness based on conversational history
  • Open source: 1B parameters, available on Hugging Face
  • Positioning: advanced contextual TTS component, not a complete conversational agent

Optimized Pipeline Solutions (non-native STS)

It is essential to distinguish true native STS models from optimized voice-to-voice pipelines that unify STT + LLM + TTS in a single API without using a single model.

Deepgram β€” Voice Agent API

Deepgram announced in February 2025 a "key milestone" in developing an STS architecture, but specified "when fully operationalized, this architecture will be delivered to customers" β€” it is not yet an available product.

The product actually commercialized is the Voice Agent API (GA June 2025), explicitly described as "combining speech-to-text, text-to-speech, and large language model (LLM) orchestration". It is an optimized pipeline using Nova-3 (STT) and Aura-2 (TTS) with an LLM of your choice (BYOM β€” Bring Your Own Model).

  • Strengths: enterprise-ready, HIPAA-compliant, BYOM, $4.50/hour all-inclusive
  • Flux model (Oct. 2025): conversational speech recognition model (CSR), not STS
  • Real positioning: best unified pipeline on the market for enterprise, but architecturally distinct from native STS

ElevenLabs β€” Speech-to-Speech (voice conversion)

ElevenLabs' "speech-to-speech" feature is voice conversion: it transforms a source voice into a target voice while preserving the content. This is not conversational STS for voicebots β€” it is a voice changer, useful for dubbing, content creation, and voice anonymization.

Fixie AI β€” Ultravox

Ultravox is an open-source multimodal LLM that directly understands speech without a separate ASR step, built on Llama 3.3 70B. Currently in audio-in, text-out mode β€” audio token output generation is in development. A promising building block for the future, but not yet a complete STS solution.

Comparative Table of Solutions (March 2026)

| Solution | Type | Native STS | Latency | Price | Languages | Strengths | Limitations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| xAI Grok Voice | Native STS | Yes | < 700 ms | $0.05/min | 100+ | #1 benchmark, SIP telephony, function calling | xAI ecosystem, 5 voices |
| OpenAI Realtime | Native STS | Yes | TTFV 450–900 ms | ~$0.06/min | Multi | Market pioneer, large ecosystem, Azure | Increasing latency on long sessions |
| Google Gemini Live | Native STS (Native Audio) | Yes | Variable | $3/$12 per M tokens | 24+ | Affective dialog, proactive audio, integrated Google Search | Reported unstable latency |
| Hume AI EVI 3 | Speech-language model | Yes | ~300 ms | $0.02/min | Multi | Emotional intelligence, 100K+ custom voices, voice cloning | Young platform |
| Kyutai Moshi | Native full-duplex STS | Yes | 160 ms theoretical | Free (open-source) | FR/EN | Full-duplex, open-source, ultra-low latency | Research, not enterprise-ready |
| Sesame CSM | Contextual TTS | Partial | N/A | Free (open-source) | EN | Unmatched vocal realism, contextual | Advanced TTS, not a complete STS agent |
| Deepgram Voice Agent | STT+LLM+TTS pipeline | No | ~500–1,000 ms | $4.50/h | Multi | Enterprise/HIPAA, BYOM, controlled cost | No true STS, pipeline |
| ElevenLabs STS | Voice conversion | No | N/A | Variable | 70+ | Exceptional voice quality | Voice conversion, not conversational |
| Fixie Ultravox | Audio-in → text-out | Partial | N/A | Free (open-source) | Multi | Native audio understanding, open-source | No audio output yet |

Native STS vs. Optimized Pipeline: Strategic Analysis

Advantages of Native STS

  • Structurally reduced latency: a single model eliminates round trips between components. The best native STS systems achieve 160–700 ms vs. 500–1,260 ms for optimized pipelines.
  • Emotional preservation: the acoustic signal passes through the model without loss of paralinguistic information.
  • Natural interruption handling: models can detect and react to interruptions in real time without waiting for a processing step to complete.
  • Integration simplicity: a single API, a single model, no multi-component pipeline to orchestrate.

Advantages of Optimized Pipelines

  • Controllability: each component (STT, LLM, TTS) is inspectable, debuggable, and independently adjustable.
  • Model selection flexibility: ability to combine the best components from each category (BYOM). You can change the LLM without touching the STT/TTS.
  • Enterprise maturity: regulatory compliance (HIPAA, GDPR), on-premise deployment, auditability of intermediate decisions.
  • Predictable cost: transparent, flat hourly pricing ($4.50/h all-inclusive at Deepgram, vs. $3+/h with usage-dependent variability for native STS)
  • Intermediate transcription: the transcribed text is available for logging, analysis, compliance, and training.
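The controllability, BYOM, and transcription points above all follow from the same property: each stage sits behind a swappable interface, and the intermediate text is available to log. A minimal sketch (the component and class names are hypothetical, with trivial stubs standing in for real STT/LLM/TTS services such as Nova-3 or Aura-2):

```python
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def reply(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """STT + LLM + TTS behind one interface: any stage can be swapped (BYOM),
    and the intermediate transcript is retained for logging and audit."""

    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts
        self.transcript: list[tuple[str, str]] = []   # (user, agent) pairs

    def turn(self, audio_in: bytes) -> bytes:
        user_text = self.stt.transcribe(audio_in)
        agent_text = self.llm.reply(user_text)
        self.transcript.append((user_text, agent_text))  # audit trail
        return self.tts.synthesize(agent_text)

# Trivial stubs so the sketch runs end to end without any external service:
class EchoSTT:
    def transcribe(self, audio: bytes) -> str: return audio.decode()

class UpperLLM:
    def reply(self, text: str) -> str: return text.upper()

class BytesTTS:
    def synthesize(self, text: str) -> bytes: return text.encode()

pipe = VoicePipeline(EchoSTT(), UpperLLM(), BytesTTS())
out = pipe.turn(b"hello")
print(out)               # b'HELLO'
print(pipe.transcript)   # [('hello', 'HELLO')]
```

Replacing `UpperLLM` with a fine-tuned or RAG-backed model requires no change to the STT or TTS stages, which is exactly the flexibility a native STS model cannot offer.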

When to Choose What?

| Criterion | Native STS recommended | Pipeline recommended |
| --- | --- | --- |
| Latency priority | Latency < 500 ms critical | Latency < 1.5 s acceptable |
| Vocal empathy | Emotional detection required | Standardized responses sufficient |
| Compliance | Low regulatory requirements | HIPAA, GDPR, auditability required |
| LLM customization | Integrated model sufficient | Specific LLM required (fine-tuned, RAG) |
| Budget | Flexible budget | Cost optimization priority |
| Use case | Empathetic customer service, healthcare, education | Transactional automation, structured helpdesk |
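The decision criteria above can be condensed into a simple rule of thumb. The helper below is a hypothetical illustration of that logic (hard pipeline requirements first, then latency and empathy needs), not a substitute for a real architecture review:

```python
def recommend_architecture(
    latency_budget_ms: int,
    needs_emotion: bool,
    needs_compliance_audit: bool,
    needs_custom_llm: bool,
) -> str:
    """Suggest 'native-sts' or 'pipeline' from the decision criteria above."""
    # Hard requirements for an inspectable, auditable pipeline win first.
    if needs_compliance_audit or needs_custom_llm:
        return "pipeline"
    # Sub-500 ms budgets or vocal empathy point to native STS.
    if latency_budget_ms < 500 or needs_emotion:
        return "native-sts"
    return "pipeline"

print(recommend_architecture(400, True, False, False))    # native-sts
print(recommend_architecture(1500, False, True, False))   # pipeline
```

Real projects mix these criteria, of course; the ordering here simply encodes that compliance and LLM-customization constraints are usually non-negotiable, while latency and empathy are matters of degree.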

Business Benefits and Application Scenarios

Customer Satisfaction and Engagement

Native STS voicebots significantly improve user experience through more natural and responsive conversations. Hume AI's tests show that users strongly prefer native STS voices over traditional systems on expressiveness, naturalness, and interaction quality. Reducing latency below the 500 ms threshold eliminates the awkward pauses that cause users to abandon conversations.

Operational Efficiency

More natural voicebots can handle a wider range of interactions without human escalation, increasing first-contact resolution rates. Native interruption handling and conversational context management allow complex scenarios to be handled that would previously have required a human agent.

Priority Application Scenarios

  • Customer service: handling requests with emotional detection and tone adaptation (frustration β†’ de-escalation)
  • Healthcare: appointment scheduling, medication reminders, symptom triage with vocal empathy
  • Finance: account information, transaction processing, advisory with natural interruption handling
  • Education: voice tutoring with adaptation to the learner's pace
  • Outbound telephony: calling campaigns where voice naturalness determines conversion rates
  • Vehicle assistants: low-latency hands-free interactions (xAI/Tesla model)

Versatik's Positioning

At Versatik, we offer high-performance voicebots for inbound call reception and outbound calling campaigns, leveraging the best available technologies to deliver automated voice interactions that approach human conversation.

Our approach integrates advances in native speech-to-speech while maintaining the robustness of proven enterprise production architectures. Depending on each client's specific needs β€” latency, compliance, customization, budget β€” we deploy the optimal architecture:

  • Native STS (OpenAI Realtime, Google Gemini Live, xAI Grok, Hume EVI) for use cases requiring maximum reactivity and vocal empathy
  • Optimized pipeline for scenarios with strong requirements for control, compliance, or LLM customization

Our voicebots significantly reduce the typical latency of voice processing, enabling conversations that flow naturally. For inbound call reception, our technology provides immediate and natural responses with emotional detection. In outbound applications, our voicebots conduct conversations that callers struggle to distinguish from human communication.

By combining native STS technologies and optimized pipelines depending on the use case, Versatik enables businesses to gain competitive advantage through superior customer experiences, increased operational efficiency, and better resolution rates for automated interactions.
