
GPT Realtime: Speech-to-Speech and OpenAI's Voice AI Redefine Conversational Agents

September 3, 2025

OpenAI unveils gpt-realtime: a revolutionary model that processes audio directly, without intermediate text conversion. SIP, images, remote MCP, and record performance transform voice agents.

OpenAI has unveiled gpt-realtime, a next-generation speech-to-speech model that understands and generates audio natively, upending the usual architecture of voice agents.

Where traditional systems chain multiple models (STT → LLM → TTS), gpt-realtime processes and generates audio _directly_ in a single stream, reducing latency, preserving prosodic nuances, and improving the fluidity perceived by the user.

In parallel, the Realtime API reaches general availability and adds crucial enterprise capabilities: SIP for telephony, image input to visually contextualize a conversation, and remote MCP servers to plug in business tools with one click.

A Unified Architecture: A Paradigm Shift

The era of fragmented pipelines is coming to an end. Each intermediate conversion (voice→text, text→voice) adds latency and can degrade the original intent (intonations, hesitations, accents, breathing).

By bringing together perception, reasoning, and synthesis within a single model and API, gpt-realtime preserves these subtle signals and delivers more natural responses, with conversation continuity much closer to human exchange.

For product teams, this means less "glue" between services and more control over the end-to-end experience.

Measured Performance: Unprecedented Results

The published evaluations are not micro-gains: they represent a generational leap on complex audio tasks. On Big Bench Audio, which measures reasoning capabilities from voice content, the model achieves 82.8% accuracy, versus 65.6% for a December 2024 model.

In other words, the agent better understands multi-step queries, reformulations, and prosodic implications — a key point as soon as you move beyond overly scripted scenarios.

  • General Intelligence (Big Bench Audio): 82.8% (vs 65.6%, a +26.2% relative gain).
  • Instruction Following (MultiChallenge): 30.5% (vs 20.6%, +48.1% relative).
  • Function Calling (ComplexFuncBench): 66.5% (vs 49.7%, +33.8% relative).
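For readers checking the arithmetic, the deltas above are relative gains over the December 2024 model's scores. A minimal sketch using the published figures:

```python
# Benchmark scores from OpenAI's announcement: (December 2024 model, gpt-realtime).
benchmarks = {
    "Big Bench Audio": (65.6, 82.8),
    "MultiChallenge": (20.6, 30.5),
    "ComplexFuncBench": (49.7, 66.5),
}

def relative_gain(old: float, new: float) -> float:
    """Improvement of `new` over `old`, as a percentage of `old`."""
    return (new - old) / old * 100

for name, (old, new) in benchmarks.items():
    print(f"{name}: {old}% -> {new}% ({relative_gain(old, new):+.1f}% relative)")
```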

Concretely, these gains translate to fewer misunderstandings and "retries" on the client side, better adherence to instructions (regulatory scripts, word-for-word readings), and more reliable action triggering in your systems (CRM, payments, booking) — all while maintaining conversation during longer operations.

Major Technical Innovations

Granular Voice Control and Expressiveness

The model follows fine directives such as "speak quickly and professionally" or "adopt an empathetic tone with a French accent."

This expressiveness paves the way for stable vocal personas (customer service, banking advisor, health support) where voice finally conveys rhythm, warmth, empathy, and assurance — thus _trust_.

Exclusive New Voices

Two new voices, Marin and Cedar, arrive exclusively on the Realtime API and are accompanied by widespread improvements to the eight existing voices (naturalness, intonation, artifact reduction).

Marin brings soothing warmth for support contexts; Cedar, professional energy for efficiency-oriented environments.
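As a sketch, selecting a voice and pinning down a persona happens through the Realtime API's session.update event. The field names below follow OpenAI's published event schema at the time of writing; the persona text is a hypothetical example:

```python
import json

# Hypothetical persona: a calm, professional support agent using the new
# "marin" voice. The event is sent to the Realtime API as a text frame.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "marin",
        "instructions": (
            "You are a customer-support advisor. Speak quickly and "
            "professionally, and shift to an empathetic tone when the "
            "caller sounds stressed."
        ),
    },
}

payload = json.dumps(session_update)  # serialized for the WebSocket
```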

Understanding Non-Verbal Signals

The model captures paralinguistic markers (laughter, hesitations, mid-sentence language switching) and adapts its register accordingly.

Gone are the overly mechanical "turn-taking" dialogues: interaction becomes more organic, with natural follow-ups and targeted clarifications — as a human advisor would offer.

Asynchronous Function Calls, Without Breaking Flow

Long tool calls (database queries, slow APIs) no longer interrupt the conversation: the model can continue to exchange, reformulate, or confirm while waiting for the response — _without_ code changes on the developer side.

This is essential for "real-time" voice journeys where you no longer want silent waits.
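A sketch of what this looks like on the developer side, assuming the Realtime API's function-calling event names; `lookup_order` and its slow backend are hypothetical stand-ins for a real business call:

```python
import asyncio
import json

# Hypothetical slow business tool: while it runs, the model can keep
# talking to the caller; we only report back once the result is ready.
async def lookup_order(order_id: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a slow database or API call
    return {"order_id": order_id, "status": "shipped"}

async def handle_event(event: dict, send) -> None:
    # Fired when the model has finished emitting a function call's arguments.
    if event.get("type") == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = await lookup_order(args["order_id"])
        # Return the tool result; the conversation never had to pause.
        await send({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        })
```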

Multimodal Capabilities and Advanced Integrations

Real-Time Image Support

The agent can receive images (photos, screenshots, diagrams) to anchor the conversation in what the user sees: visual technical diagnosis, medical assistance (prescription reading, for example), educational support.

Your application keeps control over _when_ and _which_ images to share.
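As a sketch, an image is attached as a conversation item; the `input_image` content type follows OpenAI's image-input announcement, while the bytes and prompt below are placeholders:

```python
import base64

# Placeholder: in a real app, `image_bytes` would come from a camera,
# an upload, or a screenshot capture.
image_bytes = b"<png bytes here>"
b64 = base64.b64encode(image_bytes).decode()

event = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What does this error dialog mean?"},
            {"type": "input_image", "image_url": f"data:image/png;base64,{b64}"},
        ],
    },
}
```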

Remote MCP Servers

To extend the agent's capabilities, you can plug in a remote MCP server (e.g., Stripe, an in-house back-office, a knowledge base) directly into the Realtime session.

No more heavy custom integrations: you change the MCP server URL and new tools become immediately available.
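Concretely, the hookup is a few lines of session configuration; the field names mirror OpenAI's MCP tool configuration, and the label and URL below are hypothetical:

```python
# Hypothetical remote MCP server exposing back-office tools. Swapping
# server_url is all it takes to swap in a new tool catalog.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "backoffice",
                "server_url": "https://mcp.example.com/sse",
                "require_approval": "never",
            }
        ],
    },
}
```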

Native SIP Connectivity

SIP support allows connecting the agent to telephony (PSTN), enterprise PBXs, and existing SIP terminals, facilitating production deployments in already-established architectures (call centers, receptions, agencies).

Industry Applications and Use Cases

In finance, the agent can read regulatory warnings word-for-word, verify credentials, trigger anti-fraud controls, and orchestrate complex journeys (verifications, follow-ups, appointment scheduling) while remaining compliant.

In healthcare, it accelerates clinical documentation, supports triage, and enriches telemedicine through images.

In retail, it unifies omnichannel support (web, mobile, store), personalizes voice product search, automates returns, and provides real-time emotion insights.

Technical Architecture and Implementation

Transport: the API operates via bidirectional _WebSocket_ for real-time audio streaming, with minimal latency and better resilience than simple HTTP polling.

Audio formats: PCM16 support (24 kHz recommended for natural rendering), plus telephony-oriented G.711 (µ-law/A-law) encodings for low-latency scenarios.

Server-side VAD: voice activity detection (speech threshold, prefix padding, silence duration) simplifies the client and captures complete utterances, interruptions included.
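The points above can be sketched as a single session configuration; the parameter names follow the Realtime API's server-VAD schema, and the values are illustrative defaults:

```python
# Illustrative session setup: PCM16 audio plus server-side voice activity
# detection tuned to capture complete utterances.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # speech probability needed to open a turn
            "prefix_padding_ms": 300,    # audio kept from just before speech onset
            "silence_duration_ms": 500,  # trailing silence that closes the utterance
        },
    },
}
```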

Economics and Pricing Model

OpenAI prices gpt-realtime approximately 20% below the previous gpt-4o-realtime-preview model:

$32 / 1M audio tokens input, $64 / 1M audio tokens output (and $0.40 / 1M for input cache).

Additionally, fine context control allows intelligently capping tokens and truncating multiple turns at once, significantly reducing the bill on long sessions.
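A back-of-envelope estimator using the published rates; the token counts in the usage note below are assumptions for illustration, not measurements:

```python
# Published audio-token rates, in dollars per million tokens.
INPUT_PER_M = 32.00
OUTPUT_PER_M = 64.00
CACHED_INPUT_PER_M = 0.40

def session_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Estimated cost in dollars for one session's audio tokens."""
    return (
        input_tokens * INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
    ) / 1_000_000
```

For example, a session consuming 100k input and 50k output audio tokens would land around $6.40 before caching; capping context and truncating old turns shrinks the input side directly.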

Implementation Challenges and Best Practices

Like any real-time system, quality depends on network stability and regional proximity. In production, plan for:

  • proactive monitoring (latency, errors, satisfaction);
  • fallback mechanisms (e.g., call back or send a message if the user disconnects);
  • data governance (logging, purging, consents);
  • a training plan so your support/ops teams can make the most of the new conversational design paradigm.

Roadmap and Recommendations

  • Phase 1 — evaluate and pilot: benchmark against your current stack, network load tests, POC on 2-3 priority use cases with clear KPIs (resolution rate, NPS, AHT, escalation rate).
  • Phase 2 — deploy progressively: batch migration, journey instrumentation, progressive integration of images, SIP, and remote MCPs.
  • Phase 3 — industrialize and innovate: proprietary vocal personas, internal tooling (reusable prompts, tool catalogs), and publishing feedback to establish your leadership.

Conclusion: The Future of Human-Machine Interaction

The gpt-realtime model is not a "simple upgrade": it's an architectural overhaul that moves voice closer to being a primary interface for AI.

With its demonstrated progress on audio reasoning, instruction adherence, tool calling, and its new _building blocks_ (SIP, image input, MCP), enterprises finally have a robust foundation to build production voice agents — more useful, more expressive, and more reliable.

Organizations that test, measure, and iterate now will gain a sustainable head start on high-value use cases.