Voice AI Architecture Guide 2026 - Technical Deep Dive

Voice AI Architecture Guide: Cascaded vs Speech-to-Speech in 2026

By Claude & Jozo • February 5, 2026 • 18 min read

  • 300ms — natural response time
  • 3 — core architectures
  • 40ms — best VAD latency
  • 2-4s — typical cascaded latency

Building a production voice AI system in 2026 requires architectural decisions that fundamentally impact latency, quality, and control. The choice between cascaded architectures (STT → LLM → TTS) and speech-to-speech models (GPT-4o Realtime, Gemini 2.5 Flash) determines everything from debugging complexity to user experience.

This technical guide breaks down the three core voice AI architectures, explains Voice Activity Detection (VAD) for natural turn-taking, covers tool calling in voice contexts, and provides a latency optimization framework for hitting that critical 300-500ms response window humans expect.

🎯 The Critical Architecture Decision

Human conversation feels natural when responses arrive within 300-500ms; beyond 500ms, replies start to read as lag. Your architecture determines whether you can hit that target.

Cascaded (STT→LLM→TTS): 2-4 seconds typical, maximum control
Speech-to-Speech: Sub-1s possible, less granular control

💡 The 2026 reality: Speech-to-speech models like GPT-4o Realtime and Gemini 2.5 Flash have closed the quality gap while cutting latency in half. But cascaded architectures still win for enterprise scenarios requiring specific compliance, custom models, or fine-grained debugging.

🔗 Architecture 1: Cascaded (Traditional)

How It Works

🎤
Step 1: Speech-to-Text (STT)

User speaks → Audio captured → STT model transcribes to text

Latency: 100-500ms (streaming vs batch)

🧠
Step 2: Language Model Processing (LLM/GenAI)

Text processed → LLM generates response text

Latency: 200-2000ms (depends on model size, prompt complexity)

🔊
Step 3: Text-to-Speech (TTS)

Response text → TTS synthesizes audio → User hears response

Latency: 200-800ms (varies with streaming, quality)

Advantages

  • Best-in-class components: Mix Deepgram STT + GPT-4 + ElevenLabs TTS
  • Maximum control: Swap any component independently
  • Easier debugging: Inspect text at each stage
  • Compliance-friendly: Use specific certified models for regulated industries
  • Custom model support: Fine-tuned models for domain-specific tasks

⚠️ Disadvantages

  • Higher total latency: 2-4s end-to-end in good conditions
  • Integration complexity: Coordinate 3+ separate services
  • Lost audio nuance: Tone, emotion, laughter disappear in transcription
  • More failure points: Each component can fail independently
  • Network overhead: Multiple round-trips add 50-200ms each

Example: Cascaded Voice Agent with Deepgram + GPT-4 + ElevenLabs

import { createClient } from "@deepgram/sdk";
import OpenAI from "openai";
import { ElevenLabsClient } from "elevenlabs";

// Initialize clients (Deepgram SDK v3+ uses createClient)
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

async function processVoiceInput(audioBuffer) {
  // Step 1: STT - Transcribe audio (100-500ms)
  const { result, error } = await deepgram.listen.prerecorded.transcribeFile(
    audioBuffer,
    { model: "nova-3", language: "en" }
  );
  if (error) throw error;

  const userText = result.results.channels[0].alternatives[0].transcript;
  console.log("User said:", userText);

  // Step 2: LLM - Generate response (200-2000ms)
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are a helpful voice assistant." },
      { role: "user", content: userText }
    ],
    max_tokens: 150
  });

  const responseText = completion.choices[0].message.content;
  console.log("Assistant responds:", responseText);

  // Step 3: TTS - Synthesize speech (200-800ms)
  const audio = await elevenlabs.generate({
    voice: "Rachel",
    text: responseText,
    model_id: "eleven_multilingual_v2"
  });

  return audio; // Total: ~500-3300ms sequential (sum of the three stage ranges above)
}

⏱️ Latency breakdown: Best case with streaming: ~500-800ms total (STT 100ms + LLM 200ms + TTS 200ms + network 100ms). Typical production: 2-4 seconds without aggressive optimization.

🚀 Architecture 2: Speech-to-Speech (End-to-End)

How It Works

Models like GPT-4o Realtime, Gemini 2.5 Flash Native Audio, and Grok Voice API process audio directly without intermediate text transcription.

🎙️
Single-Step Processing

User speaks → Native audio model processes → Audio response generated

Latency: Sub-1 second end-to-end (Grok: <1s, GPT-4o Realtime varies)

The model is trained to analyze the raw acoustic signal, bypassing the need for intermediary text transcription (ASR) and text synthesis (TTS). This preserves tone, emotion, laughter, and non-verbal cues.

Advantages

  • Lowest latency: Sub-1s response time in production
  • Preserves emotion: Tone, laughter, non-verbal cues maintained
  • More natural: Responds to how user speaks, not just what they say
  • Simpler integration: Single API call, fewer moving parts
  • Better interruptions: Native handling of barge-in scenarios

⚠️ Disadvantages

  • Less control: Can't swap STT, LLM, TTS independently
  • Harder debugging: No intermediate text to inspect
  • Vendor lock-in: Tied to provider's full stack
  • Limited customization: Can't use fine-tuned domain models
  • Compliance challenges: May not meet specific regulatory requirements

OpenAI GPT-4o Realtime API

The GPT-4o Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o, sending audio input and receiving audio responses in real-time. Available via WebRTC or WebSocket.

The new gpt-realtime model shows higher intelligence, can comprehend native audio with greater accuracy, captures non-verbal cues (like laughs), and switches languages mid-sentence. Pricing reduced by 20%: $32/1M audio input tokens and $64/1M audio output tokens.

Key improvements: Better at following complex instructions, calling tools with precision, producing speech that sounds more natural and expressive, interpreting system messages and developer prompts.

Gemini 2.5 Flash Native Audio

Google's Gemini 2.5 Flash Native Audio is trained to analyze the raw acoustic signal itself, bypassing intermediary text transcription and synthesis. The recent update improves the model's ability to handle complex workflows, navigate user instructions, and hold natural conversations.

Emotional Intelligence: Models using Gemini Live API native audio can understand and respond appropriately to users' emotional expressions for more nuanced conversations. Better conversation quality with improved multi-turn context retrieval.

2026 roadmap: Live speech translation will continue to be iterated on and brought to more Google products including the Gemini API in 2026.

Example: GPT-4o Realtime API (WebSocket)

import WebSocket from "ws";

const ws = new WebSocket("wss://api.openai.com/v1/realtime", {
  headers: {
    "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1"
  }
});

ws.on("open", () => {
  // Configure session for voice conversation
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      instructions: "You are a helpful voice assistant.",
      voice: "alloy",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      turn_detection: { type: "server_vad" } // Built-in VAD
    }
  }));
});

ws.on("message", (data) => {
  const event = JSON.parse(data);

  if (event.type === "response.audio.delta") {
    // Stream audio chunks directly to speaker (sub-1s latency)
    playAudioChunk(event.delta);
  }

  if (event.type === "response.audio.done") {
    console.log("Response complete");
  }
});

// Send audio input
function sendUserAudio(audioBuffer) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: audioBuffer.toString("base64")
  }));
}

💡 When to choose Speech-to-Speech: Consumer-facing apps where latency is critical, emotional conversations (therapy, customer service), scenarios requiring natural interruptions, applications where "how" matters as much as "what" is said.

⚡ Architecture 3: Hybrid Approaches

Combining Best of Both Worlds

Hybrid architectures use speech-to-speech for general conversation but switch to cascaded for specific tasks requiring custom models or precise control.

Use S2S for:
  • General conversation and chitchat
  • Quick FAQs and simple queries
  • Scenarios requiring empathy/emotion

Switch to Cascaded for:
  • Medical/legal transcription (compliance)
  • Domain-specific fine-tuned models
  • Complex tool calling workflows

Fallback Strategies

Production systems implement fallbacks to maintain reliability:

  1. Primary: Speech-to-Speech → Fast, natural response for 80% of queries
  2. Fallback: Cascaded → When S2S fails or encounters an unsupported language
  3. Emergency: Static responses → Pre-recorded messages if all services are down

Example: Customer service bot uses GPT-4o Realtime for conversations, but switches to Deepgram STT + custom LLM + ElevenLabs TTS when accessing proprietary knowledge base requiring fine-tuned embeddings.
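The fallback chain above can be sketched as a simple priority loop. This is a minimal illustration, not a library API: `respond_with_fallback` and the handler functions are hypothetical names, and a real system would distinguish retryable errors from hard failures.

```python
def respond_with_fallback(audio, handlers):
    """Try each (name, handler) tier in priority order; a handler that
    raises hands off to the next tier. The static tier never fails."""
    for name, handler in handlers:
        try:
            return name, handler(audio)
        except Exception:
            continue  # this tier is down or unsupported; fall through
    return "static", "Sorry, we're having trouble right now. Please try again later."
```

In practice each handler wraps a full pipeline (S2S session, cascaded STT→LLM→TTS), and the static tier plays a pre-recorded clip instead of returning text.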

🎙️ VAD: Voice Activity Detection Deep Dive

What is VAD?

Voice Activity Detection (VAD) is the process of determining when someone is speaking versus silence or background noise. It's the foundation of natural turn-taking in voice conversations.

Why it matters: Without VAD, your agent doesn't know when to start listening, when to stop, or when the user has finished speaking. Poor VAD = awkward pauses, cut-off responses, or missed input.

Types of VAD

📊 Energy-Based

Simple threshold detection based on audio amplitude.

Fast (~1ms processing)
Fails with background noise

🤖 ML-Based

Neural networks trained to detect speech vs non-speech.

Handles noise well
High accuracy (Silero VAD)

☁️ Server-Side

VAD handled by the voice platform (GPT-4o, Gemini).

Zero implementation needed
Network latency overhead

🧠 Semantic/CSR

Understands language context to detect turn completion, not just silence.

~30% fewer false interruptions
Replaces VAD+endpointing stack
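Of the four approaches above, the energy-based one is simple enough to sketch in a few lines: compute the RMS amplitude of each frame and compare it to a threshold. The threshold value here is illustrative; real systems calibrate it to the microphone and ambient noise floor, which is exactly why this approach fails in noisy rooms.

```python
import math

def is_speech(frame, threshold=500.0):
    """Energy-based VAD: RMS amplitude of one frame of 16-bit PCM samples.
    Returns True when the frame's energy exceeds the (illustrative) threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold
```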

Silero VAD: The Industry Standard

Silero VAD is a pre-trained, enterprise-grade Voice Activity Detector that supports 6000+ languages and performs well on audio from different domains with varied background noise and quality levels. One audio chunk (30+ ms) takes less than 1ms to process on a single CPU thread.

Technical Specs:
  • Supports 8000 Hz and 16000 Hz sampling rates
  • JIT model ~2MB in size
  • Published under MIT license (zero restrictions)
  • PyTorch-based, ONNX runtime available

Performance:
  • <1ms CPU processing per 30ms chunk
  • Handles noisy environments well
  • Used by LiveKit, Pipecat, and major platforms
  • High accuracy speech detection

Silero VAD Implementation (PyTorch)

import torch

# Load Silero VAD model from torch hub
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False
)

(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Read audio file (16kHz recommended)
wav = read_audio('audio.wav', sampling_rate=16000)

# Get speech timestamps with configurable parameters
speech_timestamps = get_speech_timestamps(
    wav,
    model,
    threshold=0.5,              # Speech probability threshold
    sampling_rate=16000,
    min_speech_duration_ms=250, # Minimum speech segment
    min_silence_duration_ms=100 # Minimum silence to split
)

print(speech_timestamps)
# Output: [{'start': 3200, 'end': 12800}, {'start': 16000, 'end': 24000}]

Deepgram Flux: Semantic Turn Detection (CSR)

Deepgram Flux is the first production Conversational Speech Recognition (CSR) model that fuses transcription with turn detection into a single unified system. Instead of relying on silence thresholds, Flux understands semantic completeness — recognizing that "because..." means the user isn't finished, while "Thanks." signals turn completion.

How it differs from traditional VAD:
  • Detects "has the user finished their thought?" vs "is there silence?"
  • Replaces separate STT + VAD + endpointing pipeline with one model
  • Emits StartOfTurn and EndOfTurn events natively
  • ~30% fewer false interruptions vs silence-based approaches

Performance (VAQI benchmark):
  • End-of-turn detection within 1.5s (p95)
  • Nova-3 level transcription accuracy maintained
  • Two config params: eot_threshold + eot_silence_threshold_ms
  • "Eager" mode: speculative LLM calls for 150-250ms faster responses

Deepgram Flux Integration (replaces VAD + STT pipeline)

import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

// Single connection replaces separate VAD + STT + endpointing
const connection = deepgram.listen.live({
  model: "nova-3",
  smart_format: true,
  // Flux semantic turn detection — replaces silence-based VAD
  endpointing: "flux",
  eot_threshold: 0.5,              // Semantic completion confidence
  eot_silence_threshold_ms: 800,   // Fallback silence threshold
});

connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const transcript = data.channel.alternatives[0].transcript;
  console.log(data.is_final ? "Final:" : "Partial:", transcript);
});

// Native turn events — no separate VAD needed
connection.on(LiveTranscriptionEvents.EndOfTurn, () => {
  console.log("User finished speaking — generate response");
});

connection.on(LiveTranscriptionEvents.StartOfTurn, () => {
  console.log("User started speaking — stop agent audio");
});

When to use Flux over Silero: Flux is ideal when you want to eliminate the VAD+endpointing layer entirely and get turn detection baked into transcription. Silero remains better for client-side preprocessing, offline use, or when you need VAD independent of any cloud STT provider.

Key VAD Concepts

Speech Probability Threshold

The confidence level (0.0-1.0) required to classify audio as "speech". Lower = more sensitive (catches whispers but more false positives). Higher = more strict (misses quiet speech but fewer false triggers).

Typical values: 0.5 (balanced), 0.3 (sensitive), 0.7 (strict)

Min Speech/Silence Duration

Min speech duration: Ignore speech segments shorter than this (filters out random noise spikes). Typical: 250ms.
Min silence duration: How long silence must last to consider speech ended. Typical: 100-500ms depending on use case.

Padding (Pre/Post Speech)

Extra audio to include before/after detected speech to avoid cutting off the start or end of words. Pre-padding: ~100-200ms. Post-padding: ~100-300ms.
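Padding is a simple post-processing pass over Silero-style speech timestamps (sample offsets). The sketch below assumes 16kHz audio; `pad_timestamps` is a hypothetical helper, not part of the Silero API.

```python
def pad_timestamps(segments, pre_ms=150, post_ms=200, sample_rate=16000,
                   total_samples=None):
    """Expand each {'start', 'end'} speech segment (in sample offsets) by
    pre/post padding, clamped to the audio boundaries."""
    pre = pre_ms * sample_rate // 1000
    post = post_ms * sample_rate // 1000
    padded = []
    for seg in segments:
        start = max(0, seg["start"] - pre)
        end = seg["end"] + post
        if total_samples is not None:
            end = min(total_samples, end)
        padded.append({"start": start, "end": end})
    return padded
```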

Turn-Taking and Interruption Handling

Turn detection determines when a user begins or ends their "turn" in a conversation, letting the agent know when to start listening and when to respond. Barge-in detection handles when users interrupt the agent mid-response.

Silence-Based (Traditional):
  • Detect silence after speech ends
  • Tuned via min_silence_duration
  • Fast but prone to false triggers on pauses

Semantic (Flux CSR):
  • Model understands if thought is complete
  • "because..." = not done, "Thanks." = done
  • ~30% fewer false cutoffs

Barge-In Handling:
  • VAD detects speech during agent playback
  • Immediately stop agent audio output
  • Buffer user speech, restart turn detection
Best practice: For silence-based VAD, process audio in 10-20ms intervals using Silero ONNX. For semantic turn detection, use Deepgram Flux with eot_threshold tuned to your use case — lower values respond faster but risk more interruptions.
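The barge-in flow above can be sketched as a small state holder. This is a hypothetical class for illustration; a real implementation hooks these methods into the audio player and the VAD/StartOfTurn callbacks.

```python
class BargeInController:
    """Minimal barge-in sketch: stop agent playback the moment VAD
    detects user speech, and buffer that speech for the next turn."""

    def __init__(self):
        self.agent_speaking = False
        self.buffered = []

    def on_agent_audio_start(self):
        self.agent_speaking = True

    def on_vad_speech(self, chunk):
        if self.agent_speaking:
            self.agent_speaking = False  # cut agent audio immediately
        self.buffered.append(chunk)      # keep user audio for turn detection

    def take_turn_audio(self):
        audio, self.buffered = self.buffered, []
        return audio
```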

🔧 Tool Calling / Function Calling in Voice AI

What is Tool Calling in Voice Context?

Tool calling (or function calling) in voice AI allows the agent to execute external actions during a conversation—checking databases, updating CRMs, pulling live data, sending emails, booking appointments—all while maintaining natural dialogue.

Examples: "Book me an appointment" → Agent calls calendar API.
"Check my balance" → Agent queries account database.
"Send an email to Sarah" → Agent triggers email service.

Architectures for Tool Calling

⏸️ Synchronous (User Waits)

Agent pauses conversation, executes tool, returns result, then continues.

Flow: User: "What's the weather?"
Agent: [calls weather API]
Agent: "It's 72°F and sunny."
Simple to implement
Awkward silence for slow tools

🔄 Asynchronous (Background Execution)

Agent continues conversation while tool executes in background.

Flow: User: "Email my report"
Agent: "Sending that now. Anything else?"
[Email sends in background]
No awkward pauses
More complex to implement
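The asynchronous pattern can be sketched with asyncio: start the tool as a background task, reply to the user immediately, and collect the result later. `send_report` and `handle_turn` are hypothetical stand-ins for a slow email API and the agent's turn handler.

```python
import asyncio

async def send_report(recipient):
    # Stand-in for a slow email/CRM API call
    await asyncio.sleep(0.01)
    return f"sent to {recipient}"

async def handle_turn():
    # Kick off the tool in the background, answer the user right away
    task = asyncio.create_task(send_report("sarah@example.com"))
    reply = "Sending that now. Anything else?"
    # ... agent keeps conversing; later, await the result for confirmation
    result = await task
    return reply, result
```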

Implementation Patterns

In Cascaded Architecture

The LLM decides to call a tool, you execute it, then TTS speaks the result.

// Example: GPT-4 with function calling in cascaded pipeline
const completion = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "user", content: "What's the weather in San Francisco?" }
  ],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Get current weather for a location",
        parameters: {
          type: "object",
          properties: {
            location: { type: "string", description: "City name" }
          },
          required: ["location"]
        }
      }
    }
  ]
});

// LLM decides to call get_weather function
const toolCall = completion.choices[0].message.tool_calls[0];
const weatherData = await getWeather(JSON.parse(toolCall.function.arguments).location);

// Feed result back to LLM
const finalResponse = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "user", content: "What's the weather in San Francisco?" },
    completion.choices[0].message,
    {
      role: "tool",
      tool_call_id: toolCall.id,
      content: JSON.stringify(weatherData)
    }
  ]
});

// TTS speaks the final response
const audio = await elevenlabs.generate({
  text: finalResponse.choices[0].message.content,
  voice: "Rachel"
});

In Speech-to-Speech (GPT-4o Realtime)

Native function calling with asynchronous support—the model can continue a fluid conversation while waiting on tool results.

// Define tools in session configuration
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "check_order_status",
        description: "Check the status of a customer order",
        parameters: {
          type: "object",
          properties: {
            order_id: { type: "string" }
          },
          required: ["order_id"]
        }
      }
    ]
  }
}));

// When model calls a tool, you receive event
ws.on("message", async (data) => {
  const event = JSON.parse(data);

  if (event.type === "response.function_call_arguments.done") {
    // "arguments" can't be bound in strict mode (ES modules) — rename it
    const { call_id, arguments: args } = event;

    // Execute tool asynchronously
    const result = await checkOrderStatus(JSON.parse(args).order_id);

    // Send result back to session
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: call_id,
        output: JSON.stringify(result)
      }
    }));

    // Ask the model to continue the response using the tool output
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});

Providers with Tool Calling

GPT-4o Realtime API:

Native async function calling. Model continues conversation while tools execute. Improved precision with gpt-realtime model.

Vapi.ai:

Advanced function calling during conversations. Check databases, update CRMs, pull live data mid-call. Supports custom tool integrations.

Grok Voice Agent API:

Tools called automatically based on conversation context. Search web, query documents, execute business logic. 100+ languages supported.

Model Context Protocol (MCP):

Standardized tool connections. MCP-compliant servers use tools/list and tools/call protocol. Works with Realtime API.

🎯 Best Practices: Confirm before destructive actions ("Should I delete this?"), provide progress updates for long operations ("Checking that now..."), handle errors gracefully in voice context ("I couldn't find that, could you try again?"), use async for long-running tools to avoid awkward silence.

⚡ Latency Optimization

The Human Conversation Threshold

Human conversation feels natural within a 300-500ms response window; beyond 500ms, replies read as lag. Based on analysis of 4M+ production voice agent calls, hitting this threshold requires aggressive optimization at every stage.

Latency Reality Check:
  • Cascaded: 2-4s typical end-to-end (exceeds natural threshold)
  • Cascaded optimized: 500-800ms achievable with streaming
  • Speech-to-Speech: Sub-1s typical (within natural range)

Where Latency Comes From

STT Processing — 100-500ms

Streaming reduces this to ~100-200ms. Batch processing can take 500ms+.

Best: Deepgram Nova-3 (streaming), Groq Whisper (fast batch)

LLM Inference (First Token Latency) — 200-2000ms

The LLM is typically the highest-latency component. Model size, prompt complexity, and provider speed all matter.

Best: Groq Llama 4 Maverick 17B (speed + capability balance)

TTS Synthesis (Time to First Audio Byte) — 75-800ms

Streaming TTS dramatically reduces perceived latency. Time to first byte (TTFB) matters more than total synthesis time.

Best: ElevenLabs Flash (75ms TTFB), Cartesia Sonic Turbo (40ms)

Network Round-Trips — 50-200ms

Each API call adds network latency. Cascaded architectures make 3+ round-trips.

Optimization: Deploy services close together, use persistent WebSocket connections, push latency-sensitive components to the edge

VAD Processing — 1-20ms

ML-based VAD (Silero) adds minimal latency when optimized. Energy-based VAD is nearly instant.

Best: Silero VAD with ONNX runtime (<1ms per 30ms chunk on CPU)

Optimization Techniques

1. Streaming at Every Stage

STT streams partial transcripts, LLM generates tokens incrementally, TTS synthesizes audio chunks as tokens arrive. This eliminates waiting for complete processing.

Impact: Reduces perceived latency from 4s to <1s
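The impact of streaming can be illustrated with a back-of-envelope model: with streaming, time-to-first-audio is roughly the sum of each stage's first-chunk latency, whereas batch processing sums each stage's full completion time. The numbers below are illustrative, not measurements.

```python
def time_to_first_audio(stages, streaming):
    """stages: list of (first_chunk_ms, full_output_ms) per pipeline stage,
    in order (STT, LLM, TTS). Returns approximate ms until audio is heard."""
    if streaming:
        # Each stage forwards its first chunk as soon as it exists
        return sum(first for first, _ in stages)
    # Batch: each stage waits for the previous stage's complete output
    return sum(full for _, full in stages)

# Illustrative (first_chunk_ms, full_ms) for STT, LLM, TTS
stages = [(100, 400), (150, 1200), (75, 600)]
```

With these numbers, streaming yields 325ms to first audio versus 2200ms for the batch pipeline, which is the "4s to <1s" effect described above.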

2. Speculative Generation

Start TTS synthesis before LLM completes full response. If LLM changes direction, discard speculative audio. Works well for predictable responses.

Impact: Saves 100-300ms on predictable turns
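A minimal sketch of the speculation logic, assuming we guessed a likely opening clause before the LLM finished; `synthesize` stands in for any text-to-audio function, and `speculative_tts` is a hypothetical helper.

```python
def speculative_tts(predicted_prefix, final_text, synthesize):
    """Synthesize a guessed opening clause early; keep it only if the
    LLM's final text actually starts with that clause."""
    spec_audio = synthesize(predicted_prefix)  # started before the LLM finished
    if final_text.startswith(predicted_prefix):
        # Speculation paid off: only the remainder still needs synthesis
        return spec_audio + synthesize(final_text[len(predicted_prefix):])
    # LLM changed direction: discard the speculative audio
    return synthesize(final_text)
```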

3. Edge Deployment

Deploy latency-sensitive components (VAD, lightweight STT) on edge nodes close to users. Heavy LLM inference can stay centralized.

Impact: Reduces network latency by 50-100ms

4. Connection Pooling / Keep-Alive

Use WebSocket persistent connections instead of HTTP request/response. Eliminate TCP handshake and TLS negotiation overhead on every interaction.

Impact: Saves 20-50ms per API call

Latency Budget Breakdown

To hit the 300ms natural conversation threshold, here's how to allocate your latency budget across a cascaded architecture with aggressive streaming optimization:

Component          | Typical (No Opt) | Optimized (Streaming) | Optimization
VAD Detection      | 10-20ms          | 1-5ms                 | ONNX runtime, CPU optimization
STT Processing     | 300-500ms        | 100-150ms             | Streaming STT (Deepgram Nova-3)
LLM First Token    | 500-2000ms       | 100-200ms             | Fast inference (Groq), small prompts
TTS First Byte     | 200-800ms        | 40-75ms               | Cartesia Turbo, ElevenLabs Flash
Network Overhead   | 150-300ms        | 50-100ms              | WebSocket, edge deployment, co-location
Total End-to-End   | 2-4 seconds      | 300-530ms             | Streaming + fast providers

* Latencies are approximate and vary based on network conditions, model selection, and implementation quality. These represent production-tested ranges from real-world deployments.
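The optimized column can be sanity-checked with simple arithmetic over the per-component ranges above (the low end sums to 291ms, which the table rounds to a ~300ms floor):

```python
# (low_ms, high_ms) per component, from the optimized (streaming) column
budget_ms = {
    "vad_detection": (1, 5),
    "stt_streaming": (100, 150),
    "llm_first_token": (100, 200),
    "tts_first_byte": (40, 75),
    "network": (50, 100),
}
low = sum(lo for lo, _ in budget_ms.values())
high = sum(hi for _, hi in budget_ms.values())
print(f"optimized end-to-end: {low}-{high} ms")  # optimized end-to-end: 291-530 ms
```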

🎯 Choosing Your Architecture: Decision Framework

Decision Tree

Do you need sub-500ms latency for consumer apps?

Yes: Use Speech-to-Speech (GPT-4o Realtime, Gemini 2.5 Flash, Grok)
No: Continue...

Do you need specific compliance or certified models?

Yes: Use Cascaded with approved STT/LLM/TTS
No: Continue...

Do you need fine-tuned domain-specific models?

Yes: Use Cascaded with custom models
No: Continue...

Is debugging/observability critical?

Yes: Use Cascaded (inspect text at each stage)
No: Use Speech-to-Speech for simplicity and speed
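The tree above collapses to a small function, shown here as a sketch; the question order matters, since latency needs override everything else in this framework.

```python
def choose_architecture(needs_sub_500ms_latency, needs_compliance,
                        needs_custom_models, needs_observability):
    """Walk the decision tree in order; the first 'yes' answer wins."""
    if needs_sub_500ms_latency:
        return "speech-to-speech"
    if needs_compliance or needs_custom_models:
        return "cascaded"
    if needs_observability:
        return "cascaded"
    return "speech-to-speech"
```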

🔗 Choose Cascaded

  • Medical/Legal: Compliance requires specific certified models
  • Enterprise B2B: Custom fine-tuned models for domain jargon
  • High-stakes debugging: Need to inspect intermediate outputs
  • Best-in-class quality: Willing to trade latency for top providers
Example: Healthcare voice scribe using HIPAA-compliant Deepgram + fine-tuned medical terminology LLM

🚀 Choose Speech-to-Speech

  • Consumer apps: Chatbots, virtual assistants, customer service
  • Low latency priority: Natural conversation feel is critical
  • Emotional contexts: Therapy, coaching, empathetic interactions
  • Simple workflows: General Q&A, FAQs, basic assistance
Example: Mental health chatbot using GPT-4o Realtime for empathetic, natural-sounding conversations with sub-1s response

⚡ Choose Hybrid

  • Complex enterprise: Mix of general conversation + specific tasks
  • Fallback scenarios: S2S primary, cascaded when it fails
  • Cost optimization: S2S for most, cascaded for cheaper specific tasks
  • Phased migration: Start cascaded, migrate to S2S gradually
Example: Banking voice agent using GPT-4o for conversation but cascaded with custom fraud detection model for account queries

Cost Considerations at Scale

At production scale (100k+ conversations/month), pricing models differ significantly:

Architecture               | Typical Cost/Min | 100k hrs/mo Cost | Notes
Cascaded (Premium)         | $0.23-$0.33/min  | $1.4M-$2M        | Deepgram + GPT-4 + ElevenLabs
Cascaded (Budget)          | $0.10-$0.15/min  | $600k-$900k      | Whisper + Llama + Cartesia
Speech-to-Speech (GPT-4o)  | Varies by tokens | ~$1M             | $32/$64 per 1M audio tokens
Speech-to-Speech (Grok)    | $0.05/min        | $300k            | Flat rate, all-in-one

* Costs are estimates based on published pricing as of February 2026. Actual costs vary based on conversation length, features used, and negotiated enterprise pricing.

💡 Hidden cost consideration: Cascaded architectures require managing 3+ vendor relationships, separate invoices, and integration maintenance. Speech-to-speech platforms (Grok, GPT-4o Realtime) offer simpler billing but less flexibility. Factor in engineering time when comparing costs.
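The flat per-minute rates in the table convert to monthly cost with simple arithmetic; integer cents avoid floating-point drift (`monthly_cost_usd` is an illustrative helper):

```python
def monthly_cost_usd(rate_cents_per_min, hours_per_month):
    """Flat per-minute pricing: rate (in cents) x minutes of audio per month."""
    return rate_cents_per_min * hours_per_month * 60 / 100

# 5 cents/min at 100k hours/month matches the Grok row's $300k
print(monthly_cost_usd(5, 100_000))
# 23 cents/min matches the premium cascaded low end (~$1.4M)
print(monthly_cost_usd(23, 100_000))
```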

❓ Frequently Asked Questions

What is the difference between cascaded and speech-to-speech voice AI architectures?

Cascaded architecture (STT → LLM → TTS) processes voice in three separate stages: speech-to-text transcription, language model response generation, and text-to-speech synthesis. Speech-to-speech models like GPT-4o Realtime process audio directly without intermediate text, achieving sub-1s latency but offering less granular control. Cascaded offers maximum control and debuggability (2-4s latency typical), while speech-to-speech prioritizes natural conversation speed (<1s) at the cost of flexibility.

What is VAD (Voice Activity Detection) and why does it matter?

Voice Activity Detection (VAD) determines when someone is speaking versus silence or background noise. It's critical for natural turn-taking in voice conversations—knowing when to start listening, when to stop, and when the user has finished speaking. Without proper VAD, agents produce awkward pauses, cut-off responses, or missed input. Modern ML-based VAD like Silero VAD processes 30ms audio chunks in <1ms on CPU, handling noisy environments with high accuracy across 6000+ languages.

How can I achieve sub-500ms latency for natural voice conversations?

Hitting the 300-500ms natural conversation threshold requires: (1) Streaming at every stage—STT streams partial transcripts, LLM generates tokens incrementally, TTS synthesizes chunks as they arrive. (2) Fast provider selection—Deepgram Nova-3 for STT (100-150ms), Groq for LLM inference (100-200ms first token), Cartesia Turbo or ElevenLabs Flash for TTS (40-75ms TTFB). (3) Network optimization—WebSocket persistent connections, edge deployment, service co-location. (4) Or use speech-to-speech models (GPT-4o Realtime, Gemini 2.5 Flash) which achieve sub-1s naturally.

When should I use cascaded architecture vs speech-to-speech?

Choose cascaded when: (1) You need specific compliance or certified models (healthcare, legal), (2) Fine-tuned domain models are critical (medical terminology, industry jargon), (3) Debugging/observability is essential (inspect intermediate text), (4) You want best-in-class components mixed together. Choose speech-to-speech when: (1) Sub-500ms latency is critical (consumer apps), (2) Emotional context matters (therapy, empathetic conversations), (3) Simple general Q&A workflows, (4) Natural interruption handling is important. Use hybrid for complex enterprise scenarios requiring both.

How does tool calling work in voice AI?

Tool calling (function calling) allows voice agents to execute external actions during conversation—checking databases, updating CRMs, sending emails, booking appointments. In cascaded architectures, the LLM decides to call a tool, you execute it, then TTS speaks the result. In speech-to-speech (GPT-4o Realtime), native async function calling lets the model continue conversation while tools execute in background. Modern platforms like Vapi, Grok Voice API, and OpenAI Realtime API support tool calling with varying levels of synchronous vs asynchronous execution.

What is Silero VAD and how do I implement it?

Silero VAD is a pre-trained enterprise-grade Voice Activity Detector supporting 6000+ languages with <1ms processing time per 30ms audio chunk on CPU. It uses PyTorch-based deep neural networks to classify speech vs non-speech with high accuracy in noisy environments. Implementation: Install PyTorch, load model from torch.hub (repo: snakers4/silero-vad), configure threshold (0.5 balanced), min speech duration (250ms), and min silence duration (100-500ms). The model is ~2MB, MIT licensed, and widely used by LiveKit, Pipecat, and production voice platforms.

Ready to Build Production Voice AI?

Whether you choose cascaded, speech-to-speech, or hybrid—start with the right architecture for your latency and quality requirements.

Last updated: February 5, 2026 • Research sourced from OpenAI, Google, Silero, AssemblyAI, and production voice AI platforms