How to Build the Lowest Latency Voice Agent in Vapi: Achieving ~465ms End-to-end Latency

Written by assemblyai | Published 2026/03/25
Tech Story Tags: ai | ai-voice-agent | voice-agents | vapi | speech-to-text | assemblyai | vapi-voice-agent | good-company

TL;DR: Build a voice agent in Vapi that achieves an impressive ~465ms end-to-end latency, fast enough to feel truly conversational.

Voice AI applications are revolutionizing how we interact with technology, but latency remains the biggest barrier to creating truly conversational experiences. When users have to wait seconds for a response, the magic of natural conversation is lost.

In this comprehensive guide, we'll show you how to build a voice agent in Vapi that achieves an impressive ~465ms end-to-end latency—fast enough to feel truly conversational.

Understanding the latency challenge

Before diving into the configuration, it's crucial to understand that voice agent latency comes from multiple components in the pipeline:

  • Speech-to-Text (STT): Converting audio to text
  • Large Language Model (LLM): Processing and generating responses
  • Text-to-Speech (TTS): Converting text back to audio
  • Turn Detection: Determining when the user has finished speaking
  • Network Overhead: Data transmission delays

The key to ultra-low latency is optimizing each component and minimizing unnecessary delays.

The optimal configuration stack

Our target configuration achieves the following breakdown:

  • STT: 90ms (AssemblyAI Universal-Streaming)
  • LLM: 200ms (Groq Llama 4 Maverick 17B)
  • TTS: 75ms (ElevenLabs Flash v2.5)
  • Pipeline Total: 365ms
  • Network Overhead: 100ms (Web) / 600ms+ (Telephony)
  • Final Latency: ~465ms (Web) / ~965ms+ (Telephony)
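The breakdown above is simple addition; as a quick sanity check in Python (the millisecond figures are the targets quoted in this guide, not measured values):

```python
# Latency budget for the target configuration. These component timings are
# the guide's stated targets, not values returned by any API.
STT_MS = 90    # AssemblyAI Universal-Streaming
LLM_MS = 200   # Groq Llama 4 Maverick 17B
TTS_MS = 75    # ElevenLabs Flash v2.5

pipeline_ms = STT_MS + LLM_MS + TTS_MS      # core pipeline: 365ms
web_total_ms = pipeline_ms + 100            # + WebRTC overhead: ~465ms
telephony_total_ms = pipeline_ms + 600      # + telephony overhead: ~965ms+

print(pipeline_ms, web_total_ms, telephony_total_ms)
```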

Step 1: Configure speech-to-text with AssemblyAI

AssemblyAI's Universal-Streaming API is currently one of the fastest STT options available, delivering transcripts in just 90ms.

Key Configuration Settings:
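In code terms, the transcriber section of a Vapi assistant config might look like the following sketch. The provider string and field names are assumptions based on Vapi's assistant schema; verify them against the current API reference before deploying:

```python
# Sketch of the "transcriber" section of a Vapi assistant config.
# Field names are illustrative; check Vapi's assistant API docs.
transcriber_config = {
    "provider": "assembly-ai",
    "language": "en",
    # The critical latency optimization: skip punctuation/casing/number
    # formatting so transcripts are delivered as fast as possible.
    "formatTurns": False,
}
```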

Critical optimization: Disable formatting

This is perhaps the most important STT optimization that many developers overlook. By setting Format Turns to false, you eliminate unnecessary processing time that adds latency. Modern LLMs are perfectly capable of understanding unformatted transcripts, and this single change can save precious milliseconds in your pipeline.

Why this matters: Formatting processes like punctuation insertion, capitalization, and number formatting require additional computation. When every millisecond counts, these "nice-to-have" features become latency bottlenecks.

Step 2: Choose the right LLM - Groq's Llama 4 Maverick 17B

The LLM is typically the highest latency component in your voice pipeline, making model selection critical. Groq's Llama 4 Maverick 17B 128e Instruct offers the perfect balance of speed and capability.

Configuration:
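A sketch of the model section, assuming Vapi's Groq provider naming; the model id and field names should be verified against Groq's catalog and Vapi's current docs:

```python
# Sketch of the "model" section of a Vapi assistant config using Groq.
# Field names are illustrative; the model id follows Groq's catalog naming.
model_config = {
    "provider": "groq",
    "model": "meta-llama/llama-4-maverick-17b-128e-instruct",
    "maxTokens": 150,     # keep responses short for voice
    "temperature": 0.7,
    "messages": [
        {"role": "system", "content": "You are a concise voice assistant."}
    ],
}
```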

Why Groq + Llama 4 Maverick?

  • Optimized Model: Llama 4 Maverick offers a best-in-class performance-to-cost ratio
  • Consistent Performance: 200ms processing time with minimal variance
  • Open Source: Cost-effective compared to proprietary alternatives

Pro Tip: Keep your maxTokens relatively low (150-200) for voice applications. Users expect concise responses in conversation, and shorter responses generate faster.

Step 3: Implement lightning-fast TTS with ElevenLabs Flash v2.5

ElevenLabs Flash v2.5 is engineered specifically for low-latency applications, achieving an impressive 75ms time-to-first-byte.

Configuration:
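A sketch of the voice section. The provider string, model id, and the optimizeStreamingLatency field are assumptions based on common ElevenLabs/Vapi naming, and the voiceId is a placeholder:

```python
# Sketch of the "voice" section of a Vapi assistant config.
# Field names are illustrative; voiceId is a placeholder.
voice_config = {
    "provider": "11labs",
    "model": "eleven_flash_v2_5",
    "voiceId": "your-voice-id",
    "optimizeStreamingLatency": 4,   # maximum speed priority
    "stability": 0.5,
    "style": 0,                      # no style exaggeration
}
```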

Key Settings Explained:

  • Optimize Streaming Latency: Set to 4 for maximum speed priority
  • Voice Selection: Choose simpler voices for faster processing
  • No Style Exaggeration: Higher values may increase latency slightly

Step 4: Optimize turn detection settings

This is where many developers unknowingly sabotage their latency optimization. Vapi's default turn detection settings include wait times that can add 1.5+ seconds to your response time—completely negating all your other optimizations.

Critical configuration in advanced settings:

Before:
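A sketch of what the default turn detection settings look like, using the delay values listed later in this section; field names are illustrative and should be checked against Vapi's startSpeakingPlan docs:

```python
# Sketch of Vapi's default startSpeakingPlan values (illustrative).
start_speaking_plan_default = {
    "waitSeconds": 0.4,
    "transcriptionEndpointingPlan": {
        "onPunctuationSeconds": 0.1,
        "onNoPunctuationSeconds": 1.5,   # the big hidden delay
        "onNumberSeconds": 0.5,
    },
}
```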

After:
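A sketch of an optimized startSpeakingPlan. The 0.1s values are a suggested aggressive setting, not an official recommendation; tune them against your own tolerance for the agent cutting in early:

```python
# Sketch of an optimized startSpeakingPlan (illustrative values).
start_speaking_plan_fast = {
    "waitSeconds": 0.0,
    "transcriptionEndpointingPlan": {
        "onPunctuationSeconds": 0.1,
        "onNoPunctuationSeconds": 0.1,   # avoid the 1.5s stall when
                                         # formatting is disabled
        "onNumberSeconds": 0.1,
    },
}
```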

Why this matters as much as model choice:

The default settings often include:

  • Wait Seconds: 0.4s (unnecessary delay)
  • On Punctuation Seconds: 0.1s (unnecessary delay)
  • On No Punctuation Seconds: 1.5s (waiting when no punctuation detected)
  • On Number Seconds: 0.5s (unnecessary delay)

Since our STT has formatting disabled, the system would fall back to the 1.5s "no punctuation" delay, adding 1500ms to a pipeline we've optimized down to 365ms, a delay more than four times the pipeline itself. This single setting can make or break your latency goals.

Network considerations and deployment

Web vs. telephony latency:

  • Web (WebRTC): ~100ms network overhead
  • Telephony (Twilio/Vonage): 600ms+ network overhead

Deployment tips:

  1. Choose regions wisely: Deploy close to your users
  2. Consider CDN: For global applications, use edge locations
  3. Monitor performance: Set up latency monitoring and alerts
  4. Test thoroughly: Network conditions vary significantly

Testing and monitoring your configuration

Key metrics to track:

  • End-to-end latency: Time from user stops speaking to agent starts responding
  • Component breakdown: Individual STT, LLM, TTS timings
  • Network overhead: Measure actual vs. expected network delays
  • User experience: Conduct user testing for perceived responsiveness
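To track the first metric, here is a minimal timing-helper sketch. The event names are hypothetical; in practice you would hook these marks into your client's audio callbacks:

```python
import time

class LatencyTracker:
    """Minimal sketch: record pipeline events and report the gap from
    user speech end to first agent audio. Event names are made up for
    illustration."""

    def __init__(self):
        self.events = {}

    def mark(self, name):
        # Record a monotonic timestamp for a named pipeline event.
        self.events[name] = time.monotonic()

    def end_to_end_ms(self):
        # End-to-end latency: user stops speaking -> agent audio starts.
        return (self.events["agent_audio_start"]
                - self.events["user_speech_end"]) * 1000.0

tracker = LatencyTracker()
tracker.mark("user_speech_end")
# ... STT -> LLM -> TTS would run here ...
tracker.mark("agent_audio_start")
print(f"end-to-end: {tracker.end_to_end_ms():.0f}ms")
```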

Common pitfalls and troubleshooting

1. Forgetting turn detection settings

Problem: Great model configuration, but 1.5s delays remain

Solution: Always check and optimize startSpeakingPlan settings

2. Over-engineering prompts

Problem: Long system prompts increase LLM processing time

Solution: Keep prompts concise and specific

3. Ignoring network conditions

Problem: Perfect configuration, but poor real-world performance

Solution: Test in various network conditions and locations

4. Choosing quality over speed

Problem: Using high-quality but slower models

Solution: For voice, prioritize speed; users value responsiveness over perfection

Conclusion

Building a voice agent with ~465ms end-to-end latency is achievable with the right configuration and attention to detail. The key insights are:

  1. Every component matters: Optimize STT, LLM, and TTS individually
  2. Turn detection is critical: Default settings can destroy your latency goals
  3. Disable unnecessary features: Formatting and other "nice-to-haves" add latency
  4. Test in realistic conditions: Network overhead varies significantly by deployment

By following this configuration and understanding the principles behind each optimization, you'll create voice agents that feel truly conversational. Remember, in voice AI, perceived speed often matters more than absolute accuracy—users will forgive minor imperfections but won't tolerate slow responses.

The future of voice AI lies in these ultra-responsive interactions. With this guide, you're now equipped to build voice agents that meet users' expectations for natural, real-time conversation.


Written by assemblyai | AssemblyAI builds advanced speech language models that power next-generation voice AI applications.