TTS Latency Benchmark 2025: We Tested 7 Text-to-Speech APIs for Voice AI

vexyl.ai
December 10, 2025

Building a real-time voice AI agent? Latency is everything.

Users expect instant responses. A 2-second delay feels like an eternity. Yet most TTS providers market “low latency” without revealing real-world performance from different regions.

We ran comprehensive benchmarks on 7 major TTS providers from both India and US servers to find out which ones actually deliver.

Spoiler: The winner depends entirely on where your server is located.

Quick Results

From US/North America Servers:

| Rank | Provider   | Avg Latency |
|------|------------|-------------|
| 1    | Deepgram   | 436ms       |
| 2    | ElevenLabs | 1,041ms     |
| 3    | OpenAI TTS | 2,049ms     |

From India/Asia Servers:

| Rank | Provider     | Avg Latency |
|------|--------------|-------------|
| 1    | Azure Neural | 363ms       |
| 2    | Sarvam AI    | 952ms       |
| 3    | ElevenLabs   | 1,295ms     |

Bottom line: Use Deepgram in the US, Azure in Asia. Simple as that.

Why We Ran This TTS Latency Benchmark

We’re building Vexyl AI Voice Gateway – an open-source voice assistant platform for Asterisk PBX. Our users deploy servers globally, and we kept hearing:

“Why is my TTS so slow?”

The answer was always: wrong provider for your region.

Most TTS benchmarks are run from US data centers. That’s useless if you’re deploying in India, Europe, or Southeast Asia.

So we tested from both locations to give you actionable data.

Test Methodology

Test Servers

India Server (Delhi):

  • Intel Xeon E-2356G @ 3.20GHz
  • 4GB RAM, Debian 13
  • 500 Mbps symmetric, 0.74ms idle latency
  • ISP: Airtel

US Server (Canada – OVH):

  • Production VPS
  • Low-latency network connection

Test Parameters

  • Iterations: 3 per text length
  • Text lengths: Short (32 chars), Medium (123 chars)
  • Metric: Time from API call to complete audio received
  • Cache: Disabled (measuring raw API performance)
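The “API call to complete audio received” metric can be reproduced with a small timing wrapper. A minimal sketch, assuming `synthesize` stands in for any provider’s API call (it is not a real SDK function):

```javascript
// Minimal latency-measurement sketch. `synthesize` is a placeholder for
// any provider call that resolves once the complete audio is received.
async function measureTtsLatency(synthesize, text, iterations = 3) {
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    await synthesize(text); // full round trip, cache disabled
    const end = process.hrtime.bigint();
    samples.push(Number(end - start) / 1e6); // nanoseconds -> milliseconds
  }
  return samples.reduce((a, b) => a + b, 0) / samples.length;
}
```

Averaging a few iterations smooths out one-off network spikes, which is why we ran 3 per text length.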

Providers Tested

  1. Deepgram Aura
  2. ElevenLabs (eleven_multilingual_v2)
  3. Azure Neural TTS
  4. OpenAI TTS (tts-1)
  5. Sarvam AI (bulbul:v2)
  6. Google Cloud TTS (Chirp3-HD)
  7. Gemini TTS

Full Results: India vs US

Side-by-Side Comparison

| Provider              | India Server | US Server | US vs India |
|-----------------------|--------------|-----------|-------------|
| Deepgram              | 1,830ms      | 436ms     | -76%        |
| Azure (Central India) | 363ms        | 1,876ms   | +417%       |
| ElevenLabs            | 1,295ms      | 1,041ms   | -20%        |
| Sarvam AI             | 952ms        | 2,292ms   | +141%       |
| OpenAI TTS            | 2,129ms      | 2,049ms   | -4%         |

What This Tells Us

  1. Deepgram has US-only servers – 280ms network RTT from India makes it unusable for Asian deployments
  2. Azure regional endpoints matter – Central India region delivers 363ms from India but 1,876ms from US
  3. ElevenLabs has decent global coverage – Consistent ~1 second latency from both regions
  4. Sarvam AI is India-optimized – Great from India (952ms), terrible from US (2,292ms)
  5. OpenAI TTS is slow everywhere – ~2 seconds regardless of location

Detailed Results: India Server

Network RTT to TTS API Endpoints

Before looking at TTS latency, let’s see the raw network round-trip times:

| Provider     | API Endpoint                          | Network RTT |
|--------------|---------------------------------------|-------------|
| ElevenLabs   | api.elevenlabs.io                     | 14.9ms      |
| Google Cloud | texttospeech.googleapis.com           | 15.6ms      |
| Sarvam AI    | api.sarvam.ai                         | 26.1ms      |
| Azure        | centralindia.tts.speech.microsoft.com | 28.3ms      |
| Deepgram     | api.deepgram.com                      | 280.4ms     |

Deepgram’s 280ms RTT from India explains everything. Note that 280ms is already a full round trip, and each fresh HTTPS connection pays it more than once:

  • ~280ms for the TCP handshake
  • ~280ms for the request and response themselves
  • More still for the TLS handshake

That’s 560ms of pure network overhead before any synthesis even happens.

TTS Latency Results (India)

| Provider     | Short Text | Medium Text | Average |
|--------------|------------|-------------|---------|
| Azure Neural | 326ms      | 399ms       | 363ms   |
| Sarvam AI    | 774ms      | 1,130ms     | 952ms   |
| ElevenLabs   | 1,025ms    | 1,564ms     | 1,295ms |
| Deepgram     | 1,665ms    | 1,994ms     | 1,830ms |
| OpenAI TTS   | 1,859ms    | 2,399ms     | 2,129ms |

Winner from India: Azure Neural TTS (363ms)

Detailed Results: US Server

TTS Latency Results (US/Canada)

| Provider              | Short Text | Medium Text | Average |
|-----------------------|------------|-------------|---------|
| Deepgram              | 397ms      | 474ms       | 436ms   |
| ElevenLabs            | 796ms      | 1,286ms     | 1,041ms |
| Azure (Central India) | 1,712ms    | 2,040ms     | 1,876ms |
| OpenAI TTS            | 1,963ms    | 2,134ms     | 2,049ms |
| Sarvam AI             | 1,867ms    | 2,717ms     | 2,292ms |

Winner from US: Deepgram (436ms)


Provider Deep Dives

Deepgram Aura

Best for: US/North America deployments

Pros:

  • Fastest from US (436ms)
  • Cost-effective ($0.015/1K chars)
  • Good voice quality
  • Simple REST API

Cons:

  • US-only servers (280ms RTT from Asia)
  • Not suitable for global deployments without edge routing

Verdict: Excellent choice if your servers are in North America. Avoid for Asia/India.


Azure Neural TTS

Best for: Regional deployments (use local Azure region)

Pros:

  • Fastest from Asia when using regional endpoints (363ms)
  • Multiple regions available (East US, Central India, etc.)
  • Enterprise-grade reliability
  • SSML support for fine control

Cons:

  • Must provision correct region
  • Slightly more complex setup
  • Regional keys don’t work cross-region

Verdict: Best option for Asia deployments. Use the Azure region closest to your server.


ElevenLabs

Best for: Voice quality priority, global deployments

Pros:

  • Best voice quality (subjectively)
  • Decent global latency (~1 second)
  • WebSocket streaming available
  • Voice cloning capabilities

Cons:

  • Higher latency than Deepgram/Azure
  • More expensive
  • Rate limits on concurrent requests

Verdict: Choose when voice quality matters more than speed. Good middle-ground for global deployments.


Sarvam AI

Best for: Indian language support

Pros:

  • Native Indian language support (Hindi, Tamil, Telugu, etc.)
  • India-based servers (low latency from India)
  • Good for regional Indian deployments

Cons:

  • Slow from outside India
  • Limited to Indian languages
  • Smaller voice selection

Verdict: Best choice for Indian language voice AI. Not suitable for global English deployments.


OpenAI TTS

Best for: Simplicity (if already using OpenAI)

Pros:

  • Simple API (same as other OpenAI services)
  • Decent voice quality

Cons:

  • Consistently slow (~2 seconds)
  • No regional optimization
  • Not suitable for real-time voice AI

Verdict: Only use if you’re already locked into the OpenAI ecosystem and latency isn’t critical.


Google Cloud TTS (Chirp3-HD)

Best for: Batch processing, quality priority

Results from our full benchmark:

  • Short: 799ms
  • Medium: 1,375ms
  • Long: 3,438ms

Verdict: Chirp3-HD is too slow for real-time. Use Neural2 voices instead for voice agents.


Gemini TTS

Best for: Experimental use only

Results:

  • Short: 3,301ms
  • Medium: 5,243ms
  • Long: 11,512ms

Verdict: Not production-ready. Avoid for voice applications.

Latency Thresholds for Voice AI

How do users perceive different latencies?

| Latency   | User Perception      | Suitability      |
|-----------|----------------------|------------------|
| <200ms    | Imperceptible        | Excellent        |
| 200-400ms | Natural conversation | Good             |
| 400-600ms | Slightly noticeable  | Acceptable       |
| 600-800ms | Feels slow           | Use with caching |
| >800ms    | Frustrating          | Not recommended  |

Based on our tests:

  • Deepgram (US) and Azure (regional) are the only providers consistently under 500ms
  • Everything else requires TTS caching for acceptable user experience

Recommendations by Use Case

Real-Time Voice Agents (No Caching)

US/North America:

  1. Deepgram (436ms)
  2. ElevenLabs (1,041ms) – acceptable with streaming

India/Asia:

  1. Azure Neural – Central India region (363ms)
  2. Sarvam AI (952ms) – for Indian languages

Europe:

  1. Azure Neural – West Europe region
  2. ElevenLabs (global CDN)

Voice Agents with TTS Caching

If you implement response caching, provider choice matters less:

  • Cache hit: 3-5ms
  • Cache miss: 1-3 seconds

For survey bots or IVR systems with repetitive responses, 90%+ cache hit rates are achievable.

Batch Processing / Offline

When latency doesn’t matter:

  1. ElevenLabs – best quality
  2. Google Chirp3-HD – high quality
  3. Any provider based on cost

How to Reduce TTS Latency

1. Choose the Right Provider for Your Region

This is the biggest factor. Wrong region = 500ms+ penalty.

2. Implement TTS Caching

Cache common responses. A cache hit is 3-5ms vs 1-2 seconds for API calls.

const crypto = require('crypto');

// Example: Simple TTS cache key
const cacheKey = crypto
  .createHash('md5')
  .update(`${text}-${voice}-${language}`)
  .digest('hex');

3. Use Streaming Where Available

ElevenLabs and Deepgram support WebSocket streaming. Time-to-first-byte is faster than waiting for complete audio.
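The streaming win shows up in time-to-first-byte rather than total synthesis time. A provider-agnostic sketch, where the async iterable stands in for a WebSocket or chunked HTTP audio stream (this is not any provider’s actual API):

```javascript
// Measure time-to-first-chunk for any async iterable of audio chunks.
async function timeToFirstByte(audioStream) {
  const start = Date.now();
  for await (const chunk of audioStream) {
    // In a real agent you would start playback here instead of returning.
    return { ttfbMs: Date.now() - start, firstChunk: chunk };
  }
  return { ttfbMs: Date.now() - start, firstChunk: null };
}
```

With streaming, the user hears audio as soon as the first chunk lands, so perceived latency can be well under the full-synthesis numbers in the tables above.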

4. Optimize Text Length

Shorter text = faster TTS. Break long responses into chunks and stream them.
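Chunking can be as simple as splitting on sentence boundaries and synthesizing each chunk while the previous one plays. A naive sketch (real text needs abbreviation and number handling the regex below ignores):

```javascript
// Naive sentence splitter: breaks on ., !, ? followed by whitespace,
// then packs sentences into chunks no longer than maxLen characters.
function chunkText(text, maxLen = 150) {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    if (current && current.length + s.length + 1 > maxLen) {
      chunks.push(current);
      current = s;
    } else {
      current = current ? `${current} ${s}` : s;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```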

5. Pre-warm Common Phrases

For IVR systems, pre-generate audio for greetings and common responses during off-peak hours.
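Pre-warming is just iterating a phrase list through the cached synthesis path during quiet hours. A sketch, where `synthesizeAndCache` is a hypothetical wrapper around your cached TTS call:

```javascript
// Hypothetical pre-warm pass: synthesize common phrases ahead of time
// so live calls hit the cache. Sequential to stay under rate limits.
async function prewarm(phrases, synthesizeAndCache) {
  for (const phrase of phrases) {
    await synthesizeAndCache(phrase);
  }
}

const commonPhrases = [
  'Hello, how can I help you today?',
  'Please hold while I transfer your call.',
  'Thank you for calling. Goodbye!',
];
```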

Our Testing Tools

We’ve open-sourced our benchmark scripts:

Standalone TTS Benchmark:

# Test from any server with Node.js
node standalone-tts-benchmark.js

Tests Deepgram, ElevenLabs, Azure, Sarvam, and OpenAI with a fixed number of iterations.

Full Benchmark Suite:

# Includes network tests, speedtest, traceroute
./run-tts-benchmark.sh 5 en-IN

Available in our GitHub repository.

Conclusion

There’s no single “best” TTS provider. The right choice depends on:

  1. Server location – Use Deepgram in US, Azure in Asia
  2. Language requirements – Sarvam for Indian languages
  3. Quality vs speed tradeoff – ElevenLabs for quality, Deepgram/Azure for speed
  4. Budget – Deepgram is most cost-effective

For our Vexyl AI Voice Gateway, we recommend:

  • Default: Azure Neural with regional endpoint matching your server
  • US deployments: Deepgram
  • Indian languages: Sarvam AI
  • Premium quality: ElevenLabs with caching

About Vexyl AI

Vexyl AI builds open-source voice AI infrastructure. Our Voice Gateway connects any LLM to phone systems via Asterisk, with support for multiple STT and TTS providers.

Features:

  • Multi-provider STT/TTS support
  • Real-time voice activity detection
  • TTS caching for low latency
  • Human-in-the-loop call transfer
  • WebSocket and AudioSocket protocols

Get Started with Vexyl AI Voice Gateway


FAQ

Which TTS provider has the lowest latency?

It depends on location. Deepgram is fastest from US (436ms), Azure Neural is fastest from Asia (363ms) when using regional endpoints.

Why is Deepgram slow from India?

Deepgram only has US servers. The 280ms network round-trip time from India adds 560ms+ to every request.

Is ElevenLabs good for voice AI?

ElevenLabs has the best voice quality but ~1 second latency. It’s acceptable for voice AI with streaming, but not the fastest option.

What’s a good TTS latency for voice agents?

Under 500ms is ideal. 500-800ms is acceptable with caching. Over 1 second feels noticeably slow to users.

Should I use Google TTS Chirp3-HD?

No, Chirp3-HD is too slow (3+ seconds for long text). Use Google Neural2 voices instead for real-time applications.


Last updated: December 2025

Benchmark data collected using Vexyl AI Voice Gateway testing tools. All tests performed on production servers with real API calls.
