concept

Neural TTS

Neural TTS is the modern generation of text-to-speech (roughly 2017 onward), built on neural networks such as Tacotron, WaveNet, FastSpeech, VITS, and GPT-SoVITS. It is the foundation behind every consumer TTS that doesn't sound robotic, and it had replaced concatenative and parametric TTS by around 2020.

Pre-neural TTS sounded robotic: concatenative systems splice recorded speech fragments together, and parametric systems synthesize speech from acoustic features. Neural TTS uses deep learning to map text directly to mel spectrograms or audio waveforms, which yields natural prosody, expression, and pacing. Notable architectures include Tacotron + WaveNet (early), FastSpeech 2 (faster), VITS (end-to-end), GPT-SoVITS and XTTS (zero-shot voice cloning), and Voicebox (Meta, controllable).

Production TTS APIs in 2026 (Google Chirp 3, Azure Neural, ElevenLabs, Cartesia Sonic, OpenAI TTS, Edge TTS) all use neural architectures. The user-visible benefits are natural prosody, multilingual coverage, voice cloning, emotion and style control, and sub-second streaming latency.
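The text-to-spectrogram-to-waveform flow used by two-stage systems such as Tacotron + WaveNet can be sketched roughly as below. This is a toy illustration of the data flow only, not a real model: the function names, the constants (sample rate, hop length, number of mel bands, frames per character), and the random or silent outputs are all assumptions made for the example.

```python
import random

# All constants below are illustrative assumptions, not values from any specific model.
SAMPLE_RATE = 22050   # output audio sample rate in Hz
HOP_LENGTH = 256      # audio samples generated per mel frame
N_MELS = 80           # mel bands per spectrogram frame
FRAMES_PER_CHAR = 5   # toy stand-in for a learned duration predictor


def text_to_ids(text: str) -> list[int]:
    """Toy front end: map each character to an integer ID."""
    return [ord(ch) for ch in text.lower()]


def acoustic_model(ids: list[int]) -> list[list[float]]:
    """Placeholder for a Tacotron/FastSpeech-style model: IDs -> mel frames.

    A real model predicts these values; here they are seeded random numbers.
    """
    rng = random.Random(0)
    n_frames = len(ids) * FRAMES_PER_CHAR
    return [[rng.random() for _ in range(N_MELS)] for _ in range(n_frames)]


def vocoder(mel: list[list[float]]) -> list[float]:
    """Placeholder for a WaveNet-style vocoder: mel frames -> audio samples.

    Returns silence of the correct length; a real vocoder generates speech.
    """
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> list[float]:
    """Run the full toy pipeline: text -> IDs -> mel spectrogram -> waveform."""
    ids = text_to_ids(text)
    mel = acoustic_model(ids)
    return vocoder(mel)


audio = synthesize("Hello")
print(f"{len(audio)} samples, {len(audio) / SAMPLE_RATE:.2f} s")
# prints "6400 samples, 0.29 s"
```

End-to-end architectures such as VITS fold the two model stages into one network, so in this sketch the `acoustic_model` and `vocoder` placeholders would collapse into a single call.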

When to use neural TTS

Common mistakes

FAQ

What is neural TTS?

Neural TTS is text-to-speech built on neural networks such as Tacotron, WaveNet, FastSpeech, VITS, and GPT-SoVITS. It emerged around 2017, had replaced concatenative and parametric TTS by around 2020, and is the foundation behind every consumer TTS that doesn't sound robotic.

When should I use neural TTS?

For any production TTS deployment today.

What are the most common mistakes with neural TTS?

Using non-neural TTS in 2026: it sounds dated and rules out voice cloning and style-control features.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/neural-tts.md.