SSML (Speech Synthesis Markup Language)
SSML is the W3C's XML-based markup language for text-to-speech (TTS). It controls pronunciation, prosody (rate, pitch, volume), pauses, emphasis, voice changes, and audio insertion. Google Cloud TTS, Amazon Polly, and Azure Speech support most of the standard, though tag coverage varies by provider and voice; ElevenLabs and others support smaller subsets as of 2026.
Plain text sent to a TTS engine gets default prosody, which is fine for short utterances but limiting for longer or more nuanced content. SSML adds control: `<break time='500ms'/>` for explicit pauses, `<prosody rate='slow' pitch='+2st'>...</prosody>` to shape delivery and suggest emotion, `<phoneme alphabet='ipa' ph='...'>...</phoneme>` for tricky pronunciation, `<emphasis level='strong'>...</emphasis>` for stress, and `<say-as interpret-as='date' format='ymd'>2026-06-01</say-as>` for structured reads. Production uses include audiobook narration, IVR and phone agents that need correct brand pronunciation, and multilingual content that mixes languages mid-sentence. Trade-offs: full SSML is verbose, and providers with partial SSML support limit which tags are available. Some 2026 TTS APIs (for example ElevenLabs Eleven v3) accept inline emotion tags such as `[laughs]` and `[whispers]` as a lighter alternative to SSML.
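As a rough illustration of how the tags above combine, here is a minimal Python sketch that assembles one SSML document and checks that it is well-formed XML using only the standard library. The function name `build_ssml` and the sample sentences are invented for this example, and whether a given provider honours every tag is an assumption to verify against that provider's documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Assemble a small SSML document (hypothetical helper, sample text only)."""
    return (
        "<speak>"
        "Your order ships on "
        "<say-as interpret-as='date' format='ymd'>2026-06-01</say-as>."
        "<break time='500ms'/>"
        "<prosody rate='slow' pitch='+2st'>Thank you for waiting.</prosody> "
        "<emphasis level='strong'>Please</emphasis> keep your receipt."
        "</speak>"
    )


ssml = build_ssml()

# ET.fromstring raises xml.etree.ElementTree.ParseError on malformed XML,
# so parsing doubles as a cheap well-formedness check before calling a TTS API.
root = ET.fromstring(ssml)

# List the elements in document order, starting with the <speak> root.
tags = [el.tag for el in root.iter()]
print(tags)  # ['speak', 'say-as', 'break', 'prosody', 'emphasis']
```

Parsing locally only confirms that the markup is well-formed XML; it does not confirm that a particular engine or voice supports each tag.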
When to use SSML (Speech Synthesis Markup Language)
- Audiobook and long-form narration.
- IVR and phone agents that need correct brand pronunciations.
- Multilingual content that mixes languages mid-sentence.
Common mistakes
- Forgetting to escape `&`, `<`, and `>` in source text, which makes the SSML parser reject the input.
- Over-marking: too many `<break>` tags make the voice sound halting.
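The escaping mistake can be avoided by escaping user text before wrapping it in SSML. The sketch below uses `xml.sax.saxutils.escape` from the Python standard library; the helper names `to_ssml` and `is_well_formed`, the default pause length, and the sample sentence are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ssml(text: str, pause_ms: int = 300) -> str:
    """Wrap plain text in <speak>, escaping &, <, > first (hypothetical helper)."""
    safe = escape(text)  # replaces & with &amp;, < with &lt;, > with &gt;
    return f"<speak>{safe}<break time='{pause_ms}ms'/></speak>"


def is_well_formed(ssml: str) -> bool:
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


raw = "Terms & conditions apply if total < 50"

# Unescaped text breaks the XML, so a parser would reject it.
print(is_well_formed(f"<speak>{raw}</speak>"))  # False

# Escaped text parses, and the parser hands back the original characters.
ssml = to_ssml(raw)
print(is_well_formed(ssml))  # True
```

Escaping should be applied only to the raw text, not to the finished SSML string, otherwise the tags themselves would be escaped as well.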
FAQ
What is SSML (Speech Synthesis Markup Language)?
SSML is the W3C's XML-based markup language for text-to-speech (TTS). It controls pronunciation, prosody (rate, pitch, volume), pauses, emphasis, voice changes, and audio insertion. Google Cloud TTS, Amazon Polly, and Azure Speech support most of the standard, though tag coverage varies by provider and voice; ElevenLabs and others support smaller subsets as of 2026.
When should I use SSML (Speech Synthesis Markup Language)?
Use it for audiobook and long-form narration, for IVR and phone agents that need correct brand pronunciations, and for multilingual content that mixes languages mid-sentence.
What are the most common mistakes with SSML (Speech Synthesis Markup Language)?
Forgetting to escape `&`, `<`, and `>` in source text, which makes the SSML parser reject the input. Over-marking is the other: too many `<break>` tags make the voice sound halting.
Related terms
- Neural TTS: the modern (2017+) generation of text-to-speech built on neural networks (Tacotron, WaveNet, FastSpeech, VITS, GPT-SoVITS), and the foundation of consumer TTS that doesn't sound robotic. It largely replaced concatenative and parametric TTS by around 2020.
- Voice cloning: takes a sample of someone speaking, sometimes as little as 30 seconds, and produces a model that can synthesise new speech in that voice.
Sources
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/ssml.md.