SSML (Speech Synthesis Markup Language)

SSML is a W3C-standardized, XML-based markup language for text-to-speech (TTS). It controls pronunciation, prosody (rate, pitch, volume), pauses, emphasis, voice changes, and audio insertion. Google Cloud TTS, Amazon Polly, and Azure Speech offer broad SSML support; ElevenLabs and several other providers support subsets as of 2026.

Plain text sent to a TTS engine gets the engine's default prosody. That works for short utterances but is limiting for longer or more nuanced content. SSML adds explicit control: `<break time='500ms'/>` inserts a pause, `<prosody rate='slow' pitch='+2st'>...</prosody>` adjusts rate and pitch to shape tone and emotion, `<phoneme alphabet='ipa' ph='...'/>` fixes tricky pronunciations, `<emphasis level='strong'>...</emphasis>` adds stress, and `<say-as interpret-as='date' format='ymd'>2026-06-01</say-as>` controls how structured values such as dates are read.

Production uses include audiobook narration, IVR and phone agents that must pronounce brand names correctly, and multilingual content that switches languages mid-sentence.

Trade-offs: full SSML is verbose, and providers with partial SSML support limit which tags you can use. Some 2026 TTS APIs, such as ElevenLabs Eleven v3, instead accept inline emotion tags like `[laughs]` and `[whispers]` as a lighter alternative to SSML.
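The tags above can be combined into one document. The sketch below is a minimal Python example that builds SSML with the standard library's `xml.etree.ElementTree`, which escapes `&`, `<`, and `>` in text automatically. It assumes a provider that accepts a bare `<speak>` root; some providers (Azure Speech, for instance) also expect attributes such as `version`, `xmlns`, and `xml:lang` on `<speak>`, so check your provider's documentation. The function `build_ssml` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(intro: str, date_iso: str, closing: str) -> str:
    """Hypothetical helper: combine break, say-as, prosody, and emphasis tags.

    Attribute values (500ms, ymd, slow, +2st, strong) mirror the examples in
    the text; which tags and values are accepted varies by provider.
    """
    speak = ET.Element("speak")
    # Leading text; ElementTree escapes &, < and > when serializing.
    speak.text = intro
    # Explicit 500 ms pause.
    ET.SubElement(speak, "break", time="500ms")
    # Ask the engine to read the value as a year-month-day date.
    say_as = ET.SubElement(speak, "say-as", {"interpret-as": "date", "format": "ymd"})
    say_as.text = date_iso
    # Slower, slightly higher delivery wrapping a strongly stressed phrase.
    prosody = ET.SubElement(speak, "prosody", rate="slow", pitch="+2st")
    emphasis = ET.SubElement(prosody, "emphasis", level="strong")
    emphasis.text = closing
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Release date:", "2026-06-01", "Thanks for listening"))
```

Building the tree programmatically, rather than concatenating strings, keeps the markup well-formed even when the input text contains reserved characters.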

FAQ

What is SSML (Speech Synthesis Markup Language)?

A W3C-standardized, XML-based markup language that tells a TTS engine how to read text: pronunciation, prosody (rate, pitch, volume), pauses, emphasis, voice changes, and inserted audio. Google Cloud TTS, Amazon Polly, and Azure Speech offer broad support; ElevenLabs and several other providers support subsets as of 2026.

When should I use SSML (Speech Synthesis Markup Language)?

For audiobooks and other long-form narration, IVR and phone agents that must pronounce brand names correctly, and multilingual content that mixes languages.

What are the most common mistakes with SSML (Speech Synthesis Markup Language)?

Forgetting to escape `&`, `<`, and `>` in the source text, which makes the SSML parser reject the input. Over-marking: too many `<break>` tags make the voice sound halting.
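Both mistakes can be caught before a request is sent. This is a minimal Python sketch using only the standard library: `xml.sax.saxutils.escape` for escaping and `xml.etree.ElementTree` for a well-formedness check. The helper names and the breaks-per-100-words metric are hypothetical, and any threshold for "too many" breaks is a judgment call rather than a provider rule. It also assumes SSML without a default XML namespace, since a namespaced `<speak>` would need namespaced tag lookups.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_plain_text(text: str) -> str:
    """Hypothetical helper: escape &, < and > before wrapping text in <speak>."""
    return f"<speak>{escape(text)}</speak>"


def is_well_formed(ssml: str) -> bool:
    """Return True if the SSML parses as XML; a bare '&' or '<' makes it fail."""
    try:
        ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return True


def break_density(ssml: str) -> float:
    """Hypothetical over-marking heuristic: <break> tags per 100 spoken words."""
    root = ET.fromstring(ssml)
    # itertext() yields element text and tails, i.e. everything that is spoken.
    words = len("".join(root.itertext()).split())
    breaks = sum(1 for _ in root.iter("break"))
    return 100.0 * breaks / max(words, 1)


print(is_well_formed("<speak>Tom & Jerry</speak>"))
print(wrap_plain_text("Tom & Jerry <3"))
```

The first call reports the unescaped `&` as malformed; the second produces `<speak>Tom &amp; Jerry &lt;3</speak>`, which parses cleanly.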

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/ssml.md.