Voice (LLM apps)

Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.

By 2026 voice is a first-class LLM modality. The reference architecture is streaming STT (Deepgram Nova-3, OpenAI Whisper large-v3) → LLM with prompt caching → streaming TTS (Cartesia Sonic 2, ElevenLabs Turbo, OpenAI Realtime voices). Integrated platforms (OpenAI Realtime API, Anthropic voice integrations) combine the layers and are the simplest path. The hard production problem is keeping end-to-end latency under 800ms so the exchange feels conversational: every stage contributes 100-300ms, so three stages already consume 300-900ms, and naive composition, where each stage waits for the previous one to finish, produces unusable lag. Best practice is to stream throughout, use prompt caching for the LLM stage, and pick a TTS with sub-200ms first-byte latency.
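
The latency argument can be checked with simple arithmetic. The sketch below is illustrative only: the `Stage` type, the helper names, and every millisecond figure are assumptions picked to sit in a plausible range, not measurements of any provider. It compares time to first audio when each stage waits for the previous one to finish against a streamed pipeline, under the simplifying assumption that each stage can start as soon as the upstream stage emits its first chunk.

```python
from dataclasses import dataclass
from typing import Sequence

BUDGET_MS = 800  # conversational target named in the text


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # delay until the stage emits its first chunk
    full_output_ms: int   # delay until the stage has emitted everything


# Hypothetical figures for illustration; measure your own stack.
PIPELINE = [
    Stage("stt", first_output_ms=150, full_output_ms=400),
    Stage("llm", first_output_ms=300, full_output_ms=900),
    Stage("tts", first_output_ms=150, full_output_ms=700),
]


def blocking_first_audio_ms(stages: Sequence[Stage]) -> int:
    """Naive composition: every stage runs to completion before the next starts."""
    return sum(s.full_output_ms for s in stages)


def streaming_first_audio_ms(stages: Sequence[Stage]) -> int:
    """Streamed composition: only each stage's first-output delay is on the critical path."""
    return sum(s.first_output_ms for s in stages)


if __name__ == "__main__":
    for label, fn in [("blocking", blocking_first_audio_ms),
                      ("streaming", streaming_first_audio_ms)]:
        total = fn(PIPELINE)
        verdict = "within" if total <= BUDGET_MS else "over"
        print(f"{label}: {total}ms ({verdict} the {BUDGET_MS}ms budget)")
```

With these assumed numbers the blocking path takes 2000ms to first audio and the streamed path takes 600ms, so only the streamed pipeline fits the 800ms budget.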

When to use voice (LLM apps)

Conversational assistants, accessibility features (voice-first UX), and phone or IVR replacement.

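Common mistakes

Non-streaming composition, which adds 1-2s of lag, and skipping prompt caching on the LLM stage, which eats into the total latency budget.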

FAQ

What is voice (LLM apps)?

Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.

When should I use voice (LLM apps)?

Use voice for conversational assistants, accessibility features (voice-first UX), and phone or IVR replacement.

What are the most common mistakes with voice (LLM apps)?

Two stand out. Non-streaming composition, where each stage waits for the previous one to finish, adds 1-2s of lag. Skipping prompt caching on the LLM stage eats into the total latency budget.
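
The non-streaming mistake is easiest to see by looking at what the streamed alternative does differently. The following is a minimal sketch using Python async generators. `fake_llm_tokens` and `fake_tts` are hypothetical stand-ins for real streaming LLM and TTS clients, and the sentence-boundary chunking rule is an assumption, not a recommendation from any vendor. The point it illustrates is that audio for the first sentence is produced while the LLM is still generating the rest of the reply.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fake_llm_tokens(prompt: str, trace: List[str]) -> AsyncIterator[str]:
    """Hypothetical streaming LLM: yields a canned reply one token at a time."""
    for token in ["Sure", ".", " Your", " order", " shipped", "."]:
        await asyncio.sleep(0)  # yield control, as a network read would
        trace.append(f"token:{token.strip()}")
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into sentences so TTS can start before the reply is complete."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing text without punctuation
        yield buffer.strip()


async def fake_tts(sentence_stream: AsyncIterator[str],
                   trace: List[str]) -> AsyncIterator[bytes]:
    """Hypothetical streaming TTS: returns bytes standing in for audio."""
    async for sentence in sentence_stream:
        trace.append(f"audio:{sentence}")
        yield sentence.encode("utf-8")


async def run_pipeline(prompt: str) -> Tuple[List[bytes], List[str]]:
    """Chain LLM -> sentence chunker -> TTS and record the order of events."""
    trace: List[str] = []
    tokens = fake_llm_tokens(prompt, trace)
    audio = fake_tts(sentences(tokens), trace)
    chunks = [chunk async for chunk in audio]
    return chunks, trace


if __name__ == "__main__":
    _, trace = asyncio.run(run_pipeline("Where is my order?"))
    print("\n".join(trace))
```

In the printed trace, the `audio:Sure.` entry appears before the tokens of the second sentence. A blocking composition would instead log every token first and only then produce any audio.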

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/voice.md.