Voice (LLM apps)
Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.
By 2026 voice is a first-class LLM modality. The reference architecture is a three-stage streaming chain: streaming STT (Deepgram Nova-3, OpenAI Whisper, including large-v3) → LLM with prompt caching → streaming TTS (Cartesia Sonic 2, ElevenLabs Turbo, OpenAI Realtime voices). Integrated platforms (OpenAI Realtime API, Anthropic voice integrations) combine the layers and are the simplest path. The hard production problem is keeping end-to-end latency under 800 ms, the rough threshold for a conversational feel: every stage contributes 100-300 ms, so naive composition, where each stage waits for the previous one to finish, produces unusable lag. Best practice: stream at every stage, use prompt caching for the LLM stage, and pick a TTS engine that streams its first byte in under 200 ms.
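The latency figures above can be checked with simple arithmetic. The sketch below is illustrative only: the `Stage` class, the helper names, and the per-stage numbers are assumptions chosen from the 100-300 ms range quoted above, not measurements of any particular provider.

```python
from dataclasses import dataclass

BUDGET_MS = 800  # conversational target quoted in the entry


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first usable chunk


def total_latency_ms(stages):
    """Sum the time-to-first-output of each stage in a streaming chain."""
    return sum(stage.first_output_ms for stage in stages)


def within_budget(stages, budget_ms=BUDGET_MS):
    """Return True if the chain's combined first-output latency fits the budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical figures, picked from the 100-300 ms range (not benchmarks).
typical = [
    Stage("stt_final_transcript", 200),
    Stage("llm_first_token", 300),
    Stage("tts_first_byte", 150),
]

# Every stage at the top of the range.
worst_case = [
    Stage("stt_final_transcript", 300),
    Stage("llm_first_token", 300),
    Stage("tts_first_byte", 300),
]

print(total_latency_ms(typical), within_budget(typical))
print(total_latency_ms(worst_case), within_budget(worst_case))
```

With all three stages at the top of the range, the chain already exceeds the 800 ms target before any network overhead is counted, which is why each stage's first-output time matters more than its total processing time.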
When to use voice (LLM apps)
- Conversational assistants.
- Accessibility features (voice-first UX).
- Phone / IVR replacement.
Common mistakes
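- Non-streaming composition: running each stage to completion before the next one starts adds 1-2 s of lag.
- Skipping prompt caching on the LLM stage: the model reprocesses the same prompt prefix on every turn, which eats into the total latency budget.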
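One common way to avoid non-streaming composition between the LLM and TTS stages is to forward text to TTS sentence by sentence instead of waiting for the full reply. The sketch below assumes the LLM client yields text fragments as they arrive; `sentence_chunks` and the sample fragments are hypothetical and not tied to any provider's SDK.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(fragments: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM text fragments into sentences.

    Each sentence is yielded as soon as its closing punctuation arrives,
    so a TTS stage can start speaking before the LLM has finished.
    """
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without punctuation.
    if buffer.strip():
        yield buffer.strip()


# Hypothetical fragments standing in for a streamed LLM response.
llm_fragments = [
    "Sure", ".", " The", " store", " opens", " at", " nine", ".",
    " Anything", " else", "?",
]

chunks = list(sentence_chunks(llm_fragments))
for chunk in chunks:
    print(chunk)  # in a real pipeline, each chunk would go to the TTS stream
```

This naive boundary check splits early on abbreviations and decimals ("Dr.", "3.5"), so a production pipeline would likely need a more careful segmenter.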
FAQ
What is voice (LLM apps)?
Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.
When should I use voice (LLM apps)?
Use it for conversational assistants, accessibility features (voice-first UX), and phone / IVR replacement.
What are the most common mistakes with voice (LLM apps)?
Non-streaming composition, which adds 1-2 s of lag, and skipping prompt caching on the LLM stage, which eats into the total latency budget.
Related terms
- Multimodal model — A multimodal model accepts more than one input type — text plus images, audio, or video — and reasons across them in a single forward pass.
- Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix so the API charges and computes the prefix only once across many calls.
- AI agent — An AI agent is a system where a language model autonomously plans and executes a sequence of tool calls to accomplish a goal.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/voice.md.