# Voice (LLM apps)

**Source:** https://promtable.com/glossary/voice

> Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.

---

By 2026 voice is a first-class LLM modality. The reference architecture is a three-stage streaming pipeline: streaming STT (Deepgram Nova-3, OpenAI Whisper, including large-v3) → LLM with prompt caching → streaming TTS (Cartesia Sonic 2, ElevenLabs Turbo, OpenAI Realtime voices). Integrated platforms (OpenAI Realtime API, Anthropic voice integrations) combine the layers and are the simplest path.

The hard production problem is keeping end-to-end latency under 800 ms so that the exchange feels conversational. Each stage contributes 100-300 ms, so the three stages together can use up most or all of the budget, and naive composition, where each stage waits for the previous one to finish, produces unusable lag. Best practice: stream at every stage, use prompt caching for the LLM stage, and pick a TTS engine with sub-200 ms first-byte streaming.
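The latency arithmetic can be made concrete with a small budget check. This is a minimal sketch, not a benchmark: the `StageLatency` type, the helper names, and every number in `EXAMPLE_STAGES` are illustrative assumptions, chosen only to show how streamed and blocking composition differ against the 800 ms target.

```python
from dataclasses import dataclass

BUDGET_MS = 800  # conversational target quoted in the text


@dataclass(frozen=True)
class StageLatency:
    name: str
    first_output_ms: int  # time until the stage emits its first usable output
    total_ms: int         # time until the stage has finished completely


def streamed_latency_ms(stages):
    """Streaming: each stage starts as soon as the previous one emits output."""
    return sum(s.first_output_ms for s in stages)


def blocking_latency_ms(stages):
    """Naive composition: each stage waits for the previous one to finish."""
    return sum(s.total_ms for s in stages)


def within_budget(latency_ms, budget_ms=BUDGET_MS):
    return latency_ms <= budget_ms


# Hypothetical placeholder figures, not measurements of any real provider.
EXAMPLE_STAGES = [
    StageLatency("stt", first_output_ms=150, total_ms=400),
    StageLatency("llm", first_output_ms=250, total_ms=1200),
    StageLatency("tts", first_output_ms=150, total_ms=900),
]

if __name__ == "__main__":
    streamed = streamed_latency_ms(EXAMPLE_STAGES)   # 550
    blocking = blocking_latency_ms(EXAMPLE_STAGES)   # 2500
    print(f"streamed: {streamed} ms, within budget: {within_budget(streamed)}")
    print(f"blocking: {blocking} ms, within budget: {within_budget(blocking)}")
```

With these assumed figures the streamed path fits the budget and the blocking path misses it by well over a second. In practice the inputs should come from measured per-stage timings.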

## When to use

- Conversational assistants.
- Accessibility features (voice-first UX).
- Phone / IVR replacement.

## Common mistakes

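- Non-streaming composition: waiting for each stage to finish before starting the next adds 1-2s of lag.
- Skipping prompt caching on the LLM stage: the model reprocesses the full prompt on every turn, which eats into the total latency budget.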
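To illustrate what streaming composition looks like in code, here is a minimal sketch using Python async generators. The three stages (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins, not real vendor clients, and the sentence-splitting rule is a deliberately simple assumption. The point is the shape: TTS starts on the first complete sentence while the LLM is still generating.

```python
import asyncio
from typing import AsyncIterator, List

LOG: List[str] = []  # records the order in which stages do work


async def fake_stt(audio_frames: List[bytes]) -> AsyncIterator[str]:
    """Stand-in for a streaming STT client; yields one final transcript."""
    await asyncio.sleep(0)
    yield "what is the weather"


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Stand-in for a token-streaming LLM call."""
    for token in ["It ", "is ", "sunny. ", "Bring ", "sunglasses."]:
        await asyncio.sleep(0)
        LOG.append(f"llm:{token.strip()}")
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into sentences so TTS can start before the LLM is done."""
    buf = ""
    async for token in tokens:
        buf += token
        if buf.rstrip().endswith((".", "!", "?")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()


async def fake_tts(chunks: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for a streaming TTS client; one audio blob per text chunk."""
    async for chunk in chunks:
        await asyncio.sleep(0)
        LOG.append(f"tts:{chunk}")
        yield f"<audio:{chunk}>".encode()


async def run_turn(audio_frames: List[bytes]) -> List[bytes]:
    """Chain STT -> LLM -> TTS; each stage pulls from the previous one."""
    LOG.clear()
    played = []
    async for transcript in fake_stt(audio_frames):
        async for audio in fake_tts(sentences(fake_llm(transcript))):
            played.append(audio)  # a real app would send this to the speaker
    return played


if __name__ == "__main__":
    print(asyncio.run(run_turn([b"\x00"])))
    print(LOG)
```

In the recorded `LOG`, the entry for synthesizing the first sentence appears before the LLM has produced the token `Bring`, which is the behaviour that non-streaming composition gives up.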

## Related terms

- [multimodal](https://promtable.com/glossary/multimodal)
- [prompt-caching](https://promtable.com/glossary/prompt-caching)
- [agent](https://promtable.com/glossary/agent)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/voice
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/voice".
Contact: info@vibecodingturkey.com.