# Voice pipeline

**Source:** https://promtable.com/glossary/voice-pipeline

> A voice pipeline is the chain of audio processing stages — VAD → STT → LLM → TTS → playback — composed in a streaming framework like Pipecat, LiveKit Agents, Vocode, or a managed platform's internal stack.

---

A voice agent needs several stages. Voice activity detection ([[vad]]) detects when the user is speaking, [[streaming-stt]] converts speech to text incrementally, the LLM reasons and generates a response (often with tool calls), [[neural-tts]] streams that response back as audio, and audio playback handles the actual sound output. Each stage has streaming and non-streaming variants; sub-second latency requires streaming throughout. Frameworks like Pipecat express this as composable 'frames' flowing between processors; LiveKit Agents uses pluggable nodes; managed platforms (Vapi, Retell) hide the pipeline behind a dashboard. Production tuning is a trade-off between latency (more streaming means lower latency but more complexity) and quality (per-stage smarts versus simple chaining).
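To make the stage chain concrete, here is a minimal sketch of VAD → STT → LLM → TTS → playback built from plain Python async generators. It does not use the Pipecat or LiveKit Agents APIs; every stage is a hypothetical stub (`vad`, `stt`, `llm`, `tts`, `microphone` are illustrative names), and the frame layout is an assumption made only for this example.

```python
import asyncio
from typing import AsyncIterator, Dict, List

Frame = Dict[str, object]  # assumed frame shape: {"word": str, "is_speech": bool}


async def microphone() -> AsyncIterator[Frame]:
    """Hypothetical audio source: silence, two spoken words, silence."""
    for word, is_speech in [("", False), ("hello", True), ("there", True), ("", False)]:
        yield {"word": word, "is_speech": is_speech}


async def vad(frames: AsyncIterator[Frame]) -> AsyncIterator[Frame]:
    """Stub VAD: forward only frames flagged as speech."""
    async for frame in frames:
        if frame["is_speech"]:
            yield frame


async def stt(frames: AsyncIterator[Frame]) -> AsyncIterator[str]:
    """Stub streaming STT: emit one word per speech frame as it arrives."""
    async for frame in frames:
        yield str(frame["word"])


async def llm(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stub LLM: wait for the end of the user's turn, then stream reply tokens."""
    heard = [w async for w in words]
    for token in ["You", "said:"] + heard:
        yield token


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stub streaming TTS: turn each token into an 'audio chunk' immediately."""
    async for token in tokens:
        yield f"<audio:{token}>"


async def run_pipeline() -> List[str]:
    """Compose the stages and collect the chunks that playback would receive."""
    played: List[str] = []
    async for chunk in tts(llm(stt(vad(microphone())))):
        played.append(chunk)  # a real agent would write to the audio device here
    return played


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

Because each stage consumes and yields items one at a time, downstream stages can start work before upstream stages finish, which is the property the frameworks above provide with their own processor or node abstractions.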

## When to use

- Building voice agents from scratch.

## Common mistakes

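- Non-streaming TTS: playback cannot start until the whole sentence has been synthesized, which adds a full sentence of latency.
- Skipping VAD: the agent cannot tell when the user is speaking, so it talks over them.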
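
The cost of non-streaming TTS can be illustrated with a toy latency model. This is a sketch under stated assumptions, not a benchmark: the function name, the per-token timings, and the chunk size are all hypothetical, and real systems also pay network and model warm-up costs that are ignored here.

```python
from typing import Optional


def time_to_first_audio_ms(
    token_count: int,
    llm_ms_per_token: int,
    tts_ms_per_token: int,
    chunk_tokens: Optional[int] = None,
) -> int:
    """Milliseconds from the LLM starting to generate until audio can start playing.

    chunk_tokens=None models non-streaming TTS: wait for the full response,
    then synthesize all of it before playback.
    Otherwise TTS starts as soon as the first chunk_tokens tokens are available.
    """
    first = token_count if chunk_tokens is None else min(chunk_tokens, token_count)
    # Time to generate the first unit of text plus time to synthesize it.
    return first * llm_ms_per_token + first * tts_ms_per_token


if __name__ == "__main__":
    # Assumed numbers: 40-token reply, 25 ms per LLM token, 10 ms of TTS work per token.
    print(time_to_first_audio_ms(40, 25, 10))                  # non-streaming
    print(time_to_first_audio_ms(40, 25, 10, chunk_tokens=5))  # streaming, 5-token chunks
```

With these assumed numbers, the non-streaming path waits 1400 ms before the first sound, while the streaming path waits 175 ms; the gap grows with response length because only the non-streaming path scales with the full token count.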

## Related terms

- [vad](https://promtable.com/glossary/vad)
- [streaming-stt](https://promtable.com/glossary/streaming-stt)
- [voice-agent-platform](https://promtable.com/glossary/voice-agent-platform)

*Last updated: 2026-06-01*
---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/voice-pipeline".
Contact: info@vibecodingturkey.com.