Voice pipeline
A voice pipeline is the chain of audio processing stages (VAD → STT → LLM → TTS → playback) behind a voice agent, composed either in a streaming framework such as Pipecat, LiveKit Agents, or Vocode, or inside a managed platform's internal stack.
A voice agent chains several stages: voice activity detection ([[vad]]) detects when the user is speaking, [[streaming-stt]] converts speech to text incrementally, the LLM reasons and generates a response (often calling tools), [[neural-tts]] streams that response back as audio, and audio playback handles the actual sound output. Each stage comes in streaming and non-streaming variants; sub-second latency requires streaming at every stage, because a single non-streaming stage makes everything downstream wait for its complete output. Frameworks like Pipecat express this as composable 'frames' flowing between processors; LiveKit Agents uses pluggable nodes; managed platforms (Vapi, Retell) hide the pipeline behind a dashboard. Production tuning is a trade-off: more streaming lowers latency but adds complexity, and smarter per-stage logic improves quality but is harder to build than simple chaining.
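The stage chain described above can be sketched with plain Python async generators, where each stage consumes the previous stage's stream and yields its own output as soon as it is ready. This is only an illustration under stated assumptions: every function name here (`mic`, `vad`, `stt`, `llm`, `tts`, `run_pipeline`) is hypothetical, the "audio" is just strings, and none of this is the API of Pipecat, LiveKit Agents, or Vocode.

```python
import asyncio
from typing import AsyncIterator


async def mic(frames: list[str]) -> AsyncIterator[str]:
    """Hypothetical audio source: yields one fake 'frame' at a time."""
    for frame in frames:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield frame


async def vad(frames: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in VAD: treat empty frames as silence and drop them."""
    async for frame in frames:
        if frame:
            yield frame


async def stt(frames: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in streaming STT: emit one lowercase 'word' per speech frame."""
    async for frame in frames:
        yield frame.lower()


async def llm(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in LLM: collect the utterance, then stream reply tokens one by one."""
    heard = []
    async for word in words:
        heard.append(word)
    for token in ["you", "said:"] + heard:
        await asyncio.sleep(0)
        yield token


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in streaming TTS: turn each token into a fake audio chunk immediately."""
    async for token in tokens:
        yield f"<audio:{token}>"


async def run_pipeline(frames: list[str]) -> list[str]:
    """Compose the stages and 'play' chunks by collecting them in order."""
    played = []
    async for chunk in tts(llm(stt(vad(mic(frames))))):
        played.append(chunk)
    return played


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["HELLO", "", "WORLD"])))
```

Because each stage is a generator, the fake TTS starts producing chunks as soon as the first reply token arrives rather than waiting for the whole reply, which is the property the streaming variants of real stages are meant to provide.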
When to use a voice pipeline
- Building voice agents from scratch.
Common mistakes
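- Non-streaming TTS: the full sentence must be synthesized before playback starts, which adds that much latency to every reply.
- Skipping VAD: without speech detection the agent cannot tell when the user starts speaking, so the model talks over them.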
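Both mistakes can be illustrated with a small sketch. The first function groups streamed LLM tokens into sentences so a TTS engine could start speaking after the first sentence instead of after the whole reply; the second stops playback when a crude energy check suggests the user has started talking. The names `sentences_from_tokens` and `play_with_barge_in`, the punctuation-based splitting rule, and the `0.5` energy threshold are all assumptions made for illustration; production systems typically use a trained VAD model and more careful sentence segmentation.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a sentence as soon as the token stream completes one."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()  # hand this sentence to TTS right away
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


def play_with_barge_in(
    audio_chunks: Iterable[str],
    mic_energy: Iterable[float],
    threshold: float = 0.5,  # hypothetical value, not a recommended setting
) -> list[str]:
    """Play chunks until mic energy suggests speech; return the chunks played.

    One mic energy reading is paired with each chunk; playback also ends
    when either sequence runs out.
    """
    played = []
    for chunk, energy in zip(audio_chunks, mic_energy):
        if energy >= threshold:  # crude energy-based VAD: user started talking
            break
        played.append(chunk)
    return played


if __name__ == "__main__":
    tokens = ["Hi", " there", ".", " How", " are", " you", "?"]
    print(list(sentences_from_tokens(tokens)))
    print(play_with_barge_in(["a", "b", "c"], [0.1, 0.2, 0.9]))
```

Sentence-level chunking is one common compromise: it keeps enough context for natural prosody while still letting playback begin before the LLM has finished generating.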
FAQ
What is a voice pipeline?
A voice pipeline is the chain of audio processing stages (VAD → STT → LLM → TTS → playback) behind a voice agent, composed either in a streaming framework such as Pipecat, LiveKit Agents, or Vocode, or inside a managed platform's internal stack.
When should I use a voice pipeline?
When you are building voice agents from scratch.
What are the most common mistakes with a voice pipeline?
Non-streaming TTS — adds full sentence latency before playback starts. Skipping VAD — model talks over the user.
Related terms
- Voice activity detection (VAD) — Voice activity detection is the lightweight signal-processing step that determines whether incoming audio contains speech — used to start STT, trigger barge-in, and gate microphone use in voice agents.
- Streaming STT — Streaming STT (speech-to-text) emits partial transcriptions as the user speaks — instead of waiting for end-of-utterance — enabling sub-second response from a voice assistant.
- Voice agent platform — A voice agent platform is a managed stack that combines STT + LLM + TTS + telephony into a single API for building production phone / voice agents — Vapi, Retell, Bland are the 2026 leaders.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/voice-pipeline.md.