concept

Voice pipeline

A voice pipeline is the chain of audio processing stages — VAD → STT → LLM → TTS → playback — composed in a streaming framework like Pipecat, LiveKit Agents, Vocode, or a managed platform's internal stack.

A voice agent chains several stages. Voice activity detection ([[vad]]) detects when the user is speaking, [[streaming-stt]] converts speech to text incrementally, the LLM reasons and generates a response (often calling tools), [[neural-tts]] streams that response back as audio, and audio playback handles the actual sound output. Each stage has streaming and non-streaming variants; sub-second latency requires streaming at every stage, because a single non-streaming stage has to wait for its complete input before it produces any output. Frameworks like Pipecat express the chain as composable 'frames' flowing between processors, LiveKit Agents uses pluggable nodes, and managed platforms (Vapi, Retell) hide the pipeline behind a dashboard. Production tuning is a trade-off: more streaming lowers latency but adds complexity, and more per-stage logic can improve quality compared with simple chaining.
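The chain described above can be sketched with plain Python async generators. This is a minimal illustration under stated assumptions, not Pipecat or LiveKit Agents code: every stage function (`mic`, `vad`, `stt`, `llm`, `tts`, `run_pipeline`) is a hypothetical stand-in, audio chunks are modelled as strings with an empty string standing for silence, and the STT stand-in emits only a final transcript per turn for brevity, whereas a real streaming STT would also emit partial results.

```python
import asyncio
from typing import AsyncIterator, Iterable

# Hypothetical marker a VAD stage emits when the user stops speaking.
END_OF_TURN = object()


async def mic(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Stand-in microphone: yields audio chunks; '' represents silence."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield chunk


async def vad(audio: AsyncIterator[str]) -> AsyncIterator[object]:
    """Forward speech chunks and emit END_OF_TURN on the first silence after speech."""
    speaking = False
    async for chunk in audio:
        if chunk:
            speaking = True
            yield chunk
        elif speaking:
            speaking = False
            yield END_OF_TURN


async def stt(speech: AsyncIterator[object]) -> AsyncIterator[str]:
    """Stand-in STT: collects words and emits one transcript per user turn."""
    words: list[str] = []
    async for item in speech:
        if item is END_OF_TURN:
            yield " ".join(words)
            words = []
        else:
            words.append(str(item))


async def llm(transcripts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in LLM: streams a canned reply token by token."""
    async for text in transcripts:
        for token in f"You said: {text}".split():
            yield token


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in TTS: turns each token into an 'audio frame' as soon as it arrives."""
    async for token in tokens:
        yield f"<audio:{token}>"


async def run_pipeline(chunks: Iterable[str]) -> list[str]:
    """Compose VAD -> STT -> LLM -> TTS and collect the frames that would be played."""
    played: list[str] = []
    async for frame in tts(llm(stt(vad(mic(chunks))))):
        played.append(frame)  # playback stand-in
    return played


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["hello", "there", "", ""])))
```

Because each stage consumes an async iterator and yields as soon as it has something, the TTS stand-in starts producing frames from the first LLM token instead of waiting for the whole reply. Real frameworks add what this sketch leaves out, such as interruption handling, buffering, and transport.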

When to use a voice pipeline

Common mistakes

FAQ

When should I use a voice pipeline?

Use one when you are building voice agents from scratch.

What are the most common mistakes with a voice pipeline?

Using non-streaming TTS, which adds a full sentence of latency before playback starts. Skipping VAD, which lets the model talk over the user.
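Both mistakes can be illustrated with a small sketch. It rests on loudly simplified assumptions: `time_to_first_audio_ms` is a toy latency model with a fixed, hypothetical synthesis cost per token, and `is_speech` is a basic energy-threshold check standing in for a real VAD, which in production is usually a trained model. All function names and the default threshold are illustrative, not taken from any framework.

```python
import math
from typing import Sequence


def time_to_first_audio_ms(
    token_arrivals_ms: Sequence[float],
    synth_ms_per_token: float,
    streaming: bool,
) -> float:
    """Toy model of when playback can start for one sentence.

    Streaming TTS synthesizes the first token as soon as it arrives.
    Non-streaming TTS waits for the last token, then synthesizes everything.
    """
    if not token_arrivals_ms:
        raise ValueError("need at least one token arrival time")
    if streaming:
        return token_arrivals_ms[0] + synth_ms_per_token
    return token_arrivals_ms[-1] + synth_ms_per_token * len(token_arrivals_ms)


def is_speech(frame: Sequence[float], threshold: float = 0.1) -> bool:
    """Energy-based VAD stand-in: a frame counts as speech if its RMS exceeds the threshold."""
    if not frame:
        return False
    rms = math.sqrt(sum(sample * sample for sample in frame) / len(frame))
    return rms > threshold


def should_stop_playback(agent_speaking: bool, mic_frame: Sequence[float]) -> bool:
    """Barge-in rule: stop the agent's audio when the user starts talking over it."""
    return agent_speaking and is_speech(mic_frame)


if __name__ == "__main__":
    arrivals = [200, 250, 300, 350, 400]  # hypothetical LLM token arrival times in ms
    print(time_to_first_audio_ms(arrivals, 30, streaming=True))   # 230
    print(time_to_first_audio_ms(arrivals, 30, streaming=False))  # 550
    print(should_stop_playback(True, [0.5, -0.5, 0.5, -0.5]))     # True
```

With these made-up numbers the non-streaming path starts playback 320 ms later than the streaming path, and the gap grows with sentence length. Without a check like `should_stop_playback`, the agent has no signal that the user has started speaking and keeps playing audio.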

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/voice-pipeline.md.