Voice activity detection (VAD)
Voice activity detection is the lightweight signal-processing step that determines whether incoming audio contains speech — used to start STT, trigger barge-in, and gate microphone use in voice agents.
VAD runs continuously on the user's microphone audio, classifying each ~20 ms frame as speech or non-speech. Voice agents use that signal to start STT when the user begins talking, end STT when they stop (after a configurable silence threshold), trigger barge-in mid-response, and avoid sending silent audio to expensive STT APIs. Common VAD implementations (Silero VAD, WebRTC VAD, Picovoice Cobra) are lightweight enough to run on-device, processing each frame in single-digit milliseconds. Best practice: pair VAD with the STT model's own end-of-utterance detection for robust turn-taking.
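The frame-by-frame flow described above can be sketched in a few lines. This is a minimal illustration, not how Silero VAD, WebRTC VAD, or Cobra work internally: the energy threshold in `is_speech` is a toy stand-in for a trained model, and the names `frame_rms`, `is_speech`, and `find_utterances`, along with the default values, are assumptions made for the example.

```python
import math
from typing import Iterable, List, Sequence, Tuple

FRAME_MS = 20  # frame length taken from the text (~20 ms)


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: Sequence[int], rms_threshold: float = 500.0) -> bool:
    """Toy classifier: speech if the frame's RMS exceeds a threshold.

    Assumption: a real system would call a trained VAD model here.
    """
    return frame_rms(frame) > rms_threshold


def find_utterances(flags: Iterable[bool], silence_ms: int = 500,
                    frame_ms: int = FRAME_MS) -> List[Tuple[int, int]]:
    """Turn per-frame speech flags into (start_ms, end_ms) utterances.

    An utterance starts at the first speech frame and is closed once
    `silence_ms` of consecutive non-speech frames has been observed.
    """
    max_silent_frames = silence_ms // frame_ms
    utterances: List[Tuple[int, int]] = []
    start = None        # index of the first speech frame in this utterance
    last_speech = None  # index of the most recent speech frame
    silent_run = 0      # consecutive non-speech frames since last speech
    for i, speech in enumerate(flags):
        if speech:
            if start is None:
                start = i
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= max_silent_frames:
                utterances.append((start * frame_ms, (last_speech + 1) * frame_ms))
                start, silent_run = None, 0
    if start is not None:  # stream ended while the user was still talking
        utterances.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return utterances
```

In a live agent the same state machine would run incrementally: the first speech frame opens the STT stream, and the silence threshold closes it. A short pause inside an utterance (shorter than `silence_ms`) does not end it.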
When to use voice activity detection (VAD)
- Any realtime voice agent.
- Battery-sensitive on-device voice apps.
Common mistakes
- Tuning VAD too aggressively, which clips the start of user speech.
- Treating VAD as a hard signal instead of pairing it with model-level end-of-utterance detection.
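Both mistakes have common mitigations, sketched below under stated assumptions. `with_pre_roll` keeps a small buffer of recent non-speech frames and forwards them when speech starts, so a late VAD trigger does not cut off the first syllable. `should_end_turn` treats VAD silence as one input alongside an end-of-utterance score. The function names, the `eou_probability` input (assumed to come from the STT or a turn-detection model), and all numeric defaults are hypothetical and would need tuning in a real system.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def with_pre_roll(frames: Iterable[Tuple[bytes, bool]],
                  pre_roll_frames: int = 10) -> Iterator[bytes]:
    """Yield the frames to forward to STT, including a short pre-roll.

    `frames` yields (audio, is_speech) pairs. Non-speech frames are held
    in a bounded buffer; when a speech frame arrives, the buffered frames
    are emitted first so the onset of the word is preserved.
    """
    buffer: Deque[bytes] = deque(maxlen=pre_roll_frames)
    for audio, speech in frames:
        if speech:
            while buffer:
                yield buffer.popleft()
            yield audio
        else:
            buffer.append(audio)  # oldest frame is dropped when full


def should_end_turn(silence_ms: int, eou_probability: float,
                    min_silence_ms: int = 300, max_silence_ms: int = 1500,
                    eou_threshold: float = 0.7) -> bool:
    """Combine VAD silence with a model-level end-of-utterance score.

    Illustrative policy: a long silence always ends the turn; a short
    silence ends it only if the model also thinks the user is done.
    """
    if silence_ms >= max_silence_ms:
        return True
    return silence_ms >= min_silence_ms and eou_probability >= eou_threshold
```

With this policy, a 400 ms pause after "I'd like to book a flight to..." would not end the turn if the model scores the utterance as incomplete, while the same pause after a finished sentence would.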
FAQ
What is voice activity detection (VAD)?
Voice activity detection is the lightweight signal-processing step that determines whether incoming audio contains speech — used to start STT, trigger barge-in, and gate microphone use in voice agents.
When should I use voice activity detection (VAD)?
Use it in any realtime voice agent and in battery-sensitive on-device voice apps.
What are the most common mistakes with voice activity detection (VAD)?
Tuning VAD too aggressively, which clips the start of user speech, and treating VAD as a hard signal instead of pairing it with model-level end-of-utterance detection.
Related terms
- Voice (LLM apps) — Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.
- Streaming STT — Streaming STT (speech-to-text) emits partial transcriptions as the user speaks — instead of waiting for end-of-utterance — enabling sub-second response from a voice assistant.
- Barge-in — Barge-in is the voice-agent feature where the user can interrupt the assistant mid-response — the assistant detects the speech and stops talking — making conversations feel natural instead of robotic turn-taking.
- AI agent — An AI agent is a system where a language model autonomously plans and executes a sequence of tool calls to accomplish a goal.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/vad.md.