
Voice activity detection (VAD)

Voice activity detection (VAD) is a lightweight signal-processing step that determines whether incoming audio contains speech. Voice agents use it to start speech-to-text (STT), trigger barge-in, and gate microphone use.

VAD runs continuously on the user's microphone audio, classifying each ~20 ms frame as speech or non-speech. Voice agents use that signal to start STT when the user begins talking, end STT when they stop (after a configurable silence threshold), trigger barge-in mid-response, and avoid sending silent audio to expensive STT APIs. Modern VAD engines (Silero VAD, WebRTC VAD, Picovoice Cobra) are extremely lightweight: they take single-digit milliseconds per frame and can run on-device. Best practice: pair VAD with the STT model's own end-of-utterance detection for robust turn-taking.
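The frame-by-frame flow can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how Silero VAD, WebRTC VAD, or Picovoice Cobra work internally: the RMS energy check is a crude stand-in for a real VAD model, and the names `is_speech_frame` and `segment_speech`, along with the thresholds (RMS of 500, 3 start frames, 500 ms of silence), are illustrative choices.

```python
import math
from typing import Iterable, List, Sequence, Tuple

FRAME_MS = 20  # frame duration, taken from the ~20 ms figure in the text


def is_speech_frame(samples: Sequence[int], rms_threshold: float = 500.0) -> bool:
    """Crude energy check standing in for a real VAD model (assumption)."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= rms_threshold


def segment_speech(
    flags: Iterable[bool],
    start_frames: int = 3,   # hypothetical: speech frames needed to open a turn
    silence_ms: int = 500,   # hypothetical: silence needed to close a turn
) -> List[Tuple[str, int]]:
    """Turn per-frame speech flags into ("start", ms) / ("end", ms) events."""
    silence_frames = silence_ms // FRAME_MS
    events: List[Tuple[str, int]] = []
    in_speech = False
    speech_run = 0
    silence_run = 0
    for i, is_speech in enumerate(flags):
        if is_speech:
            speech_run += 1
            silence_run = 0
        else:
            silence_run += 1
            speech_run = 0
        if not in_speech and speech_run >= start_frames:
            in_speech = True
            # Report the start at the first frame of the speech run.
            events.append(("start", (i - start_frames + 1) * FRAME_MS))
        elif in_speech and silence_run >= silence_frames:
            in_speech = False
            # Report the end at the first frame of the silent run.
            events.append(("end", (i - silence_frames + 1) * FRAME_MS))
    return events


# 100 ms of silence, 200 ms of speech, then 600 ms of silence.
flags = [False] * 5 + [True] * 10 + [False] * 30
print(segment_speech(flags))
```

In a voice agent, the "start" event would open the STT stream (or trigger barge-in if the agent is speaking) and the "end" event would close it. Raising `silence_ms` tolerates longer pauses at the cost of slower turn-taking.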

When to use voice activity detection (VAD)

Common mistakes

FAQ

What is voice activity detection (VAD)?

Voice activity detection (VAD) is a lightweight signal-processing step that determines whether incoming audio contains speech. Voice agents use it to start speech-to-text (STT), trigger barge-in, and gate microphone use.

When should I use voice activity detection (VAD)?

Use it in any realtime voice agent and in battery-sensitive on-device voice apps.

What are the most common mistakes with voice activity detection (VAD)?

Tuning VAD too aggressively, which clips the start of user speech. Treating VAD as a hard signal; instead, pair it with model-level end-of-utterance detection.
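One common mitigation for clipped speech onsets, not described in this entry, is a pre-roll buffer: keep the last few non-speech frames and forward them to STT as soon as speech is detected. The sketch below assumes per-frame speech flags are already available; `with_preroll` and the `preroll_frames` default are hypothetical names and values chosen for illustration.

```python
from collections import deque
from typing import Deque, Iterable, List, TypeVar

Frame = TypeVar("Frame")


def with_preroll(
    frames: Iterable[Frame],
    flags: Iterable[bool],
    preroll_frames: int = 10,  # hypothetical: 10 frames = 200 ms at 20 ms per frame
) -> List[Frame]:
    """Return the frames to forward to STT, including pre-roll before each speech run."""
    buffer: Deque[Frame] = deque(maxlen=preroll_frames)  # holds recent non-speech frames
    forwarded: List[Frame] = []
    for frame, is_speech in zip(frames, flags):
        if is_speech:
            # Flush buffered frames first so the speech onset is not clipped.
            forwarded.extend(buffer)
            buffer.clear()
            forwarded.append(frame)
        else:
            # Not speech yet: remember the frame in case speech starts soon.
            buffer.append(frame)
    return forwarded


# Frames are stand-in integers here; speech is flagged on frames 5 to 7.
frames = list(range(10))
flags = [False] * 5 + [True] * 3 + [False] * 2
print(with_preroll(frames, flags, preroll_frames=2))
```

With a pre-roll buffer in place, the VAD threshold can stay strict enough to reject noise while the first syllable of each utterance still reaches the STT model.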

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/vad.md.