# Voice activity detection (VAD)

**Source:** https://promtable.com/glossary/vad

> Voice activity detection is the lightweight signal-processing step that determines whether incoming audio contains speech — used to start STT, trigger barge-in, and gate microphone use in voice agents.

---

VAD runs continuously on the user's microphone audio, classifying each short frame (about 20 ms) as speech or non-speech. Voice agents use that signal to start STT when the user begins talking, end STT when they stop (after a configurable silence threshold), trigger barge-in mid-response, and avoid sending silent audio to expensive STT APIs. Modern VAD engines (Silero VAD, WebRTC VAD, Picovoice Cobra) are extremely lightweight: they take single-digit milliseconds per frame and can run on-device. Best practice: pair VAD with the STT model's own end-of-utterance detection for robust turn-taking.
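The frame-by-frame flow described above can be sketched in a few lines of Python. This is a minimal illustration, not how any of the named engines work internally: `frame_is_speech` is a toy energy threshold standing in for a trained VAD, and the names `segment_speech`, `start_frames`, and `end_silence_ms`, along with their default values, are assumptions made for this example.

```python
import math
from typing import Iterable, Iterator, List, Tuple

FRAME_MS = 20  # frame length taken from the description above


def frame_is_speech(samples: List[int], rms_threshold: float = 500.0) -> bool:
    """Toy classifier: a 16-bit PCM frame counts as speech if its RMS is high.

    A real system would call a trained VAD here instead of thresholding energy.
    """
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= rms_threshold


def segment_speech(
    flags: Iterable[bool],
    start_frames: int = 3,
    end_silence_ms: int = 300,
) -> Iterator[Tuple[int, int]]:
    """Yield (start_frame, end_frame) index pairs, end exclusive.

    Speech starts after `start_frames` consecutive speech frames and ends
    once `end_silence_ms` of continuous non-speech has been observed.
    """
    end_frames = max(1, end_silence_ms // FRAME_MS)
    in_speech = False
    run_speech = 0   # consecutive speech frames while idle
    run_silence = 0  # consecutive non-speech frames while in speech
    start = 0
    count = 0
    for i, is_speech in enumerate(flags):
        count = i + 1
        if not in_speech:
            run_speech = run_speech + 1 if is_speech else 0
            if run_speech >= start_frames:
                in_speech = True
                start = i - start_frames + 1
                run_silence = 0
        else:
            run_silence = 0 if is_speech else run_silence + 1
            if run_silence >= end_frames:
                # Close the segment at the first frame of the silent run.
                yield (start, i - end_frames + 1)
                in_speech = False
                run_speech = 0
    if in_speech:
        # Stream ended mid-utterance: close at the last speech frame.
        yield (start, count - run_silence)
```

Requiring a few consecutive speech frames before starting filters out clicks and short noise bursts, while the silence threshold keeps brief pauses inside one utterance. Both values trade latency against robustness and would need tuning per application.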

## When to use

- Any realtime voice agent.
- Battery-sensitive on-device voice apps.

## Common mistakes

- Tuning VAD too aggressively, which clips the start of user speech.
- Treating VAD as a hard signal instead of pairing it with model-level end-of-utterance detection.
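Both mistakes have common mitigations, sketched below under stated assumptions. A pre-roll buffer keeps the most recent audio so that frames captured just before the VAD fires can still be sent to STT, and a turn-end check treats VAD silence as one input alongside an end-of-utterance flag from the STT model. The names `PreRollBuffer` and `should_end_turn`, the `stt_end_of_utterance` flag, and all millisecond defaults are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, List


class PreRollBuffer:
    """Keep the last `pre_roll_ms` of audio so speech onsets are not clipped."""

    def __init__(self, pre_roll_ms: int = 200, frame_ms: int = 20) -> None:
        capacity = max(1, pre_roll_ms // frame_ms)
        self._frames: Deque[bytes] = deque(maxlen=capacity)

    def push(self, frame: bytes) -> None:
        # Called for every frame while idle; the oldest frame is dropped
        # automatically once the buffer is full.
        self._frames.append(frame)

    def drain(self) -> List[bytes]:
        # Called when VAD fires: return buffered frames to send to STT first.
        frames = list(self._frames)
        self._frames.clear()
        return frames


def should_end_turn(
    silence_ms: int,
    stt_end_of_utterance: bool,
    soft_silence_ms: int = 300,
    hard_silence_ms: int = 1500,
) -> bool:
    """Combine VAD silence with a (hypothetical) STT end-of-utterance flag."""
    if silence_ms >= hard_silence_ms:
        # Long silence: end the turn even without agreement from the model.
        return True
    # Short silence alone is not enough; the model must also agree.
    return stt_end_of_utterance and silence_ms >= soft_silence_ms
```

In this sketch a short pause ends the turn only when the STT model also reports that the utterance is complete, and the hard limit acts as a fallback so the agent does not wait indefinitely.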

## Related terms

- [voice](https://promtable.com/glossary/voice)
- [streaming-stt](https://promtable.com/glossary/streaming-stt)
- [barge-in](https://promtable.com/glossary/barge-in)
- [agent](https://promtable.com/glossary/agent)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/vad
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/vad".
Contact: info@vibecodingturkey.com.