# Streaming STT

**Source:** https://promtable.com/glossary/streaming-stt

> Streaming STT (speech-to-text) emits partial transcriptions as the user speaks — instead of waiting for end-of-utterance — enabling sub-second response from a voice assistant.

---

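Production voice agents in 2026 require streaming STT. Models such as Deepgram Nova-3 and AssemblyAI Universal-Streaming emit partial transcripts roughly every 100-200 ms; OpenAI's Whisper large-v3 is a batch model and streams only when wrapped in a chunking layer. While the user is still speaking, the agent can use the partial text to begin retrieval or classification, and it can start the LLM call as soon as VAD (voice activity detection) signals end-of-utterance with high confidence, rather than waiting for a separate transcription pass. End-to-end latency targets for voice agents in 2026 are under 800 ms from end-of-speech to start-of-agent-speech, which is not achievable without streaming STT because batch transcription begins only once the utterance has ended.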
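As a rough illustration of that flow, the sketch below consumes partial transcripts, starts a cheap classification step while the user is still talking, and issues the LLM call only on the final transcript. Everything here is hypothetical: `fake_stt_stream` stands in for a real vendor streaming client, `classify_intent` is a toy keyword matcher, and the "prefetch" and "llm_call" events are placeholders for real retrieval and model calls. It does not reflect any specific provider's API.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    is_final: bool  # True once the (hypothetical) STT/VAD marks end-of-utterance


async def fake_stt_stream(words, interval_s=0.01):
    """Stand-in for a streaming STT client: yields growing partials, then a final."""
    spoken = []
    for word in words:
        await asyncio.sleep(interval_s)  # simulates audio arriving over time
        spoken.append(word)
        yield Transcript(" ".join(spoken), is_final=False)
    yield Transcript(" ".join(spoken), is_final=True)


def classify_intent(partial_text):
    """Toy keyword classifier (assumption: real systems would use a model)."""
    return "billing" if "invoice" in partial_text.lower() else None


async def run_turn(stream):
    """Record what the agent would do, and on which transcript it would do it."""
    events = []
    intent = None
    async for t in stream:
        if not t.is_final:
            # Work on partials: classify once, then kick off retrieval early.
            if intent is None:
                intent = classify_intent(t.text)
                if intent is not None:
                    events.append(("prefetch", intent, t.text))
        else:
            # End-of-utterance: the LLM call starts with context already warm.
            events.append(("llm_call", intent, t.text))
    return events


def demo():
    words = "I have a question about my invoice from March".split()
    return asyncio.run(run_turn(fake_stt_stream(words)))
```

Calling `demo()` returns a prefetch event triggered at the word "invoice", two words before the user stops speaking, followed by the LLM call on the final transcript. With batch STT, neither step could begin until the whole utterance had been transcribed.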

## When to use

- Realtime voice agents.
- Live captioning.
- Interactive voice UX.

## Common mistakes

- Using batch STT for realtime use: transcription starts only after the utterance ends, so its full processing time is added to every turn.
- Ignoring word-error rate (WER) in pursuit of low latency: transcript quality matters too.
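
To keep quality visible alongside latency, word-error rate can be tracked on a small labelled sample. WER is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The function below is a minimal sketch of that definition using word-level edit distance; the lowercase-and-split normalization is a simplifying assumption, and production evaluations usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning ref[:i] with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("turn on the kitchen lights", "turn on kitchen light")` counts one deletion and one substitution against five reference words, giving 0.4.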

## Related terms

- [voice](https://promtable.com/glossary/voice)
- [agent](https://promtable.com/glossary/agent)
- [response-streaming](https://promtable.com/glossary/response-streaming)

*Last updated: 2026-06-01*

---

Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/streaming-stt".
Contact: info@vibecodingturkey.com.