
Streaming STT

Streaming STT (speech-to-text) emits partial transcriptions as the user speaks — instead of waiting for end-of-utterance — enabling sub-second response from a voice assistant.

Production voice agents in 2026 generally depend on streaming STT. Models and services such as Deepgram Nova-3, AssemblyAI Universal-Streaming, and streaming deployments of OpenAI's Whisper large-v3 emit partial transcripts roughly every 100-200 ms. The agent can use that partial text to begin retrieval or classification while the user is still speaking, and it can start the LLM call as soon as VAD (voice activity detection) signals end-of-utterance with high confidence. End-to-end latency targets for voice agents in 2026 are under 800 ms from end-of-speech to start-of-agent-speech. Batch STT makes that budget very hard to meet, because transcription cannot begin until the whole utterance has been captured.
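
The flow described above (act on partial transcripts early, then commit to the LLM call once end-of-utterance is signalled) can be sketched as follows. This is a minimal illustration, not any vendor's API: `TranscriptEvent`, its `is_final` and `speech_final` fields, `AgentTurn`, `handle_stream`, and the `min_prefetch_words` threshold are all hypothetical names chosen for the example, and the events are simulated in memory rather than read from a real socket.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class TranscriptEvent:
    """One message from a hypothetical streaming STT service."""
    text: str                   # transcript so far for the current utterance
    is_final: bool              # True once the text will no longer be revised
    speech_final: bool = False  # True when VAD/endpointing says the user stopped


@dataclass
class AgentTurn:
    """Records what the agent did while listening to one utterance."""
    prefetch_queries: List[str] = field(default_factory=list)
    llm_prompt: Optional[str] = None


def handle_stream(events: Iterable[TranscriptEvent],
                  min_prefetch_words: int = 3) -> AgentTurn:
    """Start retrieval on partials; commit to the LLM call at end-of-utterance."""
    turn = AgentTurn()
    for event in events:
        if event.speech_final:
            # End-of-utterance: this is the text the LLM call would use.
            turn.llm_prompt = event.text
            break
        long_enough = len(event.text.split()) >= min_prefetch_words
        if not event.is_final and long_enough:
            # Partial transcript: cheap, speculative work only (e.g. retrieval).
            if not turn.prefetch_queries or turn.prefetch_queries[-1] != event.text:
                turn.prefetch_queries.append(event.text)
    return turn


# Simulated partials arriving while the user speaks.
simulated = [
    TranscriptEvent("what", is_final=False),
    TranscriptEvent("what is the", is_final=False),
    TranscriptEvent("what is the weather", is_final=False),
    TranscriptEvent("what is the weather today", is_final=True, speech_final=True),
]
turn = handle_stream(simulated)
```

In a real agent the prefetch step would fire an asynchronous retrieval or intent-classification request, and its result would be discarded if a later partial changed the meaning. The point of the sketch is only the ordering: speculative work on partials, committed work on end-of-utterance.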

When to use streaming STT

Realtime voice agents, live captioning, and interactive voice UX.

Common mistakes

Using batch STT on a realtime path, which adds latency immediately. Optimizing only for latency and ignoring word-error rate, even though transcript quality matters too.

FAQ

What is streaming STT?

Streaming STT (speech-to-text) emits partial transcriptions as the user speaks — instead of waiting for end-of-utterance — enabling sub-second response from a voice assistant.

When should I use streaming STT?

Use it for realtime voice agents, live captioning, and any interactive voice UX where the user expects a response while speaking or immediately afterwards.

What are the most common mistakes with streaming STT?

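Using batch STT for a realtime use case: the agent pays a latency penalty immediately, because nothing is transcribed until the utterance ends. Chasing low latency while ignoring word-error rate (WER): a fast transcript that is wrong still produces a wrong response, so quality matters too.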
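
To keep quality visible alongside latency, word-error rate can be tracked on a small set of reference utterances. The function below is a minimal sketch of the standard definition (word-level substitutions, deletions, and insertions divided by the number of reference words); the name `word_error_rate` and the simple lowercase-and-split normalization are assumptions for the example, and production evaluations usually normalize punctuation and numerals more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the") and one substitution ("lights" -> "light") over 5 words.
wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
```

Comparing this number for the final transcripts of a streaming model against a batch baseline on the same audio shows how much accuracy, if any, is being traded for lower latency.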

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/streaming-stt.md.