Voice dictation
Voice dictation is the AI-augmented form of speech-to-text: hold a hotkey, speak, and the tool transcribes the audio, cleans up the text with an LLM, and inserts the result. Wispr Flow, Superwhisper, MacWhisper, and BetterDictation are among the leading tools in 2026, used in place of the dictation built into macOS and Windows.
Pre-AI dictation was painful: low accuracy, awkward formatting (speaking 'comma' and 'new line' aloud), and no domain awareness. Modern voice dictation pairs Whisper-class STT with LLM post-processing. You speak naturally, and the LLM adds formatting (punctuation, paragraphs, lists), expands abbreviations, removes filler words, and can even run spoken commands ('translate this and email it').
Typical production uses: developers writing commit messages and comments, knowledge workers drafting emails and docs, and creators writing rough scripts. Trade-offs: cloud-based tools need a network connection and require trusting the provider with your audio, while local tools need a more powerful Mac. Latency target: under 2 s from the end of speech to inserted text.
By 2026, voice dictation is mainstream productivity tooling rather than an accessibility niche.
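The transcribe, clean up, insert flow and the 2 s latency target can be sketched roughly as follows. This is a minimal illustration, not how any of the named tools work: `fake_transcribe`, `rule_based_cleanup`, `dictate`, and the filler list are hypothetical names, the STT step is a canned stub, and a simple regex stands in for the LLM post-processing.

```python
import re
import time
from typing import Callable

# Hypothetical filler list; a real tool would rely on an LLM, not a fixed regex.
FILLERS = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for a Whisper-class STT call; returns a canned raw transcript."""
    return "um so the deploy failed uh because the config was missing"


def rule_based_cleanup(raw: str) -> str:
    """Stand-in for LLM post-processing: drop fillers, capitalize, end with a period."""
    text = FILLERS.sub("", raw)
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        text += "."
    return text


def dictate(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    cleanup: Callable[[str], str],
    insert: Callable[[str], None],
    budget_s: float = 2.0,  # latency target from the text above
) -> tuple[str, float, bool]:
    """Run STT, cleanup, and insertion; report elapsed time against the budget."""
    start = time.perf_counter()
    text = cleanup(transcribe(audio))
    insert(text)
    elapsed = time.perf_counter() - start
    return text, elapsed, elapsed <= budget_s


if __name__ == "__main__":
    inserted: list[str] = []
    text, elapsed, within_budget = dictate(
        b"", fake_transcribe, rule_based_cleanup, inserted.append
    )
    print(text)
    print(f"within budget: {within_budget}")
```

Passing the three stages in as callables keeps the sketch independent of any particular STT engine or LLM, so a cloud or local backend could be swapped in without changing the timing logic.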
When to use voice dictation
- Drafting long-form text faster than typing.
- Mobile / on-the-go capture.
- Accessibility (RSI, etc.).
Common mistakes
- Treating dictation output as final: review it for hallucinated proper nouns and numbers.
- Speaking too fast or too quietly: quality collapses on low-SNR audio.
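One way to make the review step for proper nouns and numbers less error-prone is to surface those tokens before accepting the text. The sketch below is a rough heuristic under stated assumptions: `review_flags` is a hypothetical helper, numbers are matched by a simple regex, and "proper noun" is approximated as a capitalized word that does not start a sentence, so names at the start of a sentence are missed.

```python
import re

# Digits with optional separators, e.g. 3:30 or 1,250 or 2.5
NUMBER = re.compile(r"\d+(?:[.,:]\d+)*")
# Any capitalized word; sentence-initial ones are filtered out below.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z'-]*")
SENTENCE_END = ".!?"


def review_flags(text: str) -> list[str]:
    """Return numbers, then likely proper nouns, that a human should double-check."""
    flags = [m.group() for m in NUMBER.finditer(text)]
    for m in CAPITALIZED.finditer(text):
        before = text[: m.start()].rstrip()
        # Skip the first word of the text, the first word of a sentence, and "I".
        if before and before[-1] not in SENTENCE_END and m.group() != "I":
            flags.append(m.group())
    return flags


if __name__ == "__main__":
    sample = "Send the invoice to Priya Raman by 3:30. Total is 1,250 dollars for Acme."
    print(review_flags(sample))
    # prints ['3:30', '1,250', 'Priya', 'Raman', 'Acme']
```

The flagged tokens are only candidates for a manual check; the heuristic cannot tell whether a name or number was actually transcribed correctly.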
Related terms
- Streaming STT — Streaming STT (speech-to-text) emits partial transcriptions as the user speaks — instead of waiting for end-of-utterance — enabling sub-second response from a voice assistant.
- Voice (LLM apps) — Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.
- Voice activity detection (VAD) — Voice activity detection is the lightweight signal-processing step that determines whether incoming audio contains speech — used to start STT, trigger barge-in, and gate microphone use in voice agents.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/voice-dictation.md.