Audio tokens
Audio tokens are the discrete units that multimodal LLMs use to represent audio: input speech is encoded into audio tokens, the model processes them, and output audio tokens are decoded back into speech. They are the new pricing dimension in 2026 Realtime APIs.
Text LLMs use roughly 1 token per 4 characters. Audio-capable LLMs (GPT-4o, Gemini, and similar models) tokenize audio instead: they encode raw audio into discrete tokens (e.g., 12.5-25 tokens per second of speech, though the exact rate varies by model), process them alongside text tokens, and decode output tokens back to audio.
Pricing implications: audio tokens cost materially more than text tokens (typically 5-50× per token). At 12.5-25 tokens per second, a minute of speech is ~750-1500 audio tokens, roughly the token count of one to two pages of text and typically several times the token count of that minute's transcript.
Production gotchas: audio token budgets for long sessions add up fast; bilingual and dialect coverage varies by model; and voice cloning in audio-token space is harder than conventional text-to-speech.
Audio tokenization is the technical foundation that unlocked Realtime APIs in 2024-2025.
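The arithmetic above can be sketched in a few lines of Python. This is a minimal estimate, not any provider's billing formula: the `AudioPricing` type, both function names, and the price used in the usage note are hypothetical placeholders, and the tokens-per-second values are simply the 12.5-25 range quoted above. Substitute the rate and price your provider actually publishes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioPricing:
    """Hypothetical pricing inputs; replace with your provider's published numbers."""
    tokens_per_second: float       # assumed tokenization rate, e.g. 12.5-25
    usd_per_million_tokens: float  # placeholder price, not a real quote


def audio_tokens(seconds: float, tokens_per_second: float) -> int:
    """Estimate the audio tokens produced by a clip, rounding up to stay conservative."""
    if seconds < 0 or tokens_per_second <= 0:
        raise ValueError("seconds must be >= 0 and tokens_per_second must be > 0")
    return math.ceil(seconds * tokens_per_second)


def audio_cost_usd(seconds: float, pricing: AudioPricing) -> float:
    """Estimate the cost of a stretch of audio under the assumed rate and price."""
    tokens = audio_tokens(seconds, pricing.tokens_per_second)
    return tokens * pricing.usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # One minute of speech at both ends of the quoted range.
    print(audio_tokens(60, 12.5), audio_tokens(60, 25))

    # A 30-minute session at the high rate and a made-up price of $40 per 1M tokens.
    session = AudioPricing(tokens_per_second=25, usd_per_million_tokens=40.0)
    print(f"${audio_cost_usd(30 * 60, session):.2f}")
```

The estimate covers one direction of audio only. A real session bills input and output audio separately, often at different prices, so run the calculation once per direction and add the results.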
When to use audio tokens
- Pricing realtime voice apps.
- Multimodal audio + text reasoning.
Common mistakes
- Pricing voice apps from text token costs, which undershoots the real bill by 5-10×.
FAQ
What are audio tokens?
Audio tokens are the discrete units that multimodal LLMs use to represent audio: input speech is encoded into audio tokens, the model processes them, and output audio tokens are decoded back into speech. They are the new pricing dimension in 2026 Realtime APIs.
When should I use audio tokens?
Audio tokens matter when you are pricing realtime voice apps or building multimodal audio + text reasoning.
What are the most common mistakes with audio tokens?
The most common mistake is pricing voice apps from text token costs, which undershoots the real bill by 5-10×.
Related terms
- Realtime API — A Realtime API is the WebSocket / WebRTC-based LLM endpoint that supports streaming audio in + audio out for natural duplex conversation — OpenAI Realtime API, Gemini Live, ElevenLabs Conversational, Cartesia Sonic are 2026 leaders.
- Voice (LLM apps) — Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.
- Multimodal model — A multimodal model accepts more than one input type — text plus images, audio, or video — and reasons across them in a single forward pass.
Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/audio-tokens.md.