concept

Audio tokens

Audio tokens are the discrete units that multimodal LLMs use to represent audio: input speech is encoded into audio tokens, the model processes them, and the output audio tokens are decoded back into speech. They are also the new pricing dimension in 2026 Realtime APIs.

Text LLMs use roughly 1 token per 4 characters. Audio LLMs (GPT-4o, Gemini, Voicebox, Whisper-style) tokenize audio instead: they encode raw audio into discrete tokens (e.g., 12.5-25 tokens per second of speech), process those tokens alongside text, and decode output tokens back to audio.

Pricing implications: audio tokens cost materially more than text tokens (typically 5-50× per token). At 12.5-25 tokens per second, a minute of speech is only ~750-1500 audio tokens, about as many tokens as a page or two of text (roughly 3,000-6,000 characters at ~4 characters per token).

Production gotchas: the audio token budget for long sessions adds up fast; bilingual and dialect coverage varies by model; voice cloning into audio token space is harder than text-to-speech. Audio tokenization is the technical foundation that unlocked Realtime APIs in 2024-2025.
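The arithmetic above can be sketched in a few lines of Python. This is a rough estimator, not a billing tool: the token rates are the 12.5-25 tokens-per-second range quoted above, while `HYPOTHETICAL_AUDIO_PRICE` and the function names are assumptions made up for illustration. Substitute your provider's published rates before relying on any number it prints.

```python
# Token rates come from the range quoted in the text; real models differ.
TOKENS_PER_SECOND_LOW = 12.5
TOKENS_PER_SECOND_HIGH = 25.0


def audio_token_range(seconds: float) -> tuple[float, float]:
    """Return the (low, high) audio-token estimate for speech of this length."""
    return seconds * TOKENS_PER_SECOND_LOW, seconds * TOKENS_PER_SECOND_HIGH


def cost_usd(tokens: float, usd_per_million_tokens: float) -> float:
    """Convert a token count into dollars at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Placeholder price in USD per 1M audio tokens; NOT a real provider rate.
    HYPOTHETICAL_AUDIO_PRICE = 40.0

    low, high = audio_token_range(60)  # one minute of speech
    print(f"1 minute: {low:.0f} to {high:.0f} audio tokens")

    s_low, s_high = audio_token_range(30 * 60)  # one 30-minute session
    print(
        f"30 minutes: ${cost_usd(s_low, HYPOTHETICAL_AUDIO_PRICE):.2f} "
        f"to ${cost_usd(s_high, HYPOTHETICAL_AUDIO_PRICE):.2f}"
    )
```

Multiplying the per-session figure by expected sessions per day shows how quickly the budget for long sessions grows, even though a single minute looks cheap.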

When to use audio tokens

Common mistakes

FAQ

What are audio tokens?

Audio tokens are the discrete units that multimodal LLMs use to represent audio: input speech is encoded into audio tokens, the model processes them, and the output audio tokens are decoded back into speech. They are also the new pricing dimension in 2026 Realtime APIs.

When should I use audio tokens?

When you are pricing realtime voice apps, or building multimodal reasoning over audio and text.

What are the most common mistakes with audio tokens?

Pricing voice apps off text token cost, which undershoots the real bill by 5-10×.

Last updated: 2026-06-01. Raw markdown: https://promtable.com/glossary/audio-tokens.md.