# Audio tokens

**Source:** https://promtable.com/glossary/audio-tokens

> Audio tokens are the discrete units LLMs use to represent audio in multimodal models: input speech is encoded into audio tokens, the model processes them, and output audio tokens are decoded back into speech. They are the new pricing dimension in 2026 Realtime APIs.

---

Text LLMs use roughly 1 token per 4 characters. Audio-capable LLMs such as GPT-4o and Gemini tokenize audio instead: they encode raw audio into discrete tokens (for example, 12.5-25 tokens per second of speech), process those tokens alongside text, and decode output tokens back into audio.

Pricing implications: audio tokens cost materially more than text tokens (typically 5-50× per token), but a minute of speech is only ~750-1,500 audio tokens, which at ~4 characters per token matches the token count of roughly 3,000-6,000 characters of text.

Production gotchas:

- The audio token budget for long sessions adds up fast.
- Bilingual and dialect coverage varies by model.
- Voice cloning into audio token space is harder than text-to-speech.

Audio tokenization is the technical foundation that unlocked Realtime APIs in 2024-2025.
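The per-minute figure above is simply the tokenizer rate multiplied by the clip duration. A minimal sketch of that arithmetic follows; the function name `audio_token_count` is illustrative, and the two rates are the example range quoted above rather than the published rate of any specific model.

```python
def audio_token_count(duration_seconds: float, tokens_per_second: float) -> float:
    """Estimate audio tokens for a clip, assuming a fixed tokenizer rate."""
    if duration_seconds < 0 or tokens_per_second <= 0:
        raise ValueError("duration must be >= 0 and rate must be > 0")
    return duration_seconds * tokens_per_second


# Example rates taken from the text above; real tokenizers may differ.
LOW_RATE = 12.5
HIGH_RATE = 25.0

# One minute of speech at the low and high rate.
one_minute = (audio_token_count(60, LOW_RATE), audio_token_count(60, HIGH_RATE))

# A 30-minute session, to show how long sessions add up.
half_hour = (audio_token_count(30 * 60, LOW_RATE), audio_token_count(30 * 60, HIGH_RATE))

print(one_minute)  # (750.0, 1500.0)
print(half_hour)   # (22500.0, 45000.0)
```

Under these assumptions a half-hour session already lands in the tens of thousands of audio tokens, which is why session length matters more than per-utterance size when budgeting.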

## When to use

- Pricing realtime voice apps.
- Multimodal audio + text reasoning.

## Common mistakes

- Pricing voice apps off text token rates, which undershoots the real bill by 5-10× or more.
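To make the mistake concrete, the sketch below prices the same session twice: once at a text-token rate and once at an audio-token rate. Every number in it is a placeholder chosen for illustration (the $2 per million text tokens and the 10× audio premium are hypothetical, not any vendor's price list), and `estimate_session` is an invented helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionEstimate:
    audio_tokens: float
    naive_cost_usd: float  # priced at the text-token rate (the mistake)
    audio_cost_usd: float  # priced at the assumed audio-token rate


def estimate_session(
    minutes: float,
    tokens_per_second: float,
    text_usd_per_million: float,
    audio_multiplier: float,
) -> SessionEstimate:
    """Compare a text-rate estimate with an audio-rate estimate for one session."""
    tokens = minutes * 60 * tokens_per_second
    audio_usd_per_million = text_usd_per_million * audio_multiplier
    naive = tokens * text_usd_per_million / 1_000_000
    actual = tokens * audio_usd_per_million / 1_000_000
    return SessionEstimate(tokens, naive, actual)


# Hypothetical inputs: 10-minute call, 25 tokens/s,
# $2 per 1M text tokens, 10x per-token audio premium.
est = estimate_session(10, 25.0, 2.0, 10.0)
print(est.audio_tokens)    # 15000.0
print(est.naive_cost_usd)  # 0.03
print(est.audio_cost_usd)  # 0.3
```

Swapping in a provider's actual text and audio rates turns this into a quick sanity check before committing to a per-minute price for a voice product.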

## Related terms

- [realtime-api](https://promtable.com/glossary/realtime-api)
- [voice](https://promtable.com/glossary/voice)
- [multimodal](https://promtable.com/glossary/multimodal)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/audio-tokens
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/audio-tokens".
Contact: info@vibecodingturkey.com.