# Tokenizer

**Source:** https://promtable.com/glossary/tokenizer

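> A tokenizer is the algorithm that splits text into the tokens a language model actually reads. In 2026 that typically means byte-pair encoding (BPE) or a related subword method, implemented by libraries such as SentencePiece or tiktoken.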

---

Different models use different tokenizers: cl100k_base for GPT-4 and GPT-3.5 Turbo, o200k_base for GPT-4o and the o-series, and custom tokenizers for Claude, Gemini, and Llama. The same string can produce very different token counts across models, which directly affects price and context-window use. Text in languages other than English (Turkish, Arabic, Hindi, Chinese) usually splits into more tokens, sometimes 1.5–2× as many per word as English, which acts as a hidden cost multiplier on localised products. When estimating cost, use the target model's own tokenizer, never a generic word-to-token ratio.
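To illustrate why the same string can yield different token counts, here is a minimal sketch of character-level BPE in pure Python. It is a toy, not how any production tokenizer is implemented: real tokenizers work on bytes, apply pre-tokenization rules, and learn tens of thousands of merges. The function names (`train_bpe`, `apply_merge`, `encode`) and both training corpora are invented for this example.

```python
from collections import Counter


def apply_merge(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules by repeatedly merging the most frequent adjacent pair."""
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        words = [apply_merge(w, best) for w in words]
    return merges


def encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split text on whitespace, then apply the learned merges in order."""
    tokens = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Two hypothetical "models" whose vocabularies were learned from different text.
vocab_a = train_bpe("the cat sat on the mat the cat", num_merges=6)
vocab_b = train_bpe("kedi minderde oturdu kedi yedi", num_merges=6)

text = "the cat sat"
print(len(encode(text, vocab_a)))  # 3 tokens: the, cat, sat
print(len(encode(text, vocab_b)))  # 9 tokens: no learned merge applies
```

The vocabulary trained on similar text covers each word with one token, while the other falls back to single characters. The same effect, at a much larger scale, is one reason text that is under-represented in a tokenizer's training data costs more tokens.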

## Common mistakes

- Estimating cost with 'words × 1.3' instead of the real tokenizer.
- Ignoring tokenization differences when porting prompts across models.
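The first mistake above can be made concrete with a small sketch that compares the 'words × 1.3' shortcut against a real token counter. The helper names (`heuristic_tokens`, `heuristic_error`, `estimate_cost`) and the price used below are hypothetical; `count_tokens` is any callable that returns the token count from the target model's tokenizer.

```python
from typing import Callable

TokenCounter = Callable[[str], int]


def heuristic_tokens(text: str, ratio: float = 1.3) -> int:
    """The 'words x ratio' shortcut: a guess, not a measurement."""
    return round(len(text.split()) * ratio)


def heuristic_error(text: str, count_tokens: TokenCounter, ratio: float = 1.3) -> float:
    """Real count divided by the guess; above 1.0 means the shortcut under-counts."""
    guess = heuristic_tokens(text, ratio)
    if guess == 0:
        raise ValueError("text contains no whitespace-separated words")
    return count_tokens(text) / guess


def estimate_cost(text: str, count_tokens: TokenCounter, usd_per_million_tokens: float) -> float:
    """Input cost in USD, based on the tokenizer's count rather than a word ratio."""
    return count_tokens(text) * usd_per_million_tokens / 1_000_000
```

As a usage assumption: with the `tiktoken` package installed, `count_tokens` could be `lambda s: len(tiktoken.get_encoding("o200k_base").encode(s))` for models that use that encoding. For other model families, substitute whatever counting method the provider documents. Running `heuristic_error` on a sample of real prompts for each target model shows how far off the shortcut is before a prompt is ported.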

## Related terms

- [token](https://promtable.com/glossary/token)
- [context-window](https://promtable.com/glossary/context-window)

*Last updated: 2026-06-01*

---

Original page: https://promtable.com/glossary/tokenizer
Maintained by Promtable (https://promtable.com). Content: CC BY 4.0. Cite as "Promtable — https://promtable.com/glossary/tokenizer".
Contact: info@vibecodingturkey.com.