Tokenizer
A tokenizer is the algorithm that splits text into the tokens a language model actually reads. In 2026 this is usually a subword scheme such as byte-pair encoding (BPE), run through a library such as SentencePiece or tiktoken.
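To make the definition concrete, here is a minimal sketch of BPE, assuming a character-level start and a tiny word list as the training corpus. The function names (`train_bpe`, `apply_merge`, `encode`) are illustrative and do not come from any library. Production tokenizers work on bytes, use far larger corpora, and add pre-tokenization rules that this toy version leaves out.

```python
from collections import Counter


def apply_merge(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Every word starts as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        merged_words: Counter = Counter()
        for symbols, freq in words.items():
            merged_words[tuple(apply_merge(list(symbols), best))] += freq
        words = merged_words
    return merges


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols
```

With the corpus `["abab", "abab", "ab"]` and two merges, the learned rules are `("a", "b")` followed by `("ab", "ab")`, so `encode("abab", merges)` yields the single token `["abab"]` while `encode("aba", merges)` yields `["ab", "a"]`. Text that resembles the training data compresses into fewer tokens, which is why counts differ across tokenizers and languages.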
Different models use different tokenizers (cl100k_base for GPT-4 and GPT-3.5 Turbo, o200k_base for GPT-4o and the o-series, custom tokenizers for Claude, Gemini, Llama). The same string can tokenize to wildly different counts across models, which directly affects price and context use. Languages other than English (Turkish, Arabic, Hindi, Chinese) usually tokenize less efficiently than English, sometimes 1.5–2× more tokens per word, which acts as a hidden cost multiplier on localised products. Use a per-model tokenizer when estimating cost, never a generic word-to-token ratio.
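The advice above can be sketched as a small cost estimator that takes the token counter as a parameter, so each model is measured with its own tokenizer. The helper names (`tiktoken_counter`, `estimate_cost`, `word_ratio_estimate`) are hypothetical. The price argument is a placeholder you would fill in from your provider's current price list. Only `tiktoken.get_encoding(...)` and `Encoding.encode(...)` are real tiktoken calls, and tiktoken covers OpenAI encodings only, so other vendors need their own counting method.

```python
from typing import Callable, Tuple

TokenCounter = Callable[[str], int]


def tiktoken_counter(encoding_name: str) -> TokenCounter:
    """Build a counter backed by tiktoken (third-party, imported lazily)."""
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding(encoding_name)  # e.g. "cl100k_base", "o200k_base"
    return lambda text: len(enc.encode(text))


def estimate_cost(
    text: str,
    count_tokens: TokenCounter,
    usd_per_million_tokens: float,
) -> Tuple[int, float]:
    """Return (token_count, estimated_usd) for `text` under one model's tokenizer."""
    n = count_tokens(text)
    return n, n * usd_per_million_tokens / 1_000_000


def word_ratio_estimate(text: str, tokens_per_word: float = 1.3) -> int:
    """The generic word-ratio heuristic, kept only to compare against real counts."""
    return round(len(text.split()) * tokens_per_word)
```

A typical call would be `estimate_cost(prompt, tiktoken_counter("o200k_base"), usd_per_million_tokens=price)`. Running the same prompt through `word_ratio_estimate` shows how far the heuristic drifts, especially for non-English text.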
Common mistakes
- Estimating cost with 'words × 1.3' instead of the real tokenizer.
- Ignoring tokenization differences when porting prompts across models.
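One way to catch the second mistake before it reaches production is to recount a prompt under every target model's tokenizer and compare each count with that model's token budget. The sketch below assumes you supply one counting function and one budget per model. The function name `check_prompt_budgets`, the model labels and the budgets are all hypothetical.

```python
from typing import Callable, Dict


def check_prompt_budgets(
    prompt: str,
    counters: Dict[str, Callable[[str], int]],
    budgets: Dict[str, int],
) -> Dict[str, dict]:
    """Count `prompt` with each model's tokenizer and flag budget overruns."""
    report = {}
    for model, count_tokens in counters.items():
        used = count_tokens(prompt)
        report[model] = {
            "tokens": used,
            "budget": budgets[model],
            "fits": used <= budgets[model],
        }
    return report
```

A prompt that fits comfortably under one model's tokenizer may exceed the budget under another, so this check is worth running whenever a prompt is ported.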
FAQ
What is a tokenizer?
A tokenizer is the algorithm that splits text into the tokens a language model actually reads. In 2026 this is usually a subword scheme such as byte-pair encoding (BPE), run through a library such as SentencePiece or tiktoken.
What are the most common mistakes with tokenizers?
Estimating cost with 'words × 1.3' instead of the real tokenizer. Ignoring tokenization differences when porting prompts across models.
Related terms
- Token — A token is the smallest unit a language model reads or writes — typically a sub-word fragment, with one English word averaging about 1.3 tokens.
- Context window — The context window is the maximum number of tokens — system prompt, conversation history, retrieved documents, and the response — that a language model can process in a single turn.
Last updated: 2026-06-01.