Tokenization

Also known as: tokenizer, tokens, subword tokenization, BPE


What is Tokenization?

Tokenization is the process of breaking text into smaller units called tokens that a language model can process. Tokens are not necessarily whole words — they can be subwords, individual characters, or even byte sequences, depending on the tokenizer. For example, the word “unbelievable” might be split into “un”, “believ”, and “able.” Most modern LLMs use subword tokenization algorithms like Byte-Pair Encoding (BPE), which balance vocabulary size against the ability to represent any text, including rare words, code, and non-English languages.
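The merge step at the heart of BPE can be illustrated in a few lines: starting from individual characters, repeatedly find the most frequent adjacent pair of symbols and fuse it into a new symbol. This is a toy sketch of the training-time merge loop, not any production tokenizer (real implementations work over a large corpus, handle bytes rather than characters, and store the learned merge table):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters and apply a few BPE merges.
tokens = list("unbelievable unbelievable unbearable")
for _ in range(6):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a handful of merges, frequent fragments like "un" or "le" become single tokens while rare sequences stay split into smaller pieces; concatenating the tokens always reconstructs the original text exactly.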

How Tokenization Affects AI Systems

Tokenization has direct practical implications for anyone working with LLMs. Context windows are measured in tokens, not words — a 200,000-token context window holds roughly 150,000 English words, but fewer words' worth of code or non-Latin scripts. API pricing is per-token for both input and output. The tokenizer determines what the model “sees”: languages and scripts that tokenize inefficiently (requiring more tokens per word) are inherently more expensive to process and may receive lower-quality responses because less semantic content fits in the same context window. Even within English, unusual formatting, code syntax, or specialized terminology may tokenize less efficiently than common prose.

Why Tokenization Matters for Practitioners

Understanding tokenization helps practitioners make better decisions about cost, context management, and system design. Knowing that a document consumes 50,000 tokens versus 10,000 tokens directly impacts whether it fits in context, how much an API call costs, and how much room remains for instructions and reasoning. Tools like tokenizer playgrounds (available from OpenAI, Anthropic, and others) let you inspect exactly how text is tokenized. When building production systems, token counting is essential for managing context windows, estimating costs, chunking documents for RAG, and setting appropriate limits on user inputs.
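One of the production tasks mentioned above — chunking documents for RAG under a token budget — can be sketched as a greedy packing loop. The 4-characters-per-token heuristic below is a rough rule of thumb for English prose; a real system would count tokens with the target model's own tokenizer:

```python
def approx_token_count(text: str) -> int:
    """Crude heuristic: ~4 characters per token for English prose.
    A real system would use the model's own tokenizer instead."""
    return max(1, len(text) // 4)

def chunk_by_tokens(text: str, max_tokens: int = 500):
    """Greedily pack paragraphs into chunks that stay under a token budget."""
    chunks, current, current_tokens = [], [], 0
    for para in text.split("\n\n"):
        t = approx_token_count(para)
        if current and current_tokens + t > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Ten ~128-token paragraphs packed into chunks of at most 300 tokens each:
doc = "\n\n".join(f"Paragraph {i}: " + "word " * 100 for i in range(10))
chunks = chunk_by_tokens(doc, max_tokens=300)
print(len(chunks))
```

Chunking on paragraph boundaries keeps semantic units intact; splitting mid-sentence purely by token count tends to hurt retrieval quality.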

Related Terms

  • Pre-training - Tokenization is the first step in processing training data