Tokenization

How AI systems split text into tokens and why that affects cost, context, and behavior.

Tokenization is the process of breaking text into the units a model actually reads. Those units are called tokens, and they are often smaller than full words. A model may split a long word into several tokens, treat punctuation as separate tokens, and break code or multilingual text at boundaries that differ from ordinary reading.
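To make this concrete, here is a minimal sketch of one common approach: greedy longest-match subword tokenization. The vocabulary below is invented purely for illustration; real models learn much larger vocabularies from data.

```python
# Toy greedy longest-match tokenizer: at each position, match the
# longest vocabulary entry, falling back to single characters.
# The vocabulary is invented for illustration, not a real model's.
VOCAB = {"token", "ization", "un", "believ", "able", " "}

def tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown text: fall back to one character
            i += 1
    return tokens

print(tokenize("tokenization", VOCAB))  # → ['token', 'ization']
```

With this toy vocabulary, "tokenization" becomes two tokens and "unbelievable" becomes three, which is the sense in which a long word can cost more than one token.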

Why Tokenization Matters

Many practical questions in modern AI depend on tokens rather than characters or pages. Usage limits, context size, latency, and price are often measured in tokens. That is why a document that seems short to a person may still consume a large amount of model context if it contains dense formatting, code, tables, or complex terminology.
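Because limits and prices are denominated in tokens, it is often useful to estimate token counts before sending text. The sketch below uses a common rule of thumb of roughly four characters per token for English prose; both that ratio and the price are placeholder assumptions, not any provider's real rates.

```python
# Rough token and cost estimate from character count.
# CHARS_PER_TOKEN and PRICE_PER_1K_TOKENS are illustrative
# assumptions, not real provider figures.
CHARS_PER_TOKEN = 4
PRICE_PER_1K_TOKENS = 0.01  # hypothetical price

def estimate(text):
    """Return an approximate (token_count, cost) pair for a string."""
    tokens = max(1, len(text) // CHARS_PER_TOKEN)
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS
    return tokens, cost

tokens, cost = estimate("Short memo. " * 100)
print(f"~{tokens} tokens, ~${cost:.4f}")
```

Note that dense code, tables, or unusual terminology tends to pack fewer characters per token than prose, so a character-based estimate undercounts exactly the documents the surrounding paragraph warns about.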

Tokenization also affects behavior. Different token boundaries can influence how well a model handles names, rare words, code snippets, long numbers, or different languages. It is one of the reasons the same request, phrased two different ways, can produce noticeably different results.
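The boundary effect can be illustrated with the same greedy longest-match idea: whatever is in the vocabulary stays whole, and whatever is not falls apart. The vocabulary here is again invented for illustration.

```python
# Token boundaries depend on the vocabulary: common words survive as
# single tokens, while an unfamiliar number shatters into characters.
# This vocabulary is invented for illustration.
VOCAB = {"price", "is", " ", "20", "24"}

def tokenize(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no match: single character
            i += 1
    return tokens

print(tokenize("price is 2024", VOCAB))     # year splits into '20', '24'
print(tokenize("price is 7318529", VOCAB))  # each digit becomes its own token
```

A model sees the second number as seven separate pieces, which helps explain why long or rare numbers are handled less reliably than common words.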

How It Connects to Larger Systems

Tokenization is deeply tied to the context window, because every instruction, user message, retrieved document, and generated answer has to fit inside a token budget. It also matters for LLMs because next-token prediction is the task they are trained to solve.
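The budgeting problem above can be sketched as follows. This is a simplified illustration, assuming a hypothetical whitespace-based count_tokens() (real systems would use the model's own tokenizer) and made-up window sizes: retrieved documents are trimmed so that instructions, documents, and the question all fit, with room reserved for the answer.

```python
# Sketch of fitting a prompt into a token budget. The window size,
# reservation, and whitespace token counter are illustrative
# assumptions, not any real model's parameters.
CONTEXT_WINDOW = 100
RESERVED_FOR_ANSWER = 20

def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def build_prompt(instructions, question, documents):
    """Keep whole documents while they fit; truncate the first one that doesn't."""
    budget = CONTEXT_WINDOW - RESERVED_FOR_ANSWER
    budget -= count_tokens(instructions) + count_tokens(question)
    kept = []
    for doc in documents:
        words = doc.split()
        if len(words) <= budget:
            kept.append(doc)
            budget -= len(words)
        else:
            kept.append(" ".join(words[:budget]))  # trim to remaining budget
            break
    return "\n\n".join([instructions] + kept + [question])
```

The key point survives the simplification: every component competes for the same token budget, so adding more retrieved text means less room for everything else.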

For readers learning AI, tokenization is a useful reminder that language models do not see language the way people do. They operate on encoded pieces of text, not on human concepts directly, even when the final behavior feels conversational.

Related concepts: Large Language Model (LLM), Transformer, Context Window, and Prompt Engineering.