When a language model reads your message, it doesn't see letters or words in the way a human would. It sees tokens — small chunks of text that are the model's actual unit of currency. 'Unbelievable' might be one token or three, depending on how common it is in the training data. 'The' is almost certainly one. A space before a word is sometimes folded into its token, sometimes not. The exact mapping is determined by the model's tokenizer, a separate component trained to break any string of text into these units efficiently.
Why not just use words? Because language is messier than dictionaries suggest. New words appear; names are novel; languages mix and blend. Tokens — typically averaging around three to four characters in English — offer a practical middle ground between letters (too granular) and words (too brittle for edge cases). They let the model handle any text it encounters, including things it has never seen before.
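To make the idea concrete, here is a minimal sketch of a tokenizer that splits text by greedy longest match against a small vocabulary, falling back to single characters for anything it does not recognize. The vocabulary and the `tokenize` function are hypothetical and chosen only for illustration. Real tokenizers learn vocabularies of many thousands of entries from data and often use a different splitting procedure, so treat this as a simplification rather than a description of any particular model.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from training data.
# Note that some entries carry a leading space, as described above.
VOCAB = {"un", "believ", "able", "the", " the", " token", "s", " is"}


def tokenize(text, vocab=VOCAB):
    """Split text by greedy longest match, falling back to single characters."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so no input is rejected.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("the tokens"))    # ['the', ' token', 's']
print(tokenize("zq"))            # ['z', 'q']
```

The last call shows the practical payoff: a string the vocabulary has never seen still breaks down into valid tokens, just less efficiently.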
The number of tokens in a piece of text has practical consequences. Model providers typically charge by token. Context windows, the amount of text a model can hold in working memory at once, are measured in tokens. A typical novel is around 100,000 to 120,000 tokens. A long conversation can fill a context window faster than expected, because earlier messages usually stay in the window alongside each new one.
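The arithmetic behind those consequences can be sketched in a few lines. The figures below are assumptions, not real limits or prices: four characters per token is a rough English average, and the window size and price are invented for illustration. An accurate count would come from the tokenizer of the model in question.

```python
import math

CHARS_PER_TOKEN = 4.0            # assumed English average; varies by tokenizer
CONTEXT_WINDOW = 128_000         # hypothetical window size, in tokens
PRICE_PER_MILLION_TOKENS = 3.00  # hypothetical price, in dollars


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on character count alone."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(n_tokens: int) -> float:
    """Cost in dollars at the hypothetical per-token price."""
    return n_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


def fits_in_context(n_tokens: int) -> bool:
    """Whether the text would fit in the hypothetical context window."""
    return n_tokens <= CONTEXT_WINDOW


# Stand-in for a novel of about 450,000 characters.
novel = "x" * 450_000
n = estimate_tokens(novel)
print(n, fits_in_context(n), round(estimate_cost(n), 4))
```

Under these assumptions the stand-in novel comes to 112,500 tokens, which fits in the window with little room to spare.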
Knowing this makes the probability cloud metaphor more precise. The distribution a language model works from is, strictly speaking, not over possible words but over possible tokens. The cloud resolves one token at a time, and each choice shapes the distribution for the next. The sentence you are reading emerged from a chain of those collapses.
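That token-by-token process can be sketched as a small sampling loop. The probability table here is hypothetical and conditions only on the previous token, which is a drastic simplification: a real model computes its distribution from the whole preceding context. The shape of the loop is the point: pick one token, then let that pick determine the next distribution.

```python
import random

# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT = {
    "<start>": {"The": 0.7, "A": 0.3},
    "The": {" cloud": 0.6, " token": 0.4},
    "A": {" cloud": 0.5, " token": 0.5},
    " cloud": {" resolves": 0.8, "<end>": 0.2},
    " token": {" resolves": 0.5, "<end>": 0.5},
    " resolves": {"<end>": 1.0},
}


def generate(rng, max_tokens=10):
    """Sample one token at a time until the end marker or the length limit."""
    tokens = []
    current = "<start>"
    for _ in range(max_tokens):
        dist = NEXT[current]
        # Draw a single token in proportion to its probability.
        current = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if current == "<end>":
            break
        tokens.append(current)
    return "".join(tokens)


print(generate(random.Random(0)))
```

Running it with different seeds gives different sentences from the same table, since each draw changes which distribution is consulted next.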