A probability distribution is a way of representing not just what might happen, but how likely each possible outcome is. Think of it as a landscape of possibilities: some peaks are tall (very likely), some are low (possible but unlikely), and the probabilities of all the outcomes together add up to exactly one.
In the context of a language model, the distribution lives over vocabulary — every word or word-fragment the model knows gets a score representing how plausible it is as the next token, given everything that came before. The model doesn’t flip a coin between two options. It is, at every step, working with a full probability distribution over tens of thousands of candidates, and sampling from it.
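To make the idea concrete, here is a minimal sketch of how raw scores could become a probability distribution over candidate tokens. It assumes the softmax function, a common way of doing this conversion; the four-token vocabulary and the scores are invented for illustration, where a real model would score tens of thousands of candidates.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores for the next token.
vocab = ["mat", "sofa", "moon", "piano"]
scores = [3.0, 2.0, 0.5, -1.0]

probs = softmax(scores)
for token, p in zip(vocab, probs):
    print(f"{token:>6}: {p:.3f}")
```

Every candidate ends up with some probability, however small, and the highest-scoring token gets the tallest peak.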
This is why language models are not lookup tables. The same prompt, given twice, can produce different responses — because the model is sampling from a distribution, not retrieving a fixed answer. The distribution itself is deterministic given the same inputs; the sampling introduces the variation. A setting called ‘temperature’ controls how sharply the distribution is peaked: high temperature flattens it (more randomness), low temperature sharpens it (more predictable).
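The sketch below illustrates both halves of that paragraph under stated assumptions: temperature is applied by dividing the scores before a softmax (a common convention, assumed here), and the toy vocabulary, scores, and function name `softmax_with_temperature` are hypothetical.

```python
import math
import random

def softmax_with_temperature(scores, temperature):
    """Convert scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores for the next token.
vocab = ["mat", "sofa", "moon", "piano"]
scores = [3.0, 2.0, 0.5, -1.0]

sharp = softmax_with_temperature(scores, 0.5)  # low temperature: more peaked
flat = softmax_with_temperature(scores, 2.0)   # high temperature: flatter

print("sharp:", [round(p, 3) for p in sharp])
print("flat: ", [round(p, 3) for p in flat])

# The distribution is deterministic: same inputs, same probabilities.
assert softmax_with_temperature(scores, 1.0) == softmax_with_temperature(scores, 1.0)

# The variation comes from sampling. A seeded generator is used here only
# so that the sketch is repeatable from run to run.
rng = random.Random(0)
samples = rng.choices(vocab, weights=flat, k=5)
print(samples)
```

Running the sampling step with an unseeded generator would give different draws each time, even though `flat` itself never changes.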
The concept matters beyond AI. A probability distribution is how statisticians describe any uncertain system — the weather, financial markets, election outcomes. The world rarely offers certainty; it offers distributions. Learning to think in distributions rather than fixed predictions is one of the more practically useful shifts available to anyone who reasons about complex things.
Essay
You Are a Probability Cloud. Don't Collapse Too Soon.
Six things success is made of, and what machine learning can quietly teach you about all of them.