Tokens
Tokens are the basic units of text that AI language models process, typically representing words, parts of words, or individual characters.
Tokens are the fundamental units that large language models use to read and generate text. A tokenizer breaks input text into these units before the model processes them. In English, one token corresponds to roughly 3/4 of a word on average, so 100 tokens is approximately 75 words. However, common words may be a single token, while uncommon or technical words may be split into multiple tokens.
Tokenization varies between different AI models. For example, the word "unhappiness" might be split into "un", "happiness" or "un", "happi", "ness" depending on the tokenizer. Code, numbers, and non-English languages often require more tokens per character of text. Understanding tokenization is important because AI models have a maximum context window measured in tokens, and pricing is typically based on the number of input and output tokens processed.
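To see this in practice, the sketch below uses OpenAI's tiktoken library (`pip install tiktoken`) to tokenize the same word under two encodings. The exact splits depend on your tiktoken version and the chosen encoding; tokenizers from other providers will differ.

```python
# A minimal sketch using tiktoken; actual splits vary by encoding and version.
import tiktoken

for name in ("cl100k_base", "o200k_base"):  # GPT-4-era and GPT-4o-era encodings
    enc = tiktoken.get_encoding(name)
    tokens = enc.encode("unhappiness")
    pieces = [enc.decode([t]) for t in tokens]  # text of each individual token
    print(f"{name}: {len(tokens)} tokens -> {pieces}")
```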
Tokens directly impact both the cost and capability of AI applications. A larger context window lets the model reference more information, but each token adds to processing time and cost. Efficient prompt engineering minimizes unnecessary tokens while maximizing the useful context provided to the model. Modern models support context windows ranging from 8,000 to over 1,000,000 tokens.
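Because billing is per token, a rough cost estimate is simple arithmetic. The prices in this sketch are placeholders rather than current rates; check your provider's pricing page for real numbers.

```python
# Hypothetical per-million-token prices (substitute your provider's actual rates).
PRICE_IN_PER_M = 2.50    # $ per 1M input tokens (placeholder)
PRICE_OUT_PER_M = 10.00  # $ per 1M output tokens (placeholder)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single request."""
    return (input_tokens / 1_000_000) * PRICE_IN_PER_M + \
           (output_tokens / 1_000_000) * PRICE_OUT_PER_M

# e.g. a 10,000-token prompt that produces a 1,000-token reply
print(f"${estimate_cost(10_000, 1_000):.4f}")  # -> $0.0350
```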
Real-World Examples
- The sentence 'Hello, world!' is typically 4 tokens: 'Hello', ',', ' world', '!' (see the snippet after this list)
- OpenAI's GPT-4o charges per million input and output tokens processed
- Claude's context window of 200K tokens can hold approximately 150,000 words
- A typical novel of 80,000 words uses roughly 100,000 tokens
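The first example is easy to check yourself. This assumes tiktoken's cl100k_base encoding; other tokenizers may count the same sentence differently.

```python
# Count the tokens in 'Hello, world!'; typically 4 under cl100k_base.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello, world!")
print(len(tokens), [enc.decode([t]) for t in tokens])
```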