LLMs · Understanding GPT-2 Part 2
Why GPT-2 Cannot Read Text Directly - Part 2
An explanation of why tokenization is important in language modeling
Read the previous post: The Language Modeling Problem - Part 1
Why GPT-2 Cannot Read Text Directly
Humans see words, sentences, and paragraphs. Neural networks see numbers.
That creates the first practical problem in language modeling:
How do we convert raw text into something a neural network can process?
GPT-2 solves this using tokenization.
What is a token?
A token is a small unit of text represented by an integer ID.
For example, the sentence:
Hello, I am learning GPT-2.
might become something like:
[15496, 11, 314, 716, 4673, 402, 11571, 12, 17, 13]
The model does not directly process the characters H, e, l, l, o. It processes token IDs.
Each token ID is later converted into a vector through an embedding table.
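To make the ID-to-vector step concrete, here is a minimal sketch of an embedding lookup. The table size, the 4-dimensional vectors, the random values, and the token IDs are all assumptions for illustration. GPT-2's real table is learned during training and is far larger.

```python
import random

# Toy sizes, chosen only for illustration (not GPT-2's real dimensions).
VOCAB_SIZE = 10
EMBED_DIM = 4

random.seed(0)  # fixed seed so the sketch is reproducible

# The embedding table has one row per token ID; each row is a vector.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up one vector per token ID (a plain row lookup)."""
    return [embedding_table[t] for t in token_ids]

token_ids = [3, 1, 3]  # hypothetical IDs from a toy tokenizer
vectors = embed(token_ids)

print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The same ID always maps to the same row, so the first and third vectors here are identical. Anything that distinguishes the two positions has to come from later parts of the model.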
Why not use words?
A simple idea is to split text by spaces and treat each word as a token. But this creates problems.
First, language has many rare words. Names, misspellings, technical terms, and new words appear constantly. A word-level tokenizer would need a huge vocabulary.
Second, words have related forms:
run
running
runner
runs
A word-level tokenizer treats these as completely separate tokens, so the model has to learn from data alone that they are related.
Third, not all languages use spaces the same way. A simple whitespace tokenizer is too fragile.
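The first two problems are easy to reproduce with a toy word-level tokenizer. This is only a sketch: the tiny corpus, the `<unk>` fallback, and the function names are hypothetical and chosen for illustration.

```python
# A toy word-level tokenizer. The corpus, the <unk> convention, and all
# names here are hypothetical, chosen only to illustrate the problem.
UNK = "<unk>"

def build_vocab(corpus):
    """Assign an integer ID to every whitespace-separated word seen."""
    vocab = {UNK: 0}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map each word to its ID, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

vocab = build_vocab("i run every day and she runs every day")
ids = encode("i love running every day", vocab)

print(ids)                         # [1, 0, 0, 3, 4]
print(vocab["run"], vocab["runs"]) # 2 7
```

Both "love" and "running" collapse to the same unknown ID, so their meaning is lost. And "run" and "runs" get unrelated IDs even though they share a stem.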
Why not use characters?
Character-level tokenization avoids unknown words. Every word can be represented as characters.
But character sequences are long. The word:
internationalization
would require 20 character tokens. Long sequences are expensive for Transformers because the cost of self-attention grows quadratically with sequence length.
So word-level tokens are too coarse, and character-level tokens are too fine: they produce sequences that are too long.
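A rough back-of-the-envelope sketch shows how quickly the cost diverges. It assumes full self-attention scores every pair of positions and ignores constant factors and causal masking. The example sentence and the helper name are made up for illustration.

```python
def attention_pairs(seq_len):
    """Number of query-key pairs that full self-attention scores: n * n."""
    return seq_len * seq_len

sentence = "internationalization is hard"

word_tokens = sentence.split()  # one token per whitespace-separated word
char_tokens = list(sentence)    # one token per character, spaces included

for name, tokens in [("word", word_tokens), ("char", char_tokens)]:
    n = len(tokens)
    print(f"{name}-level: {n} tokens, {attention_pairs(n)} attention pairs")
```

The same sentence is 3 tokens at the word level and 28 at the character level, which means 9 versus 784 pairs to score.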
Subword tokenization
GPT-2 uses a subword tokenization approach based on byte-pair encoding (BPE), applied to bytes rather than Unicode characters.
Subword tokenization sits between words and characters.
Common words may become one token:
computer
Rare words may be split into pieces. For example (an illustrative split; the tokenizer's actual pieces may differ):
bioinformatics -> bio + inform + atics
This gives a good tradeoff:
- Frequent patterns are represented efficiently.
- Rare words can still be constructed from smaller pieces.
- The vocabulary remains manageable.
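The core idea of byte-pair encoding can be sketched in a few lines: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. This is a simplified, character-level toy version, not GPT-2's actual tokenizer, which works on bytes and adds extra pre-splitting rules. The word frequencies and function names are assumptions for illustration.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules, starting from single characters."""
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first on ties)
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words

# Hypothetical word counts from a tiny corpus.
word_freqs = {"run": 5, "runs": 3, "running": 4, "runner": 2}
merges, segmented = learn_bpe(word_freqs, num_merges=2)

print(merges)
for symbols, freq in segmented.items():
    print(" + ".join(symbols), freq)
```

After two merges, the frequent stem "run" has become a single symbol shared by all four words, while the rarer endings are still spelled out piece by piece. Real tokenizers run tens of thousands of merges on a large corpus.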
GPT-2 vocabulary
GPT-2 uses a vocabulary of 50,257 tokens.
That number is made up of 256 byte-level base tokens, 50,000 tokens learned through byte-pair merges, and one special end-of-text token.
The tokenizer maps text into token IDs. The model then maps token IDs into vectors.
Why tokenization matters
Tokenization affects almost everything:
- Sequence length
- Training cost
- Context window usage
- How the model handles rare words
- How the model handles code
- How the model handles multilingual text
- How many output classes the model predicts
A model with a 50,257-token vocabulary must output a probability distribution over 50,257 possible next tokens at every position.
So tokenization is not a preprocessing detail. It directly shapes the learning problem.
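To get a feel for what the vocabulary size means in practice, here is a small sketch. The hidden size of 768 is an assumption that matches the smallest GPT-2 model. The logits are placeholders, not real model outputs.

```python
import math

VOCAB_SIZE = 50257  # GPT-2 vocabulary size
EMBED_DIM = 768     # assumption: hidden size of the smallest GPT-2 model

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One logit per vocabulary entry; zeros stand in for real model outputs.
logits = [0.0] * VOCAB_SIZE
logits[11] = 5.0  # pretend the model favors token ID 11

probs = softmax(logits)
embedding_params = VOCAB_SIZE * EMBED_DIM

print(len(probs), "probabilities per position")
print(embedding_params, "parameters in the token embedding table")
```

Under these assumptions the embedding table alone holds about 38.6 million parameters, and every position in the sequence produces its own 50,257-way distribution.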
Tokenization and compression
A good tokenizer compresses text into fewer tokens while preserving useful structure.
If a tokenizer turns a paragraph into fewer tokens, the model can fit more content into the same context window. That means more usable context for the same computational budget.
This is why modern LLMs still care deeply about tokenizer design.
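One simple way to measure compression is characters per token. The sketch below compares two stand-in tokenizers against GPT-2's 1,024-token context window. The tokenizers and the sample text are hypothetical, and whitespace splitting is only a rough proxy for a coarser tokenizer.

```python
CONTEXT_WINDOW = 1024  # GPT-2's context length in tokens

def chars_per_token(text, tokenize):
    """Average number of characters of input covered by one token."""
    return len(text) / len(tokenize(text))

def char_tokenize(text):
    """One token per character."""
    return list(text)

def word_tokenize(text):
    """Hypothetical stand-in for a coarser tokenizer: split on whitespace."""
    return text.split()

text = "Tokenization is the bridge between human language and neural computation."

for name, fn in [("character", char_tokenize), ("whitespace", word_tokenize)]:
    ratio = chars_per_token(text, fn)
    # Roughly how many characters of similar text fit in one context window.
    capacity = int(ratio * CONTEXT_WINDOW)
    print(f"{name}: {ratio:.2f} chars/token, ~{capacity} chars per window")
```

The higher the characters-per-token ratio, the more text fits into the same window.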
Key takeaway
The second problem GPT-2, or any other LLM, has to solve is the text-to-number problem.
Raw text must become token IDs before it can become model input.
Tokenization is the bridge between human language and neural computation.