The Language Modeling Problem: How GPT-2 Learns Without Manual Labels - Part 1

Read the next post: Why GPT-2 Cannot Read Text Directly - Part 2

Most machine learning problems begin with a dataset of inputs and labels. For image classification, the input is an image and the label might be cat or dog. For sentiment analysis, the input is a review and the label might be positive or negative.

Language modeling is different.

GPT-2 does not need humans to label every sentence. Instead, it learns from raw text by solving a simple but powerful task:

Given previous tokens, predict the next token.

This is called next-token prediction, and it is the foundation of GPT-style language models.

Why next-token prediction is powerful

At first, predicting the next token sounds too simple. But to do it well, the model must learn many things at once.

If the text is:

The capital of France is

then the next token is likely:

Paris

To predict this, the model needs factual knowledge.

If the text is:

She opened the umbrella because it started to

then the next token is likely rain, or something close to it such as pour or drizzle. To predict this, the model needs common-sense knowledge about why people open umbrellas.

If the text is:

The professor told the students that they

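then a good continuation needs a verb that agrees with the plural "they" and makes sense given who "they" most plausibly refers to (the students). The model must track both grammar and context to continue correctly.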

So doing well at next-token prediction pushes the model to pick up grammar, facts, style, reasoning patterns, and world knowledge, all from text alone.

Self-supervision

Next-token prediction is a form of self-supervised learning: the labels come from the data itself.

For a sequence of tokens:

[t1, t2, t3, t4, t5]

we can create training examples automatically:

t1 -> t2
t1, t2 -> t3
t1, t2, t3 -> t4
t1, t2, t3, t4 -> t5

No human annotation is required. The next token is already present in the text.
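The construction above can be sketched in a few lines of Python. This is a minimal illustration, not GPT-2's actual data pipeline: the function name make_examples is hypothetical, and the strings t1 to t5 stand in for real token IDs.

```python
def make_examples(tokens):
    """Turn one token sequence into (context, next_token) training pairs."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[:i]   # every token before position i
        target = tokens[i]     # the token the model should predict
        examples.append((context, target))
    return examples


tokens = ["t1", "t2", "t3", "t4", "t5"]
pairs = make_examples(tokens)

for context, target in pairs:
    print(", ".join(context), "->", target)
```

A sequence of 5 tokens yields 4 training examples. In practice a Transformer does not build these pairs one by one: it typically takes tokens[:-1] as the input and tokens[1:] as the targets, and scores every position in a single pass.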

This is one of the main reasons large language models (LLMs) scale well. The internet contains huge amounts of raw text, and raw text can be turned into training data automatically, with no labeling cost.

The GPT-2 objective

GPT-2 is an autoregressive language model. “Autoregressive” means it predicts each token using only the tokens that come before it, never the ones that come after.

It models:

P(next token | previous tokens)

P(x_t \mid x_{<t})
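Here x_t is the token at position t and x_{<t} is every token before it. It may help to see where this conditional comes from. By the chain rule of probability, the probability of a whole sequence of T tokens factors into a product of next-token probabilities (for t = 1 the context is empty, so the first factor is simply P(x_1)):

P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t})

Taking the logarithm turns the product into a sum:

\log P(x_1, \dots, x_T) = \sum_{t=1}^{T} \log P(x_t \mid x_{<t})

So a model that predicts each next token well is, by the same measure, a model that assigns high probability to the entire text.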

During training, the model receives a window of tokens (up to 1,024 for GPT-2) and learns to assign high probability to the correct next token at every position in that window.

The training loss is the negative log-likelihood of the correct next tokens, summed over all T positions in the sequence:

\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})

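In other words, the loss is the cross-entropy between the model's predicted distribution and the token that actually comes next. If the model assigns low probability to the correct token, the loss is high. If it assigns high probability to the correct token, the loss is low. In practice the sum is usually averaged over the number of tokens so that the loss does not depend on sequence length.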
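To make the loss concrete, here is a small sketch that evaluates the formula on toy numbers. It assumes the model outputs one vector of raw scores (logits) per position, which a softmax turns into probabilities over the vocabulary. The 3-token vocabulary, the logit values, and the function names are all made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Sum of -log P(correct token) over all positions."""
    loss = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        loss += -math.log(probs[target])
    return loss


# Two positions, vocabulary of 3 tokens; the correct tokens are IDs 0 and 1.
targets = [0, 1]
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # high score on the right token
unsure = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # uniform guess
wrong = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0]]      # high score on a wrong token

loss_confident = next_token_loss(confident, targets)
loss_unsure = next_token_loss(unsure, targets)
loss_wrong = next_token_loss(wrong, targets)

print(loss_confident, loss_unsure, loss_wrong)
```

The three losses come out in increasing order: confident and correct is cheapest, a uniform guess costs log(3) per position, and confident but wrong is the most expensive. Training adjusts the model's weights to move its predictions toward the first case.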

Why this creates a general-purpose model

The interesting part is that GPT-2 is not trained separately for translation, summarization, question answering, or writing. It is trained only to predict text.

But many tasks can be expressed as text continuation.

Question answering:

Question: What is photosynthesis?
Answer:

Summarization:

Article: ...
Summary:

Translation:

English: Good morning
French:

Because the model has seen similar patterns in its training text, the most likely continuation of such a prompt is often the answer, summary, or translation itself.
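The prompts above share one interface: build a string, then let the model keep predicting the next token. The sketch below shows that loop under loud assumptions. The names build_prompt, continue_text, and toy_predict_next are hypothetical, and toy_predict_next is a hard-coded lookup standing in for a trained model, so its output demonstrates the mechanics only, not what GPT-2 would actually produce.

```python
def build_prompt(task, text):
    """Express a task as a piece of text for the model to continue."""
    templates = {
        "qa": "Question: {text}\nAnswer:",
        "summarize": "Article: {text}\nSummary:",
        "translate": "English: {text}\nFrench:",
    }
    return templates[task].format(text=text)


def continue_text(prompt, predict_next, max_new_tokens=20, stop="\n"):
    """Autoregressive loop: append one predicted token at a time."""
    text = prompt
    for _ in range(max_new_tokens):
        token = predict_next(text)  # the model sees only what came before
        if token == stop:
            break
        text += token
    return text[len(prompt):]  # return only the newly generated part


def toy_predict_next(text):
    """Stand-in for a trained model: a canned lookup, NOT a real language model."""
    canned = {"English: Good morning\nFrench:": " Bonjour"}
    return canned.get(text, "\n")


prompt = build_prompt("translate", "Good morning")
completion = continue_text(prompt, toy_predict_next)
print(repr(completion))
```

Nothing in continue_text knows which task it is performing. Swapping the template changes the task while the prediction loop stays the same, which is the sense in which one next-token predictor can serve many purposes.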

Why GPT-2 was important

GPT-2 showed that if we train a large enough Transformer on enough text, next-token prediction can produce surprisingly general behavior. It was not just a model for one narrow task. It was a model that could attempt many tasks through prompting alone, without task-specific training.

That idea became one of the foundations of modern LLMs.

Key takeaway

The first problem GPT-2 solves is the label problem. Instead of requiring manually labeled datasets, it turns raw text into its own supervision signal.

That is the power of language modeling:

Text becomes both the input and the label.