W_Q, W_K, W_V started as random numbers. Here's how they became meaningful.
Throughout Parts 1 to 5, we used weight matrices — W_Q, W_K, W_V, W_O, W₁, W₂ — without ever explaining where they came from. We just assumed they were meaningful. They aren't, at least not at the beginning.
Before training, every weight is a random number. W_Q doesn't know what a "query" means. W_K doesn't know what "knowledge" to encode. W_V doesn't know what "content" to extract. They're just noise.
Training is the process of turning those random numbers into meaningful ones.
What does "meaningful" mean? Simple: a weight is meaningful if it helps the model predict the next word correctly. That's the only criterion. No one hand-crafts the values of W_Q. No one tells W_K to encode "knowledge of street-stall restaurants." These structures emerge purely from one repeated question: given these tokens, what comes next?
After enough repetitions, the weights that minimize prediction error turn out to be exactly the weights that encode grammar, meaning, facts, and reasoning. Not because anyone designed them that way — because that's what accurate next-word prediction requires.
Training is nothing more than this: adjust W_Q, W_K, W_V, W_O, W₁, W₂ — and every other parameter — until the model's predictions are good enough to be useful.
Let's take stock of every parameter introduced across Parts 1 to 5.
Every decoder layer has its own independent copy of W_Q, W_K, W_V, W_O, W₁, W₂. Layer 1's W_Q has nothing to do with Layer 12's W_Q — completely separate matrices, learned separately, encoding different levels of abstraction.
| Model | Layers | d_model | Total parameters |
|---|---|---|---|
| GPT-2 small | 12 | 768 | ~117M |
| GPT-2 large | 36 | 1280 | ~774M |
| GPT-3 | 96 | 12288 | ~175B |
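The totals in the table can be roughly reproduced from the two columns beside them. A back-of-the-envelope sketch (the 4d² attention and 8d² feed-forward counts follow the matrix shapes from earlier parts; the 50,257-token vocabulary is GPT-2's, assumed here for the embedding term; biases, layer norms, and positional embeddings are ignored, so the estimates land near, not exactly on, the table's numbers):

```python
def approx_params(layers, d_model, vocab=50257):
    # Per layer: W_Q, W_K, W_V, W_O are each d x d      -> 4 * d^2
    # plus the feed-forward W1 (d x 4d) and W2 (4d x d) -> 8 * d^2
    per_layer = 12 * d_model ** 2
    # Token embedding matrix (vocab x d); biases, layer norms,
    # and positional embeddings are ignored in this estimate
    embedding = vocab * d_model
    return layers * per_layer + embedding

print(f"GPT-2 small: ~{approx_params(12, 768) / 1e6:.0f}M")
print(f"GPT-2 large: ~{approx_params(36, 1280) / 1e6:.0f}M")
```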
Every single one of those 175 billion numbers starts as a random value. Training adjusts all of them — simultaneously, in every step — guided by one signal: how wrong was the prediction?
Before we can measure how wrong the model is, it needs to make a prediction. That's the forward pass — everything we built in Parts 1 to 5, run in sequence.
In our example: Frank, Emma, and Daniel have spoken. The model predicts what Alex will say next.
At the end, x^{(L)} is Alex's final output vector. We project it into vocabulary space and apply softmax, producing a probability for every token in the vocabulary.
Suppose the correct answer is "spicy," and the model assigns it a probability of 0.45, spreading the remaining 0.55 across every other token. The forward pass is purely mechanical — no learning happens here. Learning only begins after we measure how wrong that prediction was.
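The final projection-and-softmax step can be sketched in a few lines. The toy vocabulary and logit values below are invented for illustration (they are chosen so that "spicy" lands near the 0.45 from the example; nothing else is from the text):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalize
    m = max(logits.values())
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for Alex's next word (illustrative values only,
# tuned so "spicy" comes out near 0.45)
logits = {"spicy": 2.0, "cheap": 1.5, "noodles": 1.2, "the": 0.2}
probs = softmax(logits)
print({tok: round(p, 3) for tok, p in probs.items()})
```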
GPT uses cross-entropy loss to measure how wrong the prediction was: \( L = -\log P(\text{correct token}) \).
Take the probability the model assigned to the correct answer, take its log, negate it. The better the model, the higher P(correct), the smaller the loss.
| P(correct token) | Loss = −log P | What happened |
|---|---|---|
| 0.02 | 3.912 | confident about the wrong answer — very high loss |
| 0.45 | 0.799 | our current prediction — moderate loss |
| 0.80 | 0.223 | mostly right — low loss |
| 0.99 | 0.010 | almost certain — near-zero loss |
The loss is computed for every token position — Frank, Emma, Daniel, and Alex all make predictions. The average loss over all positions is what gets passed to backpropagation.
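Both the per-prediction loss and the per-position averaging fit in a few lines. A minimal sketch (the four probabilities in the loop are the table's; the per-speaker probabilities at the end are invented for illustration, since only Alex's 0.45 comes from the example):

```python
import math

def cross_entropy(p_correct):
    # loss = -log P(correct token)
    return -math.log(p_correct)

# The four rows of the table above
for p in (0.02, 0.45, 0.80, 0.99):
    print(f"P = {p:.2f}  ->  loss = {cross_entropy(p):.3f}")

# One prediction per position; the probabilities for Frank, Emma,
# and Daniel are made up here
position_probs = [0.30, 0.60, 0.55, 0.45]
avg_loss = sum(cross_entropy(p) for p in position_probs) / len(position_probs)
print(f"average loss passed to backprop: {avg_loss:.3f}")
```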
The loss is a single number. It tells us the model was wrong — but not which weights caused the error, or how to fix them. That's what the backward pass does.
For every weight W, we want to know: if W increased slightly, would the loss go up or down, and by how much? This is the gradient, \( \partial L / \partial W \). Positive gradient: this weight is pushing the loss up, so decrease it. Negative gradient: this weight is pushing the loss down, so increase it. Large magnitude: big influence, move it more. Small magnitude: barely matters, move it less.
The forward pass is a long chain of operations. The chain rule says: to find how any weight W affects the final loss, multiply together how each step in the chain affects the next:

\( \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y_n} \cdot \frac{\partial y_n}{\partial y_{n-1}} \cdots \frac{\partial y_1}{\partial W} \)
Backpropagation applies this chain rule automatically, starting from the loss and flowing backwards through every operation.
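The chain rule is easy to verify numerically on a toy two-step chain: a single weight w produces h = w·x, and the loss is (h − t)². A hand-rolled sketch, not a framework call:

```python
# Toy chain: w -> h = w * x -> loss = (h - t)^2
x, t, w = 2.0, 3.0, 0.5

def loss(w):
    h = w * x
    return (h - t) ** 2

# Backprop: multiply the local derivatives along the chain
h = w * x
dloss_dh = 2 * (h - t)    # d(loss)/dh
dh_dw = x                 # dh/dw
grad = dloss_dh * dh_dw   # chain rule: d(loss)/dw

# Finite-difference check: nudge w a tiny amount and watch the loss
eps = 1e-6
numeric = (loss(w + eps) - loss(w)) / eps
print(grad, numeric)  # the two agree to several decimal places
```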
At the output softmax, the gradient has a clean form — predicted probabilities minus the one-hot correct answer: \( \frac{\partial L}{\partial z_i} = p_i - y_i \), where \( z \) are the logits and \( y \) is 1 at the correct token, 0 everywhere else.
The −0.55 at "spicy": push this logit higher — not confident enough about the right answer. Positive values elsewhere: push these logits lower — too much probability on wrong answers.
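Those numbers fall straight out of the p − y formula. A sketch using the example's 0.45 for "spicy" (the other probabilities are invented and only need to sum to 0.55):

```python
# Predicted distribution; only "spicy" = 0.45 comes from the example
probs = {"spicy": 0.45, "cheap": 0.30, "noodles": 0.15, "the": 0.10}
correct = "spicy"

# Gradient of cross-entropy w.r.t. the logits: p - y, with y one-hot
grad = {tok: p - (1.0 if tok == correct else 0.0) for tok, p in probs.items()}
print(f"spicy: {grad['spicy']:+.2f}")  # 0.45 - 1: push this logit up
print(f"cheap: {grad['cheap']:+.2f}")  # positive: push this logit down
```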
Suppose after backpropagation, one entry in W_Q¹ has gradient +0.023. If we increase this weight by 0.001, the loss increases by about 0.000023 (0.023 × 0.001, to first order). It is making the prediction worse. The fix: decrease it.
Every weight now has a gradient. The update rule: \( W \leftarrow W - \eta \cdot \frac{\partial L}{\partial W} \)
\(\eta\) is the learning rate — typically between 0.00003 and 0.0001. The W_Q¹ entry with gradient +0.023 and η = 0.0001 moves by \( -0.0001 \times 0.023 = -0.0000023 \).
A tiny nudge. But this happens simultaneously for every parameter — all 175 billion, each nudged in its own loss-reducing direction.
Why such a small learning rate? The gradient tells us which direction is downhill right now — but only locally. A large step might overshoot the valley. A small step follows the slope carefully, making slow but reliable progress.
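The overshoot problem shows up even in one dimension. A toy sketch minimizing loss = w², whose gradient is 2w:

```python
def descend(lr, steps=20, w=1.0):
    # Plain gradient descent on loss = w^2 (minimum at w = 0)
    for _ in range(steps):
        grad = 2 * w      # d(w^2)/dw
        w = w - lr * grad
    return w

print(descend(lr=0.1))  # small steps: w shrinks smoothly toward 0
print(descend(lr=1.1))  # too large: every step overshoots, and w blows up
```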
Adam, not vanilla gradient descent. Real transformers use Adam (Adaptive Moment Estimation), which adjusts the effective learning rate per parameter based on the history of past gradients. Parameters nudged consistently in the same direction get a larger step. Parameters with noisy gradients get a smaller step. Same core idea, but faster and more stable.
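Adam's per-parameter adaptation can be sketched in a few lines. This is a simplified single-parameter version with the standard defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8), not a production implementation:

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    # Bias correction for the zero-initialized averages
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Effective step shrinks when gradients are noisy (large v_hat)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 0.5, 0.0, 0.0
for t in range(1, 6):
    grad = 0.023  # the consistent gradient from the example above
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

With a perfectly steady gradient, Adam normalizes each update to roughly a full learning-rate-sized step, where vanilla gradient descent with the same η = 0.0001 would move the weight by only 0.0000023 per step.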
What happens over the course of training. Early on, the weights are random and the loss is high. Gradients are large, weights move quickly. After thousands of steps, basic patterns emerge. After tens of thousands, the loss slows its descent — the weights are refining, not discovering. By the end, the loss has converged.
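That arc — large gradients and fast movement early, then a long flat tail — appears even in the smallest possible model. A toy sketch: one weight, one training pair, plain gradient descent:

```python
# Toy "model": predict y = w * x for the single pair x = 1.0, target = 2.0
w, lr = 0.0, 0.1
losses = []
for step in range(50):
    pred = w * 1.0
    loss = (pred - 2.0) ** 2
    losses.append(loss)
    grad = 2 * (pred - 2.0)  # large when w is far off, tiny near convergence
    w = w - lr * grad

# Steep drop early, then a long, nearly flat tail
print(losses[0], losses[10], losses[49])
```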
No one told W_Q to detect "street-stall craving." No one told W_K to encode "knowledge of cheap places." These structures emerged from gradient descent alone — billions of small nudges, each one saying "this weight was slightly wrong, adjust it this much."
The model never saw a label that said "this is a noun" or "this is a causal relationship." It only ever saw: given these tokens, what comes next? Across enough text, the weights that answer that question well are the weights that understand language. Next-token prediction is a deceptively simple objective. The weights it produces are not simple at all.
W_Q, W_K, W_V started as random noise. Every training step asked one question: given these tokens, what comes next? Every wrong answer sent a gradient signal backwards through the entire network, nudging every weight by a fraction of a percent.
Billions of nudges later, the weights that minimize next-token prediction loss turned out to be the weights that understand language. Not because anyone designed them that way — because that's what the math converged to.