How the Weights Learn

W_Q, W_K, W_V started as random numbers. Here's how they became meaningful.


1. What is training?

Throughout Parts 1 to 5, we used weight matrices — W_Q, W_K, W_V, W_O, W₁, W₂ — without ever explaining where they came from. We just assumed they were meaningful. They aren't, at least not at the beginning.

Before training, every weight is a random number. W_Q doesn't know what a "query" means. W_K doesn't know what "knowledge" to encode. W_V doesn't know what "content" to extract. They're just noise.

Training is the process of turning those random numbers into meaningful ones.

What does "meaningful" mean? Simple: a weight is meaningful if it helps the model predict the next word correctly. That's the only criterion. No one hand-crafts the values of W_Q. No one tells W_K to encode "knowledge of street-stall restaurants." These structures emerge purely from one repeated question: given these tokens, what comes next?

1. Show the model some text — a sequence of tokens from the training data.
2. Ask it to predict the next word — run the full forward pass, Parts 1–5.
3. Check how wrong it was — compute loss, a single number measuring the error.
4. Nudge every weight slightly — in the direction that makes the prediction less wrong.
5. Repeat billions of times — GPT-3 trained on 300 billion tokens.

After enough repetitions, the weights that minimize prediction error turn out to be exactly the weights that encode grammar, meaning, facts, and reasoning. Not because anyone designed them that way — because that's what accurate next-word prediction requires.

Training is nothing more than this: adjust W_Q, W_K, W_V, W_O, W₁, W₂ — and every other parameter — until the model's predictions are good enough to be useful.
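The five-step loop above can be sketched as a minimal runnable toy. This is a deliberately tiny stand-in: a three-weight linear model minimizing a squared error with a hand-coded gradient, not cross-entropy through a transformer — but the loop structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=3)              # every weight starts as a random number

x, target = np.array([1.0, 2.0, 3.0]), 5.0   # the "tokens" and the "next word"

def forward(W, x):
    # Step 2: make a prediction (a trivial linear model standing in
    # for the full forward pass of Parts 1-5)
    return x @ W

losses = []
lr = 0.01                           # learning rate
for step in range(200):             # Step 5: repeat (billions of times, in reality)
    pred = forward(W, x)            # Steps 1-2: show the data, predict
    loss = (pred - target) ** 2     # Step 3: a single number measuring the error
    grad = 2 * (pred - target) * x  # Step 4a: how each weight affects the loss
    W -= lr * grad                  # Step 4b: nudge every weight slightly
    losses.append(loss)
```

The loss shrinks toward zero as the random W is nudged into values that predict the target. The real thing differs in scale and in what sits between steps 1 and 3, not in the shape of the loop.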


2. The parameters we're adjusting

Let's take stock of every parameter introduced across Parts 1 to 5.

In each attention layer
- W_Q, W_K, W_V — project X′ into query, key, and value space (Parts 1, 2)
- W_O — projects the concatenated multi-head output back to d_model (Part 2)

In each FFN layer
- W₁, W₂ — expand and compress each token's representation (Part 5)
- b₁, b₂ — bias terms; they shift the activation threshold of each neuron

Shared across the model
- W_embed — converts token IDs into vectors (the embedding matrix)
- W_vocab — projects the final layer's output into vocabulary space
- P — positional encoding; fixed, not learned in the original Transformer

Every decoder layer has its own independent copy of W_Q, W_K, W_V, W_O, W₁, W₂. Layer 1's W_Q has nothing to do with Layer 12's W_Q — completely separate matrices, learned separately, encoding different levels of abstraction.

Model       | Layers | d_model | Total parameters
------------|--------|---------|-----------------
GPT-2 small | 12     | 768     | ~117M
GPT-2 large | 36     | 1280    | ~774M
GPT-3       | 96     | 12288   | ~175B

Every single one of those 175 billion numbers starts as a random value. Training adjusts all of them — simultaneously, in every step — guided by one signal: how wrong was the prediction?
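Those totals can be sanity-checked with a back-of-the-envelope count of just the matrices listed above. The sketch below ignores biases, LayerNorm parameters, and positions, assumes W_vocab is tied to W_embed, and assumes a GPT-2-style vocabulary of 50,257 throughout — so it lands within a few percent of the table for GPT-2 large and GPT-3, and slightly above the often-quoted 117M for GPT-2 small (published counts differ in what they include):

```python
def approx_params(layers, d_model, vocab=50257, d_ff_mult=4):
    """Rough parameter count for a GPT-style decoder stack:
    only the big matrices, no biases/LayerNorm/positions."""
    attn = 4 * d_model * d_model               # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (d_ff_mult * d_model)  # W_1 (expand) and W_2 (compress)
    embed = vocab * d_model                    # W_embed (assumed tied with W_vocab)
    return layers * (attn + ffn) + embed

print(f"GPT-2 small: ~{approx_params(12, 768) / 1e6:.0f}M")
print(f"GPT-2 large: ~{approx_params(36, 1280) / 1e6:.0f}M")
print(f"GPT-3:       ~{approx_params(96, 12288) / 1e9:.0f}B")
```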


3. Forward pass — making a prediction

Before we can measure how wrong the model is, it needs to make a prediction. That's the forward pass — everything we built in Parts 1 to 5, run in sequence.

In our example: Frank, Emma, and Daniel have spoken. The model predicts what Alex will say next.

\[ X \xrightarrow{+P} X' \xrightarrow{W_Q, W_K, W_V} \text{masked attn} \xrightarrow{+\text{res},\ \text{LN}} \text{FFN}(W_1, W_2) \xrightarrow{+\text{res},\ \text{LN}} x^{(1)} \xrightarrow{\;\cdots\;} x^{(L)} \]

At the end, x^{(L)} is Alex's final output vector. We project it into vocabulary space and apply softmax:

\[ \text{logits} = x^{(L)} \cdot W_{\text{vocab}}, \qquad P(\text{token}_i) = \frac{e^{\text{logit}_i}}{\sum_j e^{\text{logit}_j}} \]

Suppose the model produces:

\[ P = \begin{bmatrix} \text{spicy} & \text{cheap} & \text{fast} & \text{cozy} & \text{flavor} & \text{vibe} \\ 0.45 & 0.30 & 0.15 & 0.05 & 0.03 & 0.02 \end{bmatrix} \]

The correct answer is "spicy" — the model assigned it 0.45. The forward pass is purely mechanical — no learning happens here. Learning only begins after we measure how wrong that prediction was.
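In code, the projection-plus-softmax step is a few lines. The logits below are hypothetical, chosen so the resulting probabilities match the example; real logits come out of x^(L) · W_vocab:

```python
import numpy as np

tokens = ["spicy", "cheap", "fast", "cozy", "flavor", "vibe"]
# Hypothetical logits, chosen to reproduce the probabilities in the text.
logits = np.log(np.array([0.45, 0.30, 0.15, 0.05, 0.03, 0.02]))

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

P = softmax(logits)
for tok, p in zip(tokens, P):
    print(f"{tok:7s} {p:.2f}")
```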


4. Loss — how wrong were we?

GPT uses cross-entropy loss to measure how wrong the prediction was:

\[ \mathcal{L} = -\log P(\text{correct token}) \]

Take the probability the model assigned to the correct answer, take its natural log, and negate it. The better the model, the higher P(correct), and the smaller the loss.

P(correct token) | Loss = −log P | What happened
-----------------|---------------|--------------
0.02             | 3.912         | confident about the wrong answer — very high loss
0.45             | 0.799         | our current prediction — moderate loss
0.80             | 0.223         | mostly right — low loss
0.99             | 0.010         | almost certain — near-zero loss

The loss is computed for every token position — Frank, Emma, Daniel, and Alex all make predictions. The average loss over all positions is what gets passed to backpropagation.
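The table's numbers are straightforward to reproduce (natural log, as is standard for cross-entropy):

```python
import math

def cross_entropy(p_correct):
    """Loss for one position: -log P(correct token)."""
    return -math.log(p_correct)

for p in (0.02, 0.45, 0.80, 0.99):
    print(f"P(correct) = {p:4.2f} -> loss = {cross_entropy(p):.3f}")
```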


5. Backward pass — gradients flow back

The loss is a single number. It tells us the model was wrong — but not which weights caused the error, or how to fix them. That's what the backward pass does.

The core question

For every weight W, we want to know:

\[ \frac{\partial \mathcal{L}}{\partial W} \quad \text{— if I increase this weight by a tiny amount, how much does the loss change?} \]

This is the gradient. Positive gradient: this weight is pushing the loss up, decrease it. Negative gradient: this weight is pushing the loss down, increase it. Large magnitude: big influence, move it more. Small magnitude: barely matters, move it less.

The chain rule

The forward pass is a long chain of operations. The chain rule says: to find how any weight W affects the final loss, multiply together how each step in the chain affects the next:

\[ \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial P} \cdot \frac{\partial P}{\partial \text{logits}} \cdot \frac{\partial \text{logits}}{\partial x^{(L)}} \cdot \frac{\partial x^{(L)}}{\partial \cdots} \cdot \frac{\partial \cdots}{\partial W} \]

Backpropagation applies this chain rule automatically, starting from the loss and flowing backwards through every operation.

What the gradient looks like at the output

At the output softmax, the gradient has a clean form — predicted probabilities minus the one-hot correct answer:

\[ \nabla_{\text{logits}} = P - \mathbf{1}_{\text{correct}} = [-0.55,\ 0.30,\ 0.15,\ 0.05,\ 0.03,\ 0.02] \]

The −0.55 at "spicy": push this logit higher — not confident enough about the right answer. Positive values elsewhere: push these logits lower — too much probability on wrong answers.
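This "probabilities minus one-hot" form falls out of differentiating softmax composed with cross-entropy, and it is one line of code:

```python
import numpy as np

P = np.array([0.45, 0.30, 0.15, 0.05, 0.03, 0.02])  # model's predicted distribution
one_hot = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # "spicy" (index 0) is correct

grad_logits = P - one_hot   # negative at the correct token, positive elsewhere
print(grad_logits)
```

Note that the entries sum to zero: probability pushed toward the correct token is exactly the probability pulled away from the wrong ones.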

How gradients flow back through the network

- loss — starting point: ∂L/∂logits = [−0.55, 0.30, 0.15, 0.05, 0.03, 0.02]
- W_vocab — flows back through logits = x⁽ᴸ⁾ · W_vocab, producing gradients for W_vocab and x⁽ᴸ⁾
- LayerNorm — differentiable; the gradient passes through, slightly redistributed across dimensions
- residual — addition has gradient 1 in both branches, so the gradient splits across the FFN path and the skip path simultaneously
- W₁, W₂ — through W₂ → ReLU (inactive neurons get zero gradient, no update) → W₁; both accumulate gradients
- residual — splits again across the attention path and the skip path
- W_Q, W_K, W_V — through A·V → softmax → QKᵀ/√d, producing gradients for W_Q, W_K, W_V
- X′ = X + P — P is fixed; the gradient flows to X and on to W_embed, which also gets updated

Suppose that after backpropagation, one entry in W_Q¹ (layer 1's W_Q) has gradient +0.023. If we increase this weight by 0.001, the loss rises by roughly 0.023 × 0.001 = 0.000023. This weight is making the prediction worse; the fix is to decrease it.
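That interpretation of the gradient can be checked numerically. The function below is a stand-in for "the loss as a function of one weight" — a toy quadratic, not the real transformer loss — set up so the gradient at the current weight is +0.023:

```python
def loss(w):
    # A toy smooth loss; its gradient is dL/dw = w - 1.0
    return 0.5 * (w - 1.0) ** 2

w = 1.023                 # chosen so the gradient here is +0.023
grad = w - 1.0            # analytic gradient: +0.023

h = 0.001                 # increase the weight by 0.001...
delta = loss(w + h) - loss(w)
print(delta)              # ...and the loss rises by about grad * h = 0.000023
```

The measured change matches grad × h to first order; the tiny discrepancy is the curvature term that the gradient, being a local quantity, ignores.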


6. Gradient descent — update the weights

Every weight now has a gradient. The update rule:

\[ W \leftarrow W - \eta \cdot \frac{\partial \mathcal{L}}{\partial W} \]

\(\eta\) is the learning rate — typically somewhere between 0.00003 and 0.0001. The W_Q¹ entry with gradient +0.023 and η = 0.0001:

\[ W \leftarrow W - 0.0001 \times 0.023 = W - 0.0000023 \]

A tiny nudge. But this happens simultaneously for every parameter — all 175 billion, each nudged in its own loss-reducing direction.

Why such a small learning rate? The gradient tells us which direction is downhill right now — but only locally. A large step might overshoot the valley. A small step follows the slope carefully, making slow but reliable progress.

Adam, not vanilla gradient descent. Real transformers use Adam (Adaptive Moment Estimation), which adjusts the effective learning rate per parameter based on the history of past gradients. Parameters nudged consistently in the same direction get a larger step. Parameters with noisy gradients get a smaller step. Same core idea, but faster and more stable.
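Both update rules fit in a few lines. The Adam below is simplified from Kingma & Ba's 2015 formulation; real optimizers add weight decay and other details:

```python
import math

def sgd_step(w, grad, lr=1e-4):
    """Vanilla gradient descent: W <- W - eta * dL/dW."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter (t is the step count, from 1)."""
    m = b1 * m + (1 - b1) * grad        # running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# The W_Q entry from above: gradient +0.023, eta = 0.0001
w = 0.5
print(sgd_step(w, 0.023))   # nudged down by 0.0000023
```

One consequence worth noticing: Adam's very first step has magnitude close to η itself, regardless of how large or small the raw gradient is — the per-parameter rescaling the text describes.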


7. One step, billions of times

1. Take a batch of sequences — a slice of the training data, millions of tokens at once.
2. Forward pass — compute predictions for every token position in the batch.
3. Compute loss — average cross-entropy over every position in the batch.
4. Backward pass — compute gradients for every weight via backpropagation.
5. Update weights — W ← W − η · ∇W; every parameter nudged in its loss-reducing direction.
6. Repeat — GPT-3: 300 billion tokens, 175 billion parameters, hundreds of thousands of steps.

What happens over the course of training. Early on, the weights are random and the loss is high. Gradients are large, weights move quickly. After thousands of steps, basic patterns emerge. After tens of thousands, the loss slows its descent — the weights are refining, not discovering. By the end, the loss has converged.
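That trajectory — a fast early drop, then slow refinement — shows up even in a tiny stand-in experiment. Here a single W_vocab-style matrix is trained by exactly the loop above, with synthetic "hidden states" and "next tokens" in place of a real model and corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n = 4, 8, 256

# Synthetic data: random "final hidden states" and next-token labels
# produced by a hidden linear rule the weights have to discover.
X = rng.normal(size=(n, d_model))
W_true = rng.normal(size=(d_model, vocab))
targets = (X @ W_true).argmax(axis=1)

W_vocab = 0.01 * rng.normal(size=(d_model, vocab))    # random initialization

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

losses, lr = [], 0.5
for step in range(300):
    P = softmax(X @ W_vocab)                          # forward pass
    loss = -np.log(P[np.arange(n), targets]).mean()   # average cross-entropy
    G = (P - np.eye(vocab)[targets]) / n              # gradient at the logits: P - one_hot
    W_vocab -= lr * (X.T @ G)                         # backprop + update in one line
    losses.append(loss)
```

The loss starts near log(4) ≈ 1.386 (a uniform guess over 4 tokens), plunges in the first few dozen steps, then creeps downward — refining, not discovering.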

No one told W_Q to detect "street-stall craving." No one told W_K to encode "knowledge of cheap places." These structures emerged from gradient descent alone — billions of small nudges, each one saying "this weight was slightly wrong, adjust it this much."

The model never saw a label that said "this is a noun" or "this is a causal relationship." It only ever saw: given these tokens, what comes next? Across enough text, the weights that answer that question well are the weights that understand language. Next-token prediction is a deceptively simple objective. The weights it produces are not simple at all.


The full picture

- init — random initialization: all weights — W_Q, W_K, W_V, W_O, W₁, W₂ — start as random numbers.
- fwd — forward pass (Parts 1–5): X′ → masked attention → residual → LN → FFN → residual → LN → × L layers → logits → P.
- L — loss, −log P(correct token): P = 0.45 → L = 0.799. A single number: how wrong was the prediction?
- bwd — backward pass (chain rule): ∂L/∂W for every weight, flowing back through all layers to W_Q, W_K, W_V.
- ↓W — weight update, W ← W − η · ∇W: every weight nudged in the loss-reducing direction.
- ×N — repeat billions of times: random weights → meaningful representations → language understanding.

W_Q, W_K, W_V started as random noise. Every training step asked one question: given these tokens, what comes next? Every wrong answer sent a gradient signal backwards through the entire network, nudging every weight by a fraction of a percent.

Billions of nudges later, the weights that minimize next-token prediction loss turned out to be the weights that understand language. Not because anyone designed them that way — because that's what the math converged to.