The Decoder — Putting It All Together

Masked attention is only the first step. Each layer also thinks alone, stabilizes itself, and passes the result forward.


1. One layer, three steps

Everything we have built so far — positional encoding, masked attention — feeds into a single decoder layer. But masked attention is only the first of three steps inside that layer.

Masked Attention
Each token gathers from those who came before. Part 4.
Add & Norm
Add input back. Normalize. Twice — once after attention, once after FFN.
FFN (MLP)
Each token thinks alone. No communication — just computation.

More precisely, one decoder layer runs:

x₁ = LayerNorm( x + MaskedAttention(x) )
x₂ = LayerNorm( x₁ + FFN(x₁) )

Two sub-layers, each wrapped in the same pattern: do something, add the input back, normalize. The input x is X' — the position-aware feature matrix from Part 3. x₂ is the output that flows to the next decoder layer.

GPT-2 (small) stacks 12 of these layers; GPT-3 stacks 96. Each layer sees the output of the previous one: the first layer operates on the raw position-aware embeddings, and by layer 96 the representations have been refined 96 times.
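The wiring of one layer can be sketched end to end in NumPy. This is a simplified, illustrative version, not the exact setup from the series: a single attention head whose W_V maps straight back to d_model (so the W_O projection is folded in), biases omitted, and random weights used only to show that the shape is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, d_ff, seq_len = 6, 2, 3, 4

def layer_norm(x):
    # per-token: mean 0, std 1 across the feature dimension (gamma=1, beta=0)
    return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

def masked_attention(x, W_Q, W_K, W_V):
    # single-head causal self-attention; W_V maps straight back to d_model,
    # so the W_O projection is folded in (a simplification of the two-head setup)
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)          # block future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2                   # ReLU(x W1) W2, biases = 0

def decoder_layer(x, W_Q, W_K, W_V, W1, W2):
    x1 = layer_norm(x + masked_attention(x, W_Q, W_K, W_V))   # sub-layer 1
    x2 = layer_norm(x1 + ffn(x1, W1, W2))                     # sub-layer 2
    return x2

W_Q, W_K = rng.normal(size=(d_model, d_k)), rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_model))
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(seq_len, d_model))
out = decoder_layer(x, W_Q, W_K, W_V, W1, W2)
print(out.shape)   # (4, 6): one decoder layer preserves the shape
```

Every operation in the rest of this article is one line of this sketch, expanded with concrete numbers.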


2. Step 1 — Masked Self-Attention

We computed this in Part 4. The input is X', the output is:

\[ \text{Output}_{\text{masked}} = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Frank} & 2.031 & 1.303 & 0.479 \\ \text{Emma} & 2.001 & 1.301 & 0.491 \\ \text{Daniel} & 1.540 & 1.302 & 0.912 \\ \text{Alex} & 2.030 & 1.303 & 0.481 \end{bmatrix} \]

Each row is a weighted blend of what that person was allowed to hear — only from people who spoke before them. Frank heard only himself. Alex heard mostly Frank.

In a real transformer, the attention sub-layer ends with the W_O projection, so its output has the same dimension as the input: d_model. Here our attention output has 3 dimensions while X' has 6. We treat the attention output as already projected back to 6 dimensions before the residual step, exactly as W_O does in Part 2.


3. Step 2 — Residual connection

Before we add the residual, let's trace every transformation Alex has been through — from raw features all the way through multi-head masked attention to the W_O projection:

Step 0
Raw X — original feature profile
[0.98, 0.95, 0.12, 0.97, 0.15, 0.08]
(spicy, cheap, cozy, fast, flavor, vibe)
Step 1
X′ = X + P₄ — after positional encoding
[0.223, 0.296, 0.305, 1.953, 0.159, 1.080]
Step 2
Q¹ = X′ W_Q¹ — Head 1 query
[2.589, 1.589]
(street-stall, upscale)
Step 3
K¹ = X′ W_K¹ — Head 1 key
[2.565, 1.577]
(knows-street-stall, knows-upscale)
Step 4
V¹ = X′ W_V¹ — Head 1 value
[2.290, 0.798, 1.635]
(info-sichuan, info-budget, info-upscale)
Step 5
Q² = X′ W_Q² — Head 2 query
[2.398, 1.739]
(budget₁, budget₂)
Step 6
K² = X′ W_K² — Head 2 key
[2.418, 1.717]
(knows_low, knows_high)
Step 7
V² = X′ W_V² — Head 2 value
[2.448, 1.750]
(price_info, exp_info)
Step 8
Head 1: S¹ → Mask → softmax → A¹
S¹_Alex = Q¹_Alex · K¹ᵀ / √2
→ Frank: (2.589×5.271 + 1.589×1.703) / √2 = 16.352 / 1.414 = 11.563
→ Emma: (2.589×2.789 + 1.589×2.538) / √2 = 11.254 / 1.414 = 7.956
→ Daniel: (2.589×0.931 + 1.589×3.908) / √2 = 8.620 / 1.414 = 6.096
→ Alex: (2.589×2.565 + 1.589×1.577) / √2 = 9.148 / 1.414 = 6.469
S¹_Alex = [11.563, 7.956, 6.096, 6.469]

After mask — Alex is pos=4, nothing blocked:
e^11.563=104800, e^7.956=2851, e^6.096=442, e^6.469=645, sum=108738
A¹_Alex = [0.964, 0.026, 0.004, 0.006] — 96.4% Frank
Step 9
Head 2: S² → Mask → softmax → A²
S² = [10.839, 7.587, 6.159, 6.210]
A² = [0.945, 0.037, 0.009, 0.009] — 94.5% Frank
Step 10
Output¹ = A¹ · V¹ — Head 1 output
0.964×[3.963,2.121,1.743] + 0.026×[2.308,1.114,2.427] + 0.004×[1.626,0.577,3.115] + 0.006×[2.290,0.798,1.635]
[3.901, 2.081, 1.766]
Step 11
Output² = A² · V² — Head 2 output
[4.989, 1.895]
Step 12
Concatenate both heads
[3.901, 2.081, 1.766, 4.989, 1.895]
Step 13
Concat · W_O — project back to 6-dim
[4.814, 7.463, 2.895, 4.667, 4.981, 4.371]
(spicy, cheap, cozy, fast, flavor, vibe)
Step 14
Residual: X′ + Attention Output
[0.223+4.814, 0.296+7.463, 0.305+2.895, 1.953+4.667, 0.159+4.981, 1.080+4.371]
= [5.037, 7.759, 3.200, 6.620, 5.140, 5.451]

This trace also shows why the residual connection matters: X'_Alex = [0.223, 0.296, 0.305, 1.953, 0.159, 1.080] on its own is small and position-distorted, while the attention output [4.814, 7.463, 2.895, 4.667, 4.981, 4.371] carries what Alex learned from the conversation. Adding them together preserves both: Alex's original identity and the new knowledge he gathered.

After masked attention, we add the original input back:

\[ x_1^{\text{pre-norm}} = X' + \text{Output}_{\text{masked}} \]

This is the residual connection — also called a skip connection. The input bypasses the sub-layer and gets added directly to its output.

Two reasons this matters:

Gradient flow. In a deep network, gradients can shrink to near-zero as they backpropagate through many layers — the vanishing gradient problem. The residual connection creates a shortcut: gradients can flow directly through the addition, bypassing the sub-layer entirely. This is what makes 96-layer networks trainable.

Preserving what you knew. Attention rewrites each person's representation based on what they heard. But not everything needs to be rewritten. The residual connection says: keep your original features, and add whatever attention taught you on top. You listened to Frank, but you didn't forget who you are.

For Alex:

\[ X'_{\text{Alex}} = [0.223,\ 0.296,\ 0.305,\ 1.953,\ 0.159,\ 1.080] \] \[ \text{Attention Output}_{\text{Alex}} = [4.814,\ 7.463,\ 2.895,\ 4.667,\ 4.981,\ 4.371] \] \[ x_{1,\text{Alex}}^{\text{pre-norm}} = [5.037,\ 7.759,\ 3.200,\ 6.620,\ 5.140,\ 5.451] \]

Alex's vector now contains both his original positional features and what he learned from multi-head masked attention. The addition is literal — the numbers stack.
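The addition really is element-wise, as a two-line NumPy check with Alex's numbers confirms:

```python
import numpy as np

x_prime_alex  = np.array([0.223, 0.296, 0.305, 1.953, 0.159, 1.080])  # X'_Alex
attn_out_alex = np.array([4.814, 7.463, 2.895, 4.667, 4.981, 4.371])  # after W_O

# the residual connection is literal element-wise addition
x1_pre_norm = x_prime_alex + attn_out_alex
print(x1_pre_norm)   # recovers Alex's pre-norm row from the text
```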

For all four people:

\[ x_1^{\text{pre-norm}} = \begin{bmatrix} & d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \text{Frank} & 6.709 & 9.135 & 3.058 & 6.709 & 5.153 & 5.408 \\ \text{Emma} & 5.872 & 8.096 & 3.927 & 5.792 & 5.128 & 5.531 \\ \text{Daniel} & 4.189 & 2.655 & 4.089 & 4.889 & 6.817 & 8.093 \\ \text{Alex} & 5.037 & 7.759 & 3.200 & 6.620 & 5.140 & 5.451 \end{bmatrix} \]

4. Step 3 — Layer Normalization

After the residual addition, we normalize each token's vector independently:

\[ \text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta \]

where \(\mu\) is the mean of the vector, \(\sigma\) is the standard deviation, and \(\gamma, \beta\) are learned scale and shift parameters (initialized to 1 and 0). In practice a small \(\epsilon\) is added under the square root to avoid division by zero; for simplicity we set \(\gamma = 1, \beta = 0\) and omit \(\epsilon\).

Why normalize?

After the residual addition, each token's vector can have values at very different scales. Consider Alex's vector before normalization:

\[ x_{1,\text{Alex}}^{\text{pre-norm}} = [5.037,\ 7.759,\ 3.200,\ 6.620,\ 5.140,\ 5.451] \]

d₁ = 7.759 is the largest, d₂ = 3.200 is the smallest, a difference of nearly 2.4×. Now send this into FFN's first layer x @ W₁. The contribution of each dimension is its value multiplied by its weight. No matter what W₁ learns, d₁ will always contribute roughly 2.4× more than d₂ simply because its value is roughly 2.4× larger. The weights can't fix a numerical imbalance of that size. d₂'s information gets drowned out, not because it's unimportant, but because its value is small.

This is the problem: the magnitude of a value is overriding the significance of a weight. LayerNorm fixes this by pulling all dimensions to the same scale, so W₁'s weights — not the raw values — decide which dimensions matter.
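A quick numerical check of that claim, assuming for illustration that W₁ gave every dimension the same weight:

```python
import numpy as np

x = np.array([5.037, 7.759, 3.200, 6.620, 5.140, 5.451])  # Alex, pre-norm
w = np.full(6, 0.5)   # hypothetical: equal weight for every dimension

contrib = x * w                     # per-dimension contribution to one neuron
ratio = contrib[1] / contrib[2]     # ~2.42: d1 drowns out d2 despite equal weights

x_norm = (x - x.mean()) / x.std()   # LayerNorm (gamma=1, beta=0)
ratio_norm = abs(x_norm[1] * 0.5) / abs(x_norm[2] * 0.5)   # ~0.95: comparable
```

After normalization the two contributions are nearly the same size, so the learned weights, not the raw magnitudes, decide who matters.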

How it works — Alex's full calculation:

Step 1 — Compute mean:

\[ \mu = \frac{5.037+7.759+3.200+6.620+5.140+5.451}{6} = \frac{33.207}{6} = 5.535 \]

Step 2 — Subtract mean from each dimension:

\[ [-0.498,\ 2.224,\ -2.335,\ 1.085,\ -0.395,\ -0.084] \]

All values now centered at 0. But 2.224 is still much larger in magnitude than −0.084.

Step 3 — Square each deviation:

\[ [0.248,\ 4.946,\ 5.452,\ 1.177,\ 0.156,\ 0.007] \]

Step 4 — Average the squares (variance):

\[ \sigma^2 = \frac{0.248+4.946+5.452+1.177+0.156+0.007}{6} = \frac{11.986}{6} = 1.998 \]

Step 5 — Take the square root (standard deviation):

\[ \sigma = \sqrt{1.998} = 1.413 \]

The standard deviation 1.413 is the average spread of these six values — a single ruler that describes how much they vary.

Step 6 — Divide each deviation by σ:

\[ \frac{[-0.498,\ 2.224,\ -2.335,\ 1.085,\ -0.395,\ -0.084]}{1.413} = [-0.352,\ 1.574,\ -1.652,\ 0.768,\ -0.279,\ -0.059] \]

Every dimension is now expressed in units of "how many standard deviations from the mean." The relative ordering is unchanged — d₁ is still the highest, d₂ is still the lowest — but the absolute magnitudes are on a common scale. W₁'s weights can finally do their job fairly.

Why per-token? LayerNorm normalizes across the feature dimension — one token's values across all dimensions. BatchNorm normalizes across the batch dimension. For language models, LayerNorm is preferred because each token can be processed independently, regardless of batch size.

For all four people after LayerNorm:

\[ x_1 = \begin{bmatrix} & d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \text{Frank} & 0.368 & 1.678 & -1.605 & 0.368 & -0.473 & -0.335 \\ \text{Emma} & 0.119 & 1.908 & -1.446 & 0.054 & -0.480 & -0.156 \\ \text{Daniel} & -0.514 & -1.359 & -0.569 & -0.128 & 0.934 & 1.636 \\ \text{Alex} & -0.352 & 1.574 & -1.652 & 0.768 & -0.279 & -0.059 \end{bmatrix} \]

Each row now has mean 0 and standard deviation 1. Ready for FFN.
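The six steps above condense into a few lines of NumPy (population standard deviation, γ = 1, β = 0, no ε, matching the hand calculation):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0):
    mu = x.mean(axis=-1, keepdims=True)      # per-token mean
    sigma = x.std(axis=-1, keepdims=True)    # population std, as in the text
    return (x - mu) / sigma * gamma + beta

alex = np.array([5.037, 7.759, 3.200, 6.620, 5.140, 5.451])
print(np.round(layer_norm(alex), 3))
# [-0.352  1.574 -1.652  0.768 -0.279 -0.059]
```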


5. Step 4 — Feed-Forward Network (MLP)

The second sub-layer is a Feed-Forward Network — two linear transformations with a ReLU activation in between:

\[ \text{FFN}(x) = \text{ReLU}(x W_1 + b_1)\, W_2 + b_2 \]

Unlike attention, FFN processes each token independently. Every row of x₁ goes through the exact same two weight matrices, but separately — no token looks at any other token here.

What are b₁ and b₂? They are bias terms — one per neuron, added after the linear transformation. W₁ scales and rotates the input, but it always passes through the origin: if the input is a zero vector, the output is also zero. b₁ adds a fixed offset to each neuron, shifting the point at which ReLU activates. Without b₁, a neuron only fires when xW₁ > 0. With b₁, the threshold becomes xW₁ > −b₁ — the model can learn how easily each neuron triggers. b₂ does the same after the second layer. In our example we set b₁ = 0 and b₂ = 0 to keep the arithmetic clean and focus on the core mechanics. In a real trained model both are non-zero learned parameters.
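A toy one-neuron example (the numbers here are made up purely for illustration) shows how the bias shifts the ReLU threshold:

```python
import numpy as np

def neuron(x, w, b=0.0):
    return max(float(x @ w) + b, 0.0)   # one ReLU neuron

w = np.array([1.0, 1.0])
x = np.array([-0.3, 0.1])               # x @ w = -0.2

print(neuron(x, w))                     # 0.0: without a bias, the neuron is silent
print(round(neuron(x, w, b=0.5), 3))    # 0.3: a positive bias lowers the threshold
```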

Attention is the conversation — each person gathers information from others. FFN is the thinking that happens after — each person privately processes what they've collected and refines their own representation. No more communication, just computation.

In real transformers, FFN expands the dimension by 4× before compressing back: d_model → 4×d_model → d_model. For our example we use 6 → 3 → 6, a compression rather than an expansion, purely to keep the numbers manageable.

W₁ and W₂ are FFN-exclusive weight matrices — they appear here for the first time in this series. They have nothing to do with W_Q, W_K, W_V, or W_O from the attention layers. Every sub-layer in a transformer has its own independent set of weights, all learned separately during training.

\[ W_1 = \begin{bmatrix} 0.8 & 0.1 & 0.3 \\ 0.2 & 0.7 & 0.1 \\ 0.1 & 0.2 & 0.8 \\ 0.9 & 0.1 & 0.2 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.1 & 0.7 \end{bmatrix} \in \mathbb{R}^{6\times3} \qquad W_2 = \begin{bmatrix} 0.7 & 0.2 & 0.1 & 0.8 & 0.1 & 0.2 \\ 0.1 & 0.8 & 0.2 & 0.1 & 0.7 & 0.1 \\ 0.2 & 0.1 & 0.7 & 0.2 & 0.1 & 0.8 \end{bmatrix} \in \mathbb{R}^{3\times6} \]

Computing FFN for Alex step by step:

First layer — x₁ @ W₁:

\[ \begin{aligned} h_1 &= (-0.352)\times0.8 + 1.574\times0.2 + (-1.652)\times0.1 + 0.768\times0.9 + (-0.279)\times0.1 + (-0.059)\times0.2 \\ &= -0.282+0.315-0.165+0.691-0.028-0.012 = 0.520 \\[4pt] h_2 &= (-0.352)\times0.1 + 1.574\times0.7 + (-1.652)\times0.2 + 0.768\times0.1 + (-0.279)\times0.8 + (-0.059)\times0.1 \\ &= -0.035+1.102-0.330+0.077-0.223-0.006 = 0.584 \\[4pt] h_3 &= (-0.352)\times0.3 + 1.574\times0.1 + (-1.652)\times0.8 + 0.768\times0.2 + (-0.279)\times0.1 + (-0.059)\times0.7 \\ &= -0.106+0.157-1.322+0.154-0.028-0.041 = -1.185 \end{aligned} \]

Apply ReLU — set negative values to 0:

\[ \text{ReLU}([0.520,\ 0.584,\ -1.185]) = [0.520,\ 0.584,\ 0] \]

Second layer — ReLU(h) @ W₂:

\[ \begin{aligned} \text{FFN}_{\text{Alex}, d_0} &= 0.520\times0.7 + 0.584\times0.1 + 0\times0.2 = 0.364+0.058 = 0.422 \\ \text{FFN}_{\text{Alex}, d_1} &= 0.520\times0.2 + 0.584\times0.8 + 0\times0.1 = 0.104+0.467 = 0.571 \\ \text{FFN}_{\text{Alex}, d_2} &= 0.520\times0.1 + 0.584\times0.2 + 0\times0.7 = 0.052+0.117 = 0.169 \\ \text{FFN}_{\text{Alex}, d_3} &= 0.520\times0.8 + 0.584\times0.1 + 0\times0.2 = 0.416+0.058 = 0.474 \\ \text{FFN}_{\text{Alex}, d_4} &= 0.520\times0.1 + 0.584\times0.7 + 0\times0.1 = 0.052+0.409 = 0.461 \\ \text{FFN}_{\text{Alex}, d_5} &= 0.520\times0.2 + 0.584\times0.1 + 0\times0.8 = 0.104+0.058 = 0.162 \end{aligned} \]

FFN output for all four people:

\[ \text{FFN}(x_1) = \begin{bmatrix} & d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \text{Frank} & 0.430 & 0.536 & 0.163 & 0.478 & 0.432 & 0.155 \\ \text{Emma} & 0.204 & 0.661 & 0.192 & 0.228 & 0.553 & 0.119 \\ \text{Daniel} & 0.228 & 0.059 & 0.205 & 0.183 & 0.050 & 0.248 \\ \text{Alex} & 0.422 & 0.571 & 0.169 & 0.474 & 0.461 & 0.162 \end{bmatrix} \]

ReLU kept h₁ and h₂ active for Alex (both positive), zeroing only h₃. Different tokens activate different hidden dimensions; this sparsity is part of what ReLU contributes to the network's expressiveness.
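The same computation in NumPy, using the W₁ and W₂ defined above with biases set to zero, reproduces Alex's row:

```python
import numpy as np

W1 = np.array([[0.8, 0.1, 0.3],
               [0.2, 0.7, 0.1],
               [0.1, 0.2, 0.8],
               [0.9, 0.1, 0.2],
               [0.1, 0.8, 0.1],
               [0.2, 0.1, 0.7]])
W2 = np.array([[0.7, 0.2, 0.1, 0.8, 0.1, 0.2],
               [0.1, 0.8, 0.2, 0.1, 0.7, 0.1],
               [0.2, 0.1, 0.7, 0.2, 0.1, 0.8]])

def ffn(x):
    h = np.maximum(x @ W1, 0)   # first layer + ReLU (b1 = 0)
    return h @ W2               # second layer (b2 = 0)

alex = np.array([-0.352, 1.574, -1.652, 0.768, -0.279, -0.059])
print(np.round(ffn(alex), 3))
# [0.422 0.571 0.169 0.474 0.461 0.162]
```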


6. Step 5 — Residual + LayerNorm again

The same pattern repeats after FFN: add the input back, then normalize.

\[ x_2^{\text{pre-norm}} = x_1 + \text{FFN}(x_1) \]

For Alex:

\[ \begin{aligned} x_{1,\text{Alex}} &= [-0.352,\ 1.574,\ -1.652,\ 0.768,\ -0.279,\ -0.059] \\ \text{FFN}(x_{1,\text{Alex}}) &= [0.422,\ 0.571,\ 0.169,\ 0.474,\ 0.461,\ 0.162] \\ x_{2,\text{Alex}}^{\text{pre-norm}} &= [0.070,\ 2.145,\ -1.483,\ 1.242,\ 0.182,\ 0.103] \end{aligned} \]

For all four people:

\[ x_2^{\text{pre-norm}} = \begin{bmatrix} & d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \text{Frank} & 0.798 & 2.214 & -1.442 & 0.846 & -0.041 & -0.180 \\ \text{Emma} & 0.323 & 2.569 & -1.254 & 0.282 & -0.073 & -0.037 \\ \text{Daniel} & -0.286 & -1.300 & -0.364 & 0.055 & 0.984 & 1.884 \\ \text{Alex} & 0.070 & 2.145 & -1.483 & 1.242 & 0.182 & 0.103 \end{bmatrix} \]

LayerNorm for Alex:

\[ \begin{aligned} \mu &= \frac{0.070+2.145+(-1.483)+1.242+0.182+0.103}{6} = \frac{2.259}{6} = 0.377 \\[6pt] \text{deviations} &= [-0.307,\ 1.768,\ -1.860,\ 0.865,\ -0.195,\ -0.274] \\ \sigma &= \sqrt{\frac{0.094+3.126+3.460+0.748+0.038+0.075}{6}} = \sqrt{1.257} = 1.121 \\[6pt] x_{2,\text{Alex}} &= \frac{[-0.307,\ 1.768,\ -1.860,\ 0.865,\ -0.195,\ -0.274]}{1.121} \\ &= [-0.274,\ 1.577,\ -1.659,\ 0.772,\ -0.174,\ -0.244] \end{aligned} \]

Final output of one complete decoder layer:

\[ x_2 = \begin{bmatrix} & d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \text{Frank} & 0.313 & 1.424 & -1.597 & 0.389 & -0.494 & -0.536 \\ \text{Emma} & -0.118 & 1.888 & -1.550 & -0.203 & -0.453 & -0.259 \\ \text{Daniel} & -0.416 & -1.479 & -0.729 & -0.187 & 0.743 & 2.068 \\ \text{Alex} & -0.274 & 1.577 & -1.659 & 0.771 & -0.174 & -0.244 \end{bmatrix} \]

Compare with X' that entered this layer:

\[ X' = \begin{bmatrix} & d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\ \text{Frank} & 1.821 & 1.490 & 0.167 & 1.969 & 0.152 & 1.080 \\ \text{Emma} & 1.019 & 0.544 & 1.033 & 1.086 & 0.134 & 1.180 \\ \text{Daniel} & 0.281 & -0.820 & 1.059 & 1.100 & 0.966 & 1.950 \\ \text{Alex} & 0.223 & 0.296 & 0.305 & 1.953 & 0.159 & 1.080 \end{bmatrix} \]

The structure has shifted significantly. d₁ now dominates for Frank, Emma, and Alex, while Daniel's representation is pulled toward d₅, patterns that emerged from the combination of multi-head attention, residual connections, and FFN processing. The layer has done real work.
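The whole second sub-layer, residual plus LayerNorm, fits in a few lines. Using Alex's rounded x₁ and FFN rows from above, the result matches the x₂ table up to small rounding differences:

```python
import numpy as np

def layer_norm(x):
    return (x - x.mean()) / x.std()   # gamma=1, beta=0, as in the text

x1_alex  = np.array([-0.352, 1.574, -1.652, 0.768, -0.279, -0.059])
ffn_alex = np.array([ 0.422, 0.571,  0.169, 0.474,  0.461,  0.162])

x2_alex = layer_norm(x1_alex + ffn_alex)   # residual, then normalize
# close to Alex's x_2 row; tiny differences come from rounding the inputs
```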


7. Stacking layers

x₂ is the output of one decoder layer. In GPT, this becomes the input to the next decoder layer — which runs the exact same sequence of operations.


\[ x^{(0)} \xrightarrow{\text{layer 1}} x^{(1)} \xrightarrow{\text{layer 2}} x^{(2)} \xrightarrow{\ \cdots\ } x^{(L)} \]

where \(x^{(0)} = X'\) and L is the total number of layers.

A few things are worth keeping straight about the stack:

The weight matrices are different in every layer. W_Q, W_K, W_V, W_O, W₁, W₂ are all learned separately per layer. Layer 1's attention weights have nothing to do with layer 12's.

The shape stays the same. Every layer takes d_model dimensions in and produces d_model dimensions out. The architecture is a clean pipeline.

LayerNorm resets the scale after every sub-layer. No matter how large the values get inside a layer, the output is always normalized before passing forward.

Why does stacking help? A single attention layer can capture one level of relationships — who aligns with whom on the surface features. A second layer can capture relationships between those relationships — patterns that only emerge once the first layer has done its work. Deep networks learn hierarchical structure, and language has exactly that: words → phrases → sentences → discourse.
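The stacking loop itself is trivial; what matters is that each layer owns its own independent weights while the shape never changes. A sketch with only the FFN sub-layer (attention omitted for brevity, random weights purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff, seq_len, L = 6, 3, 4, 12   # GPT-2-small depth, toy widths

def layer_norm(x):
    return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

def make_layer():
    # each call draws fresh weights: every layer is learned separately
    W1 = rng.normal(size=(d_model, d_ff))
    W2 = rng.normal(size=(d_ff, d_model))
    return lambda x: layer_norm(x + np.maximum(x @ W1, 0) @ W2)

layers = [make_layer() for _ in range(L)]

x = rng.normal(size=(seq_len, d_model))   # x^(0) = X'
for layer in layers:                      # x^(0) -> x^(1) -> ... -> x^(L)
    x = layer(x)
print(x.shape)   # (4, 6): the shape is preserved through every layer
```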


8. The full picture

One complete GPT decoder layer, from input to output:

X'
Input — position-aware features
X' = X + P from Part 3 — each row carries both content and position
Att
Masked Self-Attention
Each token attends to itself and all prior tokens — Part 4
+
Residual connection — x + Attention(x)
Add the original input back — preserve what was known, stack what was learned
LN
Layer Normalization
Normalize each token's vector to mean 0, std 1 — reset the scale
FFN
Feed-Forward Network (MLP)
Each token thinks alone — ReLU(x W₁) W₂ — expand, activate, compress
+
Residual connection — x₁ + FFN(x₁)
Add again — preserve the normalized features, stack the FFN refinement
LN
Layer Normalization again
Normalize once more — x₂ is ready for the next layer
×L
Repeat L times
GPT-2: 12 layers. GPT-3: 96 layers. Each refines the representations further.

Attention gathers. Residual preserves. LayerNorm stabilizes. FFN refines. Then repeat.

Frank spoke first and anchored the conversation. By the time x₂ emerges from layer 1, his influence has been gathered by attention, stabilized by LayerNorm, and further processed by FFN. In layer 2, the same thing happens again — but now operating on richer representations. By layer 96, the model has refined its understanding of the sequence 96 times. That depth is where the intelligence lives.