Masked Attention — Only Look Back

GPT generates one word at a time. Each word can only see what came before it — never what comes next.


1. The rule — only look back

Four people are seated in order, each about to make a restaurant recommendation. The rule is simple: you can only draw from people who spoke before you. No sneaking a look at what Daniel said before Emma has finished. No asking Alex for his opinion before he's had his turn.

Person | Position | Can consult
------ | -------- | -----------
Frank  | pos = 1  | only himself
Emma   | pos = 2  | Frank, herself
Daniel | pos = 3  | Frank, Emma, himself
Alex   | pos = 4  | Frank, Emma, Daniel, himself

This is not an arbitrary restriction. GPT generates one token at a time — left to right. When it is predicting the 4th word, the 5th word does not exist yet. Masked attention enforces this reality during training, so the model never learns to rely on information it will not have at generation time.

The attention matrix that results looks like this — a lower triangle of allowed connections, everything above the diagonal blocked:

\[ \text{allowed} = \begin{bmatrix} & \text{Frank} & \text{Emma} & \text{Daniel} & \text{Alex} \\ \text{Frank} & \checkmark & & & \\ \text{Emma} & \checkmark & \checkmark & & \\ \text{Daniel} & \checkmark & \checkmark & \checkmark & \\ \text{Alex} & \checkmark & \checkmark & \checkmark & \checkmark \end{bmatrix} \]

Every blank cell is a conversation that cannot happen. Emma cannot hear Daniel's opinion. Frank cannot hear anyone's but his own.
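The lower-triangle rule is easy to see in code. A minimal sketch using numpy (the names are from the example above; nothing else is assumed):

```python
import numpy as np

# allowed[i, j] is True when position i may consult position j.
# np.tril keeps the lower triangle (including the diagonal) and
# zeroes everything above it -- exactly the "only look back" rule.
n = 4
allowed = np.tril(np.ones((n, n), dtype=bool))

names = ["Frank", "Emma", "Daniel", "Alex"]
for i, name in enumerate(names):
    visible = [names[j] for j in range(n) if allowed[i, j]]
    print(f"{name} can consult: {visible}")
```

Frank's row has a single True; Alex's row is all True, matching the table above.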


2. The mask — turning off the future

After computing S = Q'K'ᵀ/√d, we have a 4×4 matrix of alignment scores. Every cell has a value — including the ones representing future conversations that should not happen. We kill those cells with a mask matrix M:

\[ M = \begin{bmatrix} & \text{Frank} & \text{Emma} & \text{Daniel} & \text{Alex} \\ \text{Frank} & 0 & -\infty & -\infty & -\infty \\ \text{Emma} & 0 & 0 & -\infty & -\infty \\ \text{Daniel} & 0 & 0 & 0 & -\infty \\ \text{Alex} & 0 & 0 & 0 & 0 \end{bmatrix} \]

We add M to S element-wise. The allowed cells get +0 — unchanged. The blocked cells get +(−∞) — destroyed. Why −∞? Because of what softmax does to it:

\[ \text{softmax}(-\infty) = \frac{e^{-\infty}}{\sum e^{x_j}} = \frac{0}{\sum e^{x_j}} = 0 \]

A score of −∞ produces an attention weight of exactly 0. The blocked person contributes nothing to the output — not a little, not a tiny fraction. Zero.

In practice −∞ is approximated by a very large negative number like −1e9, but the effect is the same: after softmax, those positions vanish completely.
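Both facts — the mask's shape and the vanishing act — can be verified in a few lines. A sketch using numpy, with −1e9 standing in for −∞ as described above (the score matrix here is random, just to show the structure survives any input):

```python
import numpy as np

# Causal mask: 0 on and below the diagonal, a huge negative
# number above it (k=1 shifts the triangle off the diagonal).
n = 4
M = np.triu(np.full((n, n), -1e9), k=1)

rng = np.random.default_rng(0)
S = rng.normal(size=(n, n))          # any score matrix
masked = S + M

# Row-wise softmax (subtracting the row max for numerical stability).
e = np.exp(masked - masked.max(axis=1, keepdims=True))
weights = e / e.sum(axis=1, keepdims=True)

print(weights.round(3))              # upper triangle is all zeros
```

Whatever values S held, every masked position comes out of softmax as exactly 0 and each row renormalizes over the allowed positions alone.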


3. Applying the mask — S + M → softmax

We use X' from Part 3 and the same W_Q, W_K from Part 1. First, Q' and K':

\[ Q' = \begin{bmatrix} & \text{street-stall} & \text{upscale} \\ \text{Frank} & 5.319 & 1.716 \\ \text{Emma} & 1.293 & 1.210 \\ \text{Daniel} & 0.742 & 1.921 \\ \text{Alex} & 2.589 & 1.589 \end{bmatrix} \qquad K' = \begin{bmatrix} & \text{knows-street-stall} & \text{knows-upscale} \\ \text{Frank} & 5.271 & 1.703 \\ \text{Emma} & 1.260 & 1.338 \\ \text{Daniel} & 0.701 & 2.796 \\ \text{Alex} & 2.565 & 1.577 \end{bmatrix} \]

Computing S = Q'K'ᵀ / √2:

\[ S = \begin{bmatrix} & \text{Frank} & \text{Emma} & \text{Daniel} & \text{Alex} \\ \text{Frank} & 21.888 & 6.362 & 6.030 & 11.557 \\ \text{Emma} & 6.276 & 2.297 & 3.032 & 3.695 \\ \text{Daniel} & 5.079 & 2.479 & 4.165 & 3.489 \\ \text{Alex} & 11.559 & 3.810 & 4.426 & 6.468 \end{bmatrix} \]

Adding M — future positions become −∞:

\[ S + M = \begin{bmatrix} & \text{Frank} & \text{Emma} & \text{Daniel} & \text{Alex} \\ \text{Frank} & 21.888 & -\infty & -\infty & -\infty \\ \text{Emma} & 6.276 & 2.297 & -\infty & -\infty \\ \text{Daniel} & 5.079 & 2.479 & 4.165 & -\infty \\ \text{Alex} & 11.559 & 3.810 & 4.426 & 6.468 \end{bmatrix} \]

Applying softmax to each row:

\[ \text{softmax}([21.888, -\infty, -\infty, -\infty]) \approx [1.000,\ 0.000,\ 0.000,\ 0.000] \]
\[ \text{softmax}([6.276, 2.297, -\infty, -\infty]) = \frac{[531.7,\ 9.94]}{541.6} \approx [0.982,\ 0.018,\ 0.000,\ 0.000] \]
\[ \text{softmax}([5.079, 2.479, 4.165, -\infty]) = \frac{[160.8,\ 11.93,\ 64.34]}{237.1} \approx [0.678,\ 0.050,\ 0.271,\ 0.000] \]
\[ \text{softmax}([11.559, 3.810, 4.426, 6.468]) = \frac{[104800,\ 45.15,\ 83.53,\ 643.4]}{105572} \approx [0.993,\ 0.000,\ 0.001,\ 0.006] \]
\[ A_{\text{masked}} = \begin{bmatrix} & \text{Frank} & \text{Emma} & \text{Daniel} & \text{Alex} \\ \text{Frank} & 1.000 & 0.000 & 0.000 & 0.000 \\ \text{Emma} & 0.982 & 0.018 & 0.000 & 0.000 \\ \text{Daniel} & 0.678 & 0.050 & 0.271 & 0.000 \\ \text{Alex} & 0.993 & 0.000 & 0.001 & 0.006 \end{bmatrix} \]

Frank — 100% self-attention. He has no one else to consult.
Emma — 98.2% Frank, 1.8% herself. Frank spoke first and his strong street-stall signal dominates.
Daniel — 67.8% Frank, 27.1% himself, 5.0% Emma. The first sign of genuine diversity — Daniel's own upscale pull starts to show.
Alex — 99.3% Frank, tiny contributions from Daniel and himself. Frank's positional boost makes him overwhelmingly dominant for everyone who comes after.
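These rows can be checked numerically. A sketch using numpy, with S copied from the matrix above and the mask built with true −∞ (numpy handles exp(−∞) = 0 directly):

```python
import numpy as np

# Score matrix S from the worked example above.
S = np.array([
    [21.888, 6.362, 6.030, 11.557],
    [ 6.276, 2.297, 3.032,  3.695],
    [ 5.079, 2.479, 4.165,  3.489],
    [11.559, 3.810, 4.426,  6.468],
])

# Causal mask: -inf strictly above the diagonal.
M = np.triu(np.full((4, 4), -np.inf), k=1)
masked = S + M

# Row-wise softmax; subtracting the row max keeps exp() in range.
e = np.exp(masked - masked.max(axis=1, keepdims=True))
A = e / e.sum(axis=1, keepdims=True)

print(A.round(3))
```

Rounded to three decimals, the result matches the A_masked matrix above: Emma's row is [0.982, 0.018, 0, 0], Daniel's is [0.678, 0.050, 0.271, 0].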


4. Output — what each person actually hears

V' = X' @ W_V gives us the content each person carries. These values use X' — the position-encoded input — matching the computation in Part 3:

\[ V' = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Frank} & 3.963 & 2.121 & 1.743 \\ \text{Emma} & 2.308 & 1.114 & 2.427 \\ \text{Daniel} & 1.626 & 0.577 & 3.115 \\ \text{Alex} & 2.290 & 0.798 & 1.635 \end{bmatrix} \]

Output = A_masked @ V':

\[ \begin{aligned} \text{Frank}_{\text{info-sichuan}} &= 1.000\times3.963 = 3.963 \\ \text{Frank}_{\text{info-budget}} &= 1.000\times2.121 = 2.121 \\ \text{Frank}_{\text{info-upscale}} &= 1.000\times1.743 = 1.743 \\[6pt] \text{Emma}_{\text{info-sichuan}} &= 0.982\times3.963 + 0.018\times2.308 = 3.892 + 0.042 = 3.934 \\ \text{Emma}_{\text{info-budget}} &= 0.982\times2.121 + 0.018\times1.114 = 2.083 + 0.020 = 2.103 \\ \text{Emma}_{\text{info-upscale}} &= 0.982\times1.743 + 0.018\times2.427 = 1.711 + 0.044 = 1.755 \\[6pt] \text{Daniel}_{\text{info-sichuan}} &= 0.678\times3.963 + 0.050\times2.308 + 0.271\times1.626 = 2.687 + 0.115 + 0.441 = 3.243 \\ \text{Daniel}_{\text{info-budget}} &= 0.678\times2.121 + 0.050\times1.114 + 0.271\times0.577 = 1.438 + 0.056 + 0.156 = 1.650 \\ \text{Daniel}_{\text{info-upscale}} &= 0.678\times1.743 + 0.050\times2.427 + 0.271\times3.115 = 1.182 + 0.121 + 0.844 = 2.147 \\[6pt] \text{Alex}_{\text{info-sichuan}} &= 0.993\times3.963 + 0.000\times2.308 + 0.001\times1.626 + 0.006\times2.290 = 3.935 + 0 + 0.002 + 0.014 = 3.951 \\ \text{Alex}_{\text{info-budget}} &= 0.993\times2.121 + 0.000\times1.114 + 0.001\times0.577 + 0.006\times0.798 = 2.106 + 0 + 0.001 + 0.005 = 2.112 \\ \text{Alex}_{\text{info-upscale}} &= 0.993\times1.743 + 0.000\times2.427 + 0.001\times3.115 + 0.006\times1.635 = 1.731 + 0 + 0.003 + 0.010 = 1.744 \end{aligned} \]
\[ \text{Output}_{\text{masked}} = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Frank} & 3.963 & 2.121 & 1.743 \\ \text{Emma} & 3.934 & 2.103 & 1.755 \\ \text{Daniel} & 3.243 & 1.650 & 2.147 \\ \text{Alex} & 3.951 & 2.112 & 1.744 \end{bmatrix} \]

Frank [3.963, 2.121, 1.743] — pure self-reference. His output is exactly his own V'. He spoke into a void — no prior context to draw from.
Emma [3.934, 2.103, 1.755] — almost entirely Frank. She could only look back at Frank, and Frank's amplified street-stall signal dominates.
Daniel [3.243, 1.650, 2.147] — the first genuine blend. His info-upscale (2.147) is notably higher than Frank's (1.743) — Daniel's own upscale V' starting to show.
Alex [3.951, 2.112, 1.744] — almost entirely Frank again. Frank's positional boost makes him overwhelmingly dominant for everyone who comes after.
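The arithmetic above is a single matrix product. A sketch using numpy, with A_masked and V' copied from the tables (small last-digit differences against the hand-computed values come from the rounding of the intermediate products in the text):

```python
import numpy as np

# Masked attention weights and values from the tables above.
A = np.array([
    [1.000, 0.000, 0.000, 0.000],
    [0.982, 0.018, 0.000, 0.000],
    [0.678, 0.050, 0.271, 0.000],
    [0.993, 0.000, 0.001, 0.006],
])
V = np.array([
    [3.963, 2.121, 1.743],
    [2.308, 1.114, 2.427],
    [1.626, 0.577, 3.115],
    [2.290, 0.798, 1.635],
])

# Each output row is a weighted blend of the V' rows that
# person was allowed to hear.
output = A @ V
print(output.round(3))
```

Frank's output row is exactly his own V' row, since his attention weight on himself is 1.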

Who speaks first matters enormously. Frank's positional encoding gave him an outsized Q signal. Everyone who came after — Emma, Daniel, Alex — ended up drawing heavily from Frank's V, regardless of their own preferences. In real language models, the early tokens in a prompt carry disproportionate influence over the entire generation. It's why prompt engineering pays so much attention to how you open — the first few words shape everything that follows.


5. The full picture

Masked attention is regular attention with one addition: a triangular mask applied to the score matrix before softmax. Everything else — Q, K, V, the dot product, the scaling, the weighted sum — is identical.

1. X' — position-aware input. X' = X + P, from Part 3.
2. Q', K', V' — project X' through W_Q, W_K, W_V, same as Parts 1 and 2.
3. S = Q'K'ᵀ / √d — alignment scores. Full 4×4 matrix, every pair scored.
4. S + M — apply the mask. Future positions set to −∞; they will vanish in softmax.
5. softmax(S + M) → masked attention weights. −∞ becomes 0; each row sums to 1 over allowed positions only.
6. A_masked @ V' → output. Each person's output is a blend of only what they were allowed to hear.
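The whole pipeline fits in one short function. A minimal numpy sketch — random inputs and weights stand in for the actual X', W_Q, W_K, W_V from the earlier parts, since only the shape of the computation matters here:

```python
import numpy as np

def masked_attention(X, W_Q, W_K, W_V):
    """Project, score, mask, softmax, blend -- the six steps above."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # alignment scores
    mask = np.triu(np.full(S.shape, -np.inf), 1)   # block the future
    Z = S + mask
    e = np.exp(Z - Z.max(axis=1, keepdims=True))   # stable softmax
    A = e / e.sum(axis=1, keepdims=True)           # masked weights
    return A @ V, A

# Demo with 4 positions and small random weights (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W_Q, W_K = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
W_V = rng.normal(size=(3, 3))
out, A = masked_attention(X, W_Q, W_K, W_V)
```

Regardless of the inputs, A comes out lower-triangular with rows summing to 1 — the mask guarantees it.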

Masked attention doesn't change how attention works. It changes who attention is allowed to hear. The lower triangle stays open; the upper triangle is sealed. Frank speaks into silence. Alex speaks last and inherits everything — but only from those who came before.

In GPT, this mask is what makes generation possible. Without it, the model would learn to cheat during training — reading the answer before giving it. With it, every prediction is made honestly, from context alone.