Attention, But Make It Multi-Head

One question isn't enough. Alex, Emma, and Daniel ask two at once.


1. Why one head isn't enough

In Part 1, Alex asked one question: "Where can I get something spicy, fast, and cheap?"

The attention mechanism answered it well. It computed Q, matched it against K, and blended V accordingly. Alex ended up drawing mostly from his own knowledge; Daniel barely registered.

But consider what Alex actually needs to decide where to eat.

He has two concerns, not one. He cares about taste — spicy, fast, bold flavors. And he cares about budget — he's not going to a place Daniel would recommend, because Daniel's idea of dinner costs three times what Alex wants to spend.

A single attention head can learn only one projection: one W_Q, one W_K, one W_V, one set of alignment scores. If the head learns to align on taste, it ignores budget. If it aligns on budget, it loses taste.

This is the problem multi-head attention solves.

Instead of one head asking one question, multiple heads ask different questions simultaneously — each with its own W_Q, W_K, W_V — and their answers get combined at the end.

In our case: two heads, two questions.

Head 1 (taste and vibe): "Whose flavor profile and atmosphere preferences match mine?"
Head 2 (budget): "Whose spending range matches mine?"

2. Two heads, two questions

Each head is a complete, independent attention computation. The input X is the same — Alex, Emma, and Daniel's preference vectors — but each head projects it differently.

\[ X = \begin{bmatrix} & \text{spicy} & \text{cheap} & \text{cozy} & \text{fast} & \text{flavor} & \text{vibe} \\ \text{Alex} & 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ \text{Emma} & 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ \text{Daniel} & 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \qquad X \in \mathbb{R}^{3 \times 6} \]

Head 1 uses \(W_Q^1, W_K^1, W_V^1\) — learned weights that project X into taste and vibe space. Head 2 uses \(W_Q^2, W_K^2, W_V^2\) — a completely separate set of weights that project X into budget space.

The two heads never share weights. They train independently, learn independently, and attend independently. The only thing they share is the input.

Think of it this way: you're asking two different friends for advice. One knows a lot about food. The other knows a lot about value for money. You give them both the same information about Alex, Emma, and Daniel — but they each filter it through a different lens and come back with a different answer. Multi-head attention does exactly this, in parallel, for every token in the sequence.
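In code, one head is just three matrix multiplies, a scaled score, and a softmax. A minimal NumPy sketch (the function name and shapes are illustrative; real implementations batch all heads into one tensor operation):

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V):
    """One attention head: project, score, softmax, blend."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = Q @ K.T / np.sqrt(K.shape[-1])             # scaled alignment scores
    E = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    A = E / E.sum(axis=-1, keepdims=True)          # each row sums to 1
    return A @ V, A                                # blended values, attention weights

# Two heads are just two independent calls on the same X:
#   out1, A1 = attention_head(X, W_Q1, W_K1, W_V1)
#   out2, A2 = attention_head(X, W_Q2, W_K2, W_V2)
```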


3. Head 1 — taste and vibe

Head 1 is identical to what we computed in Part 1. The weight matrices project X into intent/knows/info space:

\[ W_Q^1 = \left[\begin{array}{c:c|c} & \text{street-stall} & \text{upscale} \\ \hline \text{spicy} & 0.97 & 0.08 \\ \text{cheap} & 0.99 & 0.11 \\ \text{cozy} & 0.12 & 0.96 \\ \text{fast} & 0.98 & 0.09 \\ \text{flavor} & 0.13 & 0.07 \\ \text{vibe} & 0.10 & 0.98 \end{array}\right] \quad W_K^1 = \left[\begin{array}{c:c|c} & \text{knows-street-stall} & \text{knows-upscale-flavor} \\ \hline \text{spicy} & 0.96 & 0.09 \\ \text{cheap} & 0.98 & 0.11 \\ \text{cozy} & 0.10 & 0.97 \\ \text{fast} & 0.97 & 0.08 \\ \text{flavor} & 0.12 & 0.96 \\ \text{vibe} & 0.09 & 0.99 \end{array}\right] \quad W_V^1 = \left[\begin{array}{c:c:c|c} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \hline \text{spicy} & 0.97 & 0.11 & 0.09 \\ \text{cheap} & 0.10 & 0.98 & 0.08 \\ \text{cozy} & 0.09 & 0.12 & 0.96 \\ \text{fast} & 0.98 & 0.10 & 0.11 \\ \text{flavor} & 0.11 & 0.97 & 0.09 \\ \text{vibe} & 0.08 & 0.09 & 0.99 \end{array}\right] \]

The full derivation is in Part 1. The results:

\[ A^1 = \begin{bmatrix} & \text{Alex} & \text{Emma} & \text{Daniel} \\ \text{Alex} & 0.929 & 0.046 & 0.024 \\ \text{Emma} & 0.426 & 0.186 & 0.388 \\ \text{Daniel} & 0.118 & 0.138 & 0.744 \end{bmatrix} \]
\[ \text{Output}^1 = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Alex} & 1.909 & 1.270 & 0.533 \\ \text{Emma} & 1.142 & 1.292 & 1.068 \\ \text{Daniel} & 0.697 & 1.310 & 1.666 \end{bmatrix} \]

Head 1 tells us: Alex's street-stall craving matched his own Sichuan knowledge — 92.9% self-attention. Daniel's upscale craving matched his own upscale knowledge — 74.4% self-attention. Emma sits between both, drawing from each.
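A¹ can be reproduced directly from X and the Head 1 weights above. A NumPy check (variable names are mine; results agree with the matrix above to about three decimals):

```python
import numpy as np

X = np.array([[0.98, 0.95, 0.12, 0.97, 0.15, 0.08],   # Alex
              [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],   # Emma
              [0.14, 0.17, 0.92, 0.11, 0.96, 0.95]])  # Daniel

W_Q1 = np.array([[0.97, 0.08], [0.99, 0.11], [0.12, 0.96],
                 [0.98, 0.09], [0.13, 0.07], [0.10, 0.98]])
W_K1 = np.array([[0.96, 0.09], [0.98, 0.11], [0.10, 0.97],
                 [0.97, 0.08], [0.12, 0.96], [0.09, 0.99]])

S1 = (X @ W_Q1) @ (X @ W_K1).T / np.sqrt(2)    # d_k = 2
E = np.exp(S1 - S1.max(axis=1, keepdims=True))
A1 = E / E.sum(axis=1, keepdims=True)           # row-wise softmax
```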

But Head 1 knows nothing about what things cost.


4. Head 2 — budget

Head 2 asks a different question entirely. The weight matrices project X into budget space — who's asking about affordable places, and who's asking about high-end experiences.

\[ W_Q^2 = \left[\begin{array}{c:c|c} & \text{budget}_1\text{(low)} & \text{budget}_2\text{(high)} \\ \hline \text{spicy} & 0.95 & 0.08 \\ \text{cheap} & 0.98 & 0.06 \\ \text{cozy} & 0.09 & 0.94 \\ \text{fast} & 0.91 & 0.11 \\ \text{flavor} & 0.10 & 0.96 \\ \text{vibe} & 0.07 & 0.97 \end{array}\right] \quad W_K^2 = \left[\begin{array}{c:c|c} & \text{knows\_low} & \text{knows\_high} \\ \hline \text{spicy} & 0.94 & 0.09 \\ \text{cheap} & 0.97 & 0.07 \\ \text{cozy} & 0.08 & 0.95 \\ \text{fast} & 0.92 & 0.10 \\ \text{flavor} & 0.09 & 0.97 \\ \text{vibe} & 0.08 & 0.96 \end{array}\right] \quad W_V^2 = \left[\begin{array}{c:c|c} & \text{price\_info} & \text{experience\_info} \\ \hline \text{spicy} & 0.96 & 0.09 \\ \text{cheap} & 0.97 & 0.08 \\ \text{cozy} & 0.10 & 0.95 \\ \text{fast} & 0.93 & 0.11 \\ \text{flavor} & 0.09 & 0.96 \\ \text{vibe} & 0.08 & 0.97 \end{array}\right] \]

W_Q² — budget₁ captures low-budget intent (spicy, cheap, fast), budget₂ captures high-budget intent (cozy, flavor, vibe)
W_K² — knows_low signals knowledge of affordable places, knows_high signals knowledge of premium places
W_V² — price_info carries concrete budget knowledge, experience_info carries high-end experience knowledge

Q² = X @ W_Q²

\[ Q^2 = X W_Q^2 = \begin{bmatrix} 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \begin{bmatrix} 0.95 & 0.08 \\ 0.98 & 0.06 \\ 0.09 & 0.94 \\ 0.91 & 0.11 \\ 0.10 & 0.96 \\ 0.07 & 0.97 \end{bmatrix} = \begin{bmatrix} & \text{budget}_1 & \text{budget}_2 \\ \text{Alex} & 2.777 & 0.577 \\ \text{Emma} & 1.239 & 1.261 \\ \text{Daniel} & 0.646 & 2.742 \end{bmatrix} \]

K² = X @ W_K²

\[ K^2 = X W_K^2 = \begin{bmatrix} 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \begin{bmatrix} 0.94 & 0.09 \\ 0.97 & 0.07 \\ 0.08 & 0.95 \\ 0.92 & 0.10 \\ 0.09 & 0.97 \\ 0.08 & 0.96 \end{bmatrix} = \begin{bmatrix} & \text{knows\_low} & \text{knows\_high} \\ \text{Alex} & 2.765 & 0.589 \\ \text{Emma} & 1.218 & 1.278 \\ \text{Daniel} & 0.634 & 2.753 \end{bmatrix} \]

V² = X @ W_V²

\[ V^2 = X W_V^2 = \begin{bmatrix} 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \begin{bmatrix} 0.96 & 0.09 \\ 0.97 & 0.08 \\ 0.10 & 0.95 \\ 0.93 & 0.11 \\ 0.09 & 0.96 \\ 0.08 & 0.97 \end{bmatrix} = \begin{bmatrix} & \text{price\_info} & \text{experience\_info} \\ \text{Alex} & 2.797 & 0.607 \\ \text{Emma} & 1.241 & 1.290 \\ \text{Daniel} & 0.655 & 2.757 \end{bmatrix} \]

Alex Q²=[2.777, 0.577] — asking firmly in the low-budget direction. V²=[2.797, 0.607] — his knowledge is mainly about affordable places.
Emma Q²=[1.239, 1.261] — balanced between both budget ranges.
Daniel Q²=[0.646, 2.742] — asking firmly in the high-budget direction. V²=[0.655, 2.757] — his knowledge is mainly about premium experiences.
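The three projections are plain matrix multiplies. A self-contained NumPy sketch, with X as defined in section 2:

```python
import numpy as np

X = np.array([[0.98, 0.95, 0.12, 0.97, 0.15, 0.08],   # Alex
              [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],   # Emma
              [0.14, 0.17, 0.92, 0.11, 0.96, 0.95]])  # Daniel

W_Q2 = np.array([[0.95, 0.08], [0.98, 0.06], [0.09, 0.94],
                 [0.91, 0.11], [0.10, 0.96], [0.07, 0.97]])
W_K2 = np.array([[0.94, 0.09], [0.97, 0.07], [0.08, 0.95],
                 [0.92, 0.10], [0.09, 0.97], [0.08, 0.96]])
W_V2 = np.array([[0.96, 0.09], [0.97, 0.08], [0.10, 0.95],
                 [0.93, 0.11], [0.09, 0.96], [0.08, 0.97]])

# Each projection maps the 3x6 input into the head's 2-dim subspace
Q2, K2, V2 = X @ W_Q2, X @ W_K2, X @ W_V2
```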

S² = Q²(K²)ᵀ / √2

\[ S^2 \approx \begin{bmatrix} & \text{Alex} & \text{Emma} & \text{Daniel} \\ \text{Alex} & 5.671 & 2.912 & 2.368 \\ \text{Emma} & 2.948 & 2.207 & 3.010 \\ \text{Daniel} & 2.405 & 3.034 & 5.627 \end{bmatrix} \]

Softmax → A²

\[ \text{softmax}([5.671, 2.912, 2.368]) \approx [0.909,\ 0.058,\ 0.033] \] \[ \text{softmax}([2.948, 2.207, 3.010]) \approx [0.394,\ 0.188,\ 0.419] \] \[ \text{softmax}([2.405, 3.034, 5.627]) \approx [0.036,\ 0.067,\ 0.897] \]
\[ A^2 = \begin{bmatrix} & \text{Alex} & \text{Emma} & \text{Daniel} \\ \text{Alex} & 0.909 & 0.058 & 0.033 \\ \text{Emma} & 0.394 & 0.188 & 0.419 \\ \text{Daniel} & 0.036 & 0.067 & 0.897 \end{bmatrix} \]

Compare A² with A¹:

Person: self-attention in Head 1 / in Head 2
Alex: 0.929 / 0.909
Emma: 0.186 / 0.188
Daniel: 0.744 / 0.897

Daniel's self-attention jumps from 0.744 to 0.897 in Head 2 — on budget, he is even more isolated. Nobody else is in his league. Alex stays similarly self-referential. Emma remains the most balanced in both heads.
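The scores and attention weights above follow from Q² and K² in one more step. A self-contained NumPy sketch:

```python
import numpy as np

X = np.array([[0.98, 0.95, 0.12, 0.97, 0.15, 0.08],
              [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],
              [0.14, 0.17, 0.92, 0.11, 0.96, 0.95]])
W_Q2 = np.array([[0.95, 0.08], [0.98, 0.06], [0.09, 0.94],
                 [0.91, 0.11], [0.10, 0.96], [0.07, 0.97]])
W_K2 = np.array([[0.94, 0.09], [0.97, 0.07], [0.08, 0.95],
                 [0.92, 0.10], [0.09, 0.97], [0.08, 0.96]])

S2 = (X @ W_Q2) @ (X @ W_K2).T / np.sqrt(2)    # scaled scores, d_k = 2
E = np.exp(S2 - S2.max(axis=1, keepdims=True))
A2 = E / E.sum(axis=1, keepdims=True)           # Daniel's self-attention lands near 0.897
```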

Output² = A² @ V²

\[ \text{Output}^2 = A^2 \cdot V^2 = \begin{bmatrix} 0.909 & 0.058 & 0.033 \\ 0.394 & 0.188 & 0.419 \\ 0.036 & 0.067 & 0.897 \end{bmatrix} \begin{bmatrix} 2.797 & 0.607 \\ 1.241 & 1.290 \\ 0.655 & 2.757 \end{bmatrix} = \begin{bmatrix} & \text{price\_info} & \text{experience\_info} \\ \text{Alex} & 2.614 & 0.734 \\ \text{Emma} & 1.607 & 1.633 \\ \text{Daniel} & 0.772 & 2.572 \end{bmatrix} \]

Head 2 captures something Head 1 completely missed: Alex and Daniel are on opposite ends of the budget spectrum, and Emma is exactly in the middle.


5. Concatenate and project

Each head has produced an output matrix. Now we combine them.

The combination is not averaging, and not adding — it is concatenation. Each person's two output vectors are placed end to end, forming a single longer vector:

\[ \text{Concat} = [\text{Output}^1 \,\|\, \text{Output}^2] = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} & \text{price\_info} & \text{exp\_info} \\ \text{Alex} & 1.909 & 1.270 & 0.533 & 2.614 & 0.734 \\ \text{Emma} & 1.142 & 1.292 & 1.068 & 1.607 & 1.633 \\ \text{Daniel} & 0.697 & 1.310 & 1.666 & 0.772 & 2.572 \end{bmatrix} \]

Each person's row is now a 5-dimensional vector that contains everything: taste, vibe, and budget, all in one place. But it's the wrong shape for the next layer, which expects the original 6-dimensional space. So we project it back down with W_O ∈ ℝ⁵ˣ⁶, a learned weight matrix that decides how to blend the two heads' contributions:

\[ W_O = \begin{bmatrix} & \text{spicy} & \text{cheap} & \text{cozy} & \text{fast} & \text{flavor} & \text{vibe} \\ \text{info-sichuan} & 0.96 & 0.09 & 0.08 & 0.95 & 0.10 & 0.07 \\ \text{info-budget} & 0.10 & 0.94 & 0.11 & 0.09 & 0.96 & 0.08 \\ \text{info-upscale} & 0.08 & 0.10 & 0.95 & 0.07 & 0.11 & 0.97 \\ \text{price\_info} & 0.11 & 0.96 & 0.09 & 0.10 & 0.12 & 0.08 \\ \text{exp\_info} & 0.09 & 0.10 & 0.12 & 0.08 & 0.95 & 0.96 \end{bmatrix} \]

Each row maps one concatenated dimension back into the original feature space. info-sichuan flows strongly into spicy and fast. info-upscale flows into cozy and vibe. price_info flows into cheap. experience_info flows into flavor and vibe.

Final Output = Concat @ W_O

\[ \text{Final Output} = \begin{bmatrix} & \text{spicy} & \text{cheap} & \text{cozy} & \text{fast} & \text{flavor} & \text{vibe} \\ \text{Alex} & 2.357 & 4.001 & 1.122 & 2.285 & 2.480 & 1.667 \\ \text{Emma} & 1.634 & 3.130 & 1.589 & 1.568 & 3.215 & 2.916 \\ \text{Daniel} & 1.249 & 2.459 & 2.161 & 1.180 & 4.047 & 4.301 \end{bmatrix} \]

Let's expand Alex's row explicitly to see how W_O combines both heads:

\[ \begin{aligned} \text{Alex}_{\text{spicy}} &= 1.909\times0.96 + 1.270\times0.10 + 0.533\times0.08 + 2.614\times0.11 + 0.734\times0.09 = 2.357 \\ \text{Alex}_{\text{cheap}} &= 1.909\times0.09 + 1.270\times0.94 + 0.533\times0.10 + 2.614\times0.96 + 0.734\times0.10 = 4.001 \\ \text{Alex}_{\text{cozy}} &= 1.909\times0.08 + 1.270\times0.11 + 0.533\times0.95 + 2.614\times0.09 + 0.734\times0.12 = 1.122 \\ \text{Alex}_{\text{fast}} &= 1.909\times0.95 + 1.270\times0.09 + 0.533\times0.07 + 2.614\times0.10 + 0.734\times0.08 = 2.285 \\ \text{Alex}_{\text{flavor}} &= 1.909\times0.10 + 1.270\times0.96 + 0.533\times0.11 + 2.614\times0.12 + 0.734\times0.95 = 2.480 \\ \text{Alex}_{\text{vibe}} &= 1.909\times0.07 + 1.270\times0.08 + 0.533\times0.97 + 2.614\times0.08 + 0.734\times0.96 = 1.667 \end{aligned} \]

Notice why cheap = 4.001 is so dominant. Two terms drive it:

\[ \underbrace{1.270 \times 0.94}_{\text{info-budget} \to \text{cheap}} + \underbrace{2.614 \times 0.96}_{\text{price\_info} \to \text{cheap}} = 1.194 + 2.509 = 3.703 \]

Head 1's budget signal and Head 2's price signal both flow strongly into cheap via W_O — two heads, two angles, both pointing at the same dimension. W_O is where those contributions stack. This is the payoff of multi-head attention: independent views of the same input, combined into a richer final representation than any single head could produce alone.

Alex [2.357, 4.001, 1.122, 2.285, 2.480, 1.667] — cheap dominates. Head 1 said he craves street-stall food; Head 2 said he's low-budget. W_O amplified both signals into a strong cheap output.
Emma [1.634, 3.130, 1.589, 1.568, 3.215, 2.916] — flavor and cheap lead. Both heads found her balanced between the two extremes, and the final output reflects that.
Daniel [1.249, 2.459, 2.161, 1.180, 4.047, 4.301] — flavor and vibe dominate. Head 1 said he craves atmosphere; Head 2 said he's high-budget. W_O wove both signals into a strong flavor+vibe output.

Notice what single-head attention couldn't have produced: Daniel's cheap score (2.459) is notably lower than Alex's (4.001), even though cheap is a feature in X for all three. The budget head added information the taste head never saw, and W_O wove them together.
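The whole concatenate-and-project step is one `np.concatenate` and one matrix multiply. Plugging in the two head outputs and W_O from above reproduces the final matrix (a NumPy sketch):

```python
import numpy as np

Out1 = np.array([[1.909, 1.270, 0.533],    # Head 1: taste and vibe
                 [1.142, 1.292, 1.068],
                 [0.697, 1.310, 1.666]])
Out2 = np.array([[2.614, 0.734],           # Head 2: budget
                 [1.607, 1.633],
                 [0.772, 2.572]])
W_O = np.array([[0.96, 0.09, 0.08, 0.95, 0.10, 0.07],   # info-sichuan
                [0.10, 0.94, 0.11, 0.09, 0.96, 0.08],   # info-budget
                [0.08, 0.10, 0.95, 0.07, 0.11, 0.97],   # info-upscale
                [0.11, 0.96, 0.09, 0.10, 0.12, 0.08],   # price_info
                [0.09, 0.10, 0.12, 0.08, 0.95, 0.96]])  # exp_info

concat = np.concatenate([Out1, Out2], axis=1)  # 3x5: nothing discarded
final = concat @ W_O                           # 3x6: back to feature space
```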


6. The full picture

Multi-head attention is not a fundamentally different operation from single-head attention. It is the same operation, run in parallel, multiple times, each time asking a different question — then combining the answers.

For each head \(h = 1, 2, \ldots, H\):

\[ Q_h = X W_Q^h, \qquad K_h = X W_K^h, \qquad V_h = X W_V^h \]
\[ \text{Output}^h = \text{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right) V_h \]

Then concatenate and project:

\[ \text{Final Output} = \text{Concat}[\text{Output}^1, \text{Output}^2, \ldots, \text{Output}^H]\, W_O \]

In our example, H = 2. In GPT-3, H = 96.
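The general recipe, written as a function over any number of heads (a sketch; production implementations fuse the per-head projections into single large tensors rather than looping):

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads: a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:                 # each head attends independently
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)
    # concatenate all head outputs, then project back with W_O
    return np.concatenate(outputs, axis=-1) @ W_O
```

In our example, `heads` would hold the two triples (W_Q¹, W_K¹, W_V¹) and (W_Q², W_K², W_V²), and W_O would be the 5×6 projection from section 5.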

Head 1 (taste and vibe): independent Q¹, K¹, V¹ → Output¹ (3-dim)
Head 2 (budget): independent Q², K², V² → Output² (2-dim)
Concatenate: stack Output¹ and Output² end to end → 5-dim per person
Project with W_O: blend both heads' contributions → back to 6-dim

What each piece does:

Multiple W_Q, W_K, W_V — each head learns to ask a different question. No coordination between heads during training; they find their own niches.

Independent softmax per head — each head produces its own attention distribution. Head 1 might attend broadly; Head 2 might attend sharply. They don't interfere.

Concatenation — preserves everything. No information from either head is discarded before W_O sees it.

W_O — the mixer. It learns how to combine the heads' contributions into a coherent final representation. This is where "taste matters more than budget in this context" gets decided.

Multi-head attention lets the model maintain parallel, non-interfering views of the same input. Each head specialises. W_O synthesises. You didn't ask one friend "who shares both my taste and my budget?" — you asked two friends two clean questions, listened to both, and combined them yourself. That combination is W_O.