Attention Is Just Deciding Where To Eat

Three people are trying to pick a restaurant tonight. Each has their own preferences, and each knows something about the options. The question is: how do you combine everyone's knowledge so that each person ends up with the recommendation that fits them best? That's exactly what attention does.


1. Three people, three feature profiles

Alex, Emma, and Daniel are going out for dinner tonight. Each person carries a feature profile — a set of traits that characterize who they are and what they care about. These features aren't just preferences. As we'll see, the same profile will be decoded in three different ways: what they're craving, what restaurants they know about, and what specific information they're holding.

Alex
spicy cheap fast
Emma
cheap cozy
Daniel
flavor vibe cozy

We encode each person as a vector across six dimensions — [spicy, cheap, cozy, fast, flavor, vibe]. This is embedding — compressing who someone is into numbers. The same vector X will later be projected three ways to extract three different kinds of meaning:

\[ X = \begin{bmatrix} & \text{spicy} & \text{cheap} & \text{cozy} & \text{fast} & \text{flavor} & \text{vibe} \\ \text{Alex} & 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ \text{Emma} & 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ \text{Daniel} & 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \qquad X \in \mathbb{R}^{3 \times 6} \]

Each row is one person. Each column is one feature. The same X is the starting point for Q, K, and V — three decoders, three different readings of the same profile.
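Everything below can be followed along in code. Here is the embedding matrix as a NumPy array (a minimal sketch; the variable names are ours, not from any library):

```python
import numpy as np

# Rows = people (Alex, Emma, Daniel); columns = [spicy, cheap, cozy, fast, flavor, vibe]
X = np.array([
    [0.98, 0.95, 0.12, 0.97, 0.15, 0.08],  # Alex
    [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],  # Emma
    [0.14, 0.17, 0.92, 0.11, 0.96, 0.95],  # Daniel
])

print(X.shape)  # (3, 6): three people, six features
```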


2. One profile, three decoders

The same feature profile X gets run through three different decoders — Q, K, and V. Each decoder is specialized for a different kind of reading. They don't see different people; they see the same person differently:

Q
Decoder for cravings: what do I want to eat?
Query — a demand decoder. Given your features, W_Q estimates the strength of what you're hungry for. Alex's profile (spicy, cheap, fast) gets decoded into: "strong street-stall craving, almost no upscale pull."
K
Decoder for knowledge: where do I know good food?
Key — a knowledge decoder. Given your features, W_K infers what type of knowledge you're likely carrying. Alex's profile (spicy, cheap, fast) gets decoded into: "this person knows the street-stall scene well."
V
Decoder for content: what are the specific details I'm holding?
Value — a content extractor. Given your features, W_V pulls out the actual substance you're carrying: not a label, not a signal — the real information that gets passed forward.

Three decoders, three specialized readings of the same profile. Each one extracts something different from the same X.

Q vs K vs V in plain terms: Q asks "what am I craving?" — it drives the search. K says "I know this type of place" — it's the label that gets matched against Q. V says "here's the actual info" — the restaurant name, the dish, the wait time. Q and K play the matching game; V delivers the goods.

\( W_Q \in \mathbb{R}^{6 \times 2} \) — a demand decoder: given your feature profile, it estimates how strongly you're craving each type of experience

\[ W_Q = \left[\begin{array}{c:c|c} & \text{street-stall} & \text{upscale} \\ \hline \text{spicy} & 0.97 & 0.08 \\ \text{cheap} & 0.99 & 0.11 \\ \text{cozy} & 0.12 & 0.96 \\ \text{fast} & 0.98 & 0.09 \\ \text{flavor} & 0.13 & 0.07 \\ \text{vibe} & 0.10 & 0.98 \end{array}\right] \]

Each row is an input feature. Each column is a query direction. spicy contributes to street-stall ("I want it spicy, fast, and cheap"), cozy contributes to upscale ("I want good atmosphere and comfort").

street-stall — cares about spice, price, speed — nothing else matters
upscale — cares about vibe and coziness — price and heat are irrelevant

\( W_K \in \mathbb{R}^{6 \times 2} \) — a knowledge decoder: given your feature profile, it decodes what kinds of places you're likely to know about

\[ W_K = \left[\begin{array}{c:c|c} & \text{knows-street-stall} & \text{knows-upscale-flavor} \\ \hline \text{spicy} & 0.96 & 0.09 \\ \text{cheap} & 0.98 & 0.11 \\ \text{cozy} & 0.10 & 0.97 \\ \text{fast} & 0.97 & 0.08 \\ \text{flavor} & 0.12 & 0.96 \\ \text{vibe} & 0.09 & 0.99 \end{array}\right] \]

WK and WQ are deliberately different. "Craving bold food" and "knowing where bold food is" are two different things — the model needs separate weights to learn them independently.

knows-street-stall — knows where to find spicy, cheap, fast places
knows-upscale-flavor — knows upscale spots with great atmosphere and genuine flavor

\( W_V \in \mathbb{R}^{6 \times 3} \) — a content extractor: given your feature profile, it pulls out the actual knowledge you're holding — no fluff, just the substance

\[ W_V = \left[\begin{array}{c:c:c|c} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \hline \text{spicy} & 0.97 & 0.11 & 0.09 \\ \text{cheap} & 0.10 & 0.98 & 0.08 \\ \text{cozy} & 0.09 & 0.12 & 0.96 \\ \text{fast} & 0.98 & 0.10 & 0.11 \\ \text{flavor} & 0.11 & 0.97 & 0.09 \\ \text{vibe} & 0.08 & 0.09 & 0.99 \end{array}\right] \]

V has three output dimensions — it carries content, not a matching signal. Q and K must share a dimension (here 2) so their dot product is defined; V's width is free, because it only needs to hold the information being passed forward.

info-sichuan — knows Sichuan-style spots: spicy, fast — which restaurants, how long they've been open, what to order
info-budget — knows budget-friendly spots with real flavor: cheap eats that actually taste good
info-upscale — knows the upscale scene: cozy, great vibe, places worth the price


3. Alignment scores

Now we run X through each decoder to extract what each person is craving, what they know, and what content they're holding.

Q = X WQ — "you have these features, let me decode what you're craving and how strongly"

\[ Q = X W_Q = \begin{bmatrix} 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \begin{bmatrix} 0.97 & 0.08 \\ 0.99 & 0.11 \\ 0.12 & 0.96 \\ 0.98 & 0.09 \\ 0.13 & 0.07 \\ 0.10 & 0.98 \end{bmatrix} = \begin{bmatrix} & \text{street-stall} & \text{upscale} \\ \text{Alex} & 2.885 & 0.474 \\ \text{Emma} & 1.293 & 1.210 \\ \text{Daniel} & 0.742 & 1.921 \end{bmatrix} \]

Alex [2.885, 0.474] — W_Q decoded his profile (spicy↑ cheap↑ fast↑) into a strong street-stall craving. He didn't say it out loud — his features gave it away.
Emma [1.293, 1.210] — balanced craving in both directions. Her mixed profile (cheap↑ cozy↑) doesn't pull strongly either way.
Daniel [0.742, 1.921] — W_Q decoded his profile (cozy↑ flavor↑ vibe↑) into a strong upscale craving. He wants atmosphere and quality, not speed or heat.

K = X WK — "you have these features, let me infer what knowledge you're carrying"

\[ K = X W_K = \begin{bmatrix} 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \begin{bmatrix} 0.96 & 0.09 \\ 0.98 & 0.11 \\ 0.10 & 0.97 \\ 0.97 & 0.08 \\ 0.12 & 0.96 \\ 0.09 & 0.99 \end{bmatrix} = \begin{bmatrix} & \text{knows-street-stall} & \text{knows-upscale-flavor} \\ \text{Alex} & 2.850 & 0.610 \\ \text{Emma} & 1.260 & 1.338 \\ \text{Daniel} & 0.701 & 2.796 \end{bmatrix} \]

Alex [2.850, 0.610] — W_K decoded his profile (spicy↑ cheap↑ fast↑) into strong street-stall knowledge. People like him tend to know exactly which stalls are worth it.
Emma [1.260, 1.338] — balanced knowledge signal. Her mixed profile suggests she has something useful in both directions.
Daniel [0.701, 2.796] — W_K decoded his profile (cozy↑ flavor↑ vibe↑) into deep upscale-flavor knowledge. He's the one to ask about atmosphere and quality.

Note: Alex's Q and K are both strong in the first dimension — but only because this toy example's W_Q and W_K happen to have similar values, so they decode his profile alike. In a real model the two matrices are trained separately and rarely produce such similar results: "craving bold food" and "knowing where bold food is" are learned independently, which is precisely why two matrices are needed.

V = X WV — "you have these features, let me extract the actual content you're carrying"

\[ V = X W_V = \begin{bmatrix} 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\ 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\ 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \end{bmatrix} \begin{bmatrix} 0.97 & 0.11 & 0.09 \\ 0.10 & 0.98 & 0.08 \\ 0.09 & 0.12 & 0.96 \\ 0.98 & 0.10 & 0.11 \\ 0.11 & 0.97 & 0.09 \\ 0.08 & 0.09 & 0.99 \end{bmatrix} = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Alex} & 2.031 & 1.303 & 0.479 \\ \text{Emma} & 0.404 & 1.217 & 1.189 \\ \text{Daniel} & 0.526 & 1.320 & 1.949 \end{bmatrix} \]

Alex [2.031, 1.303, 0.479] — W_V extracted mostly Sichuan content from his profile: which stalls, how long open, what to order. His features don't carry much upscale substance.
Emma [0.404, 1.217, 1.189] — W_V extracted a balanced content mix: some budget-flavor knowledge, some upscale. She's a useful source in both directions.
Daniel [0.526, 1.320, 1.949] — W_V extracted mostly upscale content: the cozy spots, the vibe, which places are worth the price. Little Sichuan substance in his profile.
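In NumPy, the three projections above are one line each (a sketch restating the arrays from earlier sections; names are ours):

```python
import numpy as np

X = np.array([[0.98, 0.95, 0.12, 0.97, 0.15, 0.08],   # Alex
              [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],   # Emma
              [0.14, 0.17, 0.92, 0.11, 0.96, 0.95]])  # Daniel

# Columns of W_Q: [street-stall, upscale]
W_Q = np.array([[0.97, 0.08], [0.99, 0.11], [0.12, 0.96],
                [0.98, 0.09], [0.13, 0.07], [0.10, 0.98]])
# Columns of W_K: [knows-street-stall, knows-upscale-flavor]
W_K = np.array([[0.96, 0.09], [0.98, 0.11], [0.10, 0.97],
                [0.97, 0.08], [0.12, 0.96], [0.09, 0.99]])
# Columns of W_V: [info-sichuan, info-budget, info-upscale]
W_V = np.array([[0.97, 0.11, 0.09], [0.10, 0.98, 0.08], [0.09, 0.12, 0.96],
                [0.98, 0.10, 0.11], [0.11, 0.97, 0.09], [0.08, 0.09, 0.99]])

Q = X @ W_Q   # (3, 2) craving strengths
K = X @ W_K   # (3, 2) knowledge strengths
V = X @ W_V   # (3, 3) actual content
```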


4. Scoring craving against knowledge

For each person, we take their decoded craving (Q) and score it against every person's decoded knowledge (K) using the dot product. The higher the overlap between what one person is craving and what another person knows, the higher the score.

S[i][j] = Q[i] · K[j] answers: how well does person i's craving align with what person j knows?

Expanding each pair explicitly:

\[ \begin{aligned} Q_{\text{Alex}} \cdot K_{\text{Alex}} &= [2.885, 0.474] \cdot [2.850, 0.610] = 2.885\times2.850 + 0.474\times0.610 = 8.511 \\ Q_{\text{Alex}} \cdot K_{\text{Emma}} &= [2.885, 0.474] \cdot [1.260, 1.338] = 2.885\times1.260 + 0.474\times1.338 = 4.269 \\ Q_{\text{Alex}} \cdot K_{\text{Daniel}} &= [2.885, 0.474] \cdot [0.701, 2.796] = 2.885\times0.701 + 0.474\times2.796 = 3.347 \\[6pt] Q_{\text{Emma}} \cdot K_{\text{Alex}} &= [1.293, 1.210] \cdot [2.850, 0.610] = 1.293\times2.850 + 1.210\times0.610 = 4.423 \\ Q_{\text{Emma}} \cdot K_{\text{Emma}} &= [1.293, 1.210] \cdot [1.260, 1.338] = 1.293\times1.260 + 1.210\times1.338 = 3.248 \\ Q_{\text{Emma}} \cdot K_{\text{Daniel}} &= [1.293, 1.210] \cdot [0.701, 2.796] = 1.293\times0.701 + 1.210\times2.796 = 4.289 \\[6pt] Q_{\text{Daniel}} \cdot K_{\text{Alex}} &= [0.742, 1.921] \cdot [2.850, 0.610] = 0.742\times2.850 + 1.921\times0.610 = 3.287 \\ Q_{\text{Daniel}} \cdot K_{\text{Emma}} &= [0.742, 1.921] \cdot [1.260, 1.338] = 0.742\times1.260 + 1.921\times1.338 = 3.506 \\ Q_{\text{Daniel}} \cdot K_{\text{Daniel}} &= [0.742, 1.921] \cdot [0.701, 2.796] = 0.742\times0.701 + 1.921\times2.796 = 5.891 \end{aligned} \]

These 9 dot products can be computed in one step using matrix multiplication. Q is 3×2 and K is also 3×2 — they can't be multiplied directly. Transposing K to 2×3 makes the shapes compatible, and Q @ Kᵀ batches all 9 dot products at once:

\[ QK^\top = \begin{bmatrix} 2.885 & 0.474 \\ 1.293 & 1.210 \\ 0.742 & 1.921 \end{bmatrix} \begin{bmatrix} 2.850 & 1.260 & 0.701 \\ 0.610 & 1.338 & 2.796 \end{bmatrix} = \begin{bmatrix} & \text{Alex} & \text{Emma} & \text{Daniel} \\ \text{Alex} & 8.511 & 4.269 & 3.347 \\ \text{Emma} & 4.423 & 3.248 & 4.289 \\ \text{Daniel} & 3.287 & 3.506 & 5.891 \end{bmatrix} \]

Before softmax, we divide by \(\sqrt{d_k}\) where \(d_k = 2\) is the dimension of Q and K. This prevents the dot products from growing too large and making softmax overly sharp:

\[ S = \frac{QK^\top}{\sqrt{d_k}} = \frac{QK^\top}{\sqrt{2}} \approx \begin{bmatrix} & \text{Alex} & \text{Emma} & \text{Daniel} \\ \text{Alex} & 6.018 & 3.018 & 2.367 \\ \text{Emma} & 3.127 & 2.297 & 3.032 \\ \text{Daniel} & 2.325 & 2.479 & 4.165 \end{bmatrix} \]

Alex → Alex = 8.511 → 6.018 — his intent (street-stall=2.885) strongly matches his own knowledge (knows-street-stall=2.850).
Alex → Daniel = 3.347 → 2.367 — some overlap, but Alex's street-stall doesn't align well with Daniel's knows-upscale-flavor-dominant profile.
Daniel → Daniel = 5.891 → 4.165 — Daniel's intent (upscale=1.921) strongly aligns with his own knowledge (knows-upscale-flavor=2.796).
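The whole scoring step — nine dot products plus scaling — is two lines of NumPy (a sketch, starting from the rounded Q and K above):

```python
import numpy as np

Q = np.array([[2.885, 0.474], [1.293, 1.210], [0.742, 1.921]])
K = np.array([[2.850, 0.610], [1.260, 1.338], [0.701, 2.796]])

d_k = Q.shape[1]            # 2: dimension of each query/key vector
scores = Q @ K.T            # (3, 3): all nine craving-vs-knowledge dot products at once
S = scores / np.sqrt(d_k)   # scaled so softmax doesn't get overly sharp
```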


5. From scores to weights — what softmax actually does

We now apply softmax to each row of S. Each row represents one person's scores across all others — softmax turns these into attention weights: how much should I draw from each person's knowledge?

Alex's row: [6.018, 3.018, 2.367]

\[ \text{softmax}([6.018, 3.018, 2.367]) = \frac{[e^{6.018},\ e^{3.018},\ e^{2.367}]}{e^{6.018} + e^{3.018} + e^{2.367}} = \frac{[409.9,\ 20.45,\ 10.67]}{441.0} \approx [0.929,\ 0.046,\ 0.024] \]

Emma's row: [3.127, 2.297, 3.032]

\[ \text{softmax}([3.127, 2.297, 3.032]) = \frac{[e^{3.127},\ e^{2.297},\ e^{3.032}]}{e^{3.127} + e^{2.297} + e^{3.032}} = \frac{[22.79,\ 9.94,\ 20.74]}{53.47} \approx [0.426,\ 0.186,\ 0.388] \]

Daniel's row: [2.325, 2.479, 4.165]

\[ \text{softmax}([2.325, 2.479, 4.165]) = \frac{[e^{2.325},\ e^{2.479},\ e^{4.165}]}{e^{2.325} + e^{2.479} + e^{4.165}} = \frac{[10.22,\ 11.93,\ 64.34]}{86.49} \approx [0.118,\ 0.138,\ 0.744] \]

Collecting all three rows into a matrix:

\[ A = \text{softmax}(S) = \begin{bmatrix} & \text{Alex} & \text{Emma} & \text{Daniel} \\ \text{Alex} & 0.929 & 0.046 & 0.024 \\ \text{Emma} & 0.426 & 0.186 & 0.388 \\ \text{Daniel} & 0.118 & 0.138 & 0.744 \end{bmatrix} \]

Alex — scores [6.018, 3.018, 2.367] have a large gap, so he draws mostly from himself (0.929).
Emma — scores [3.127, 2.297, 3.032] are close, so she draws roughly equally from Alex (0.426) and Daniel (0.388).
Daniel — scores [2.325, 2.479, 4.165] heavily favor himself (0.744).
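Row-wise softmax in NumPy (a sketch; subtracting the row maximum before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

S = np.array([[6.018, 3.018, 2.367],
              [3.127, 2.297, 3.032],
              [2.325, 2.479, 4.165]])

def softmax_rows(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))  # stable: largest exponent is 0
    return e / e.sum(axis=1, keepdims=True)

A = softmax_rows(S)   # each row now sums to 1
```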


6. The output — combining V by weight

The softmax weights tell us how much to trust each person. We apply those weights to each person's V — their actual content — and sum the result:

Output = softmax( Q @ Kᵀ / √d_k ) @ V
\[ \text{Output} = A \cdot V = \begin{bmatrix} 0.929 & 0.046 & 0.024 \\ 0.426 & 0.186 & 0.388 \\ 0.118 & 0.138 & 0.744 \end{bmatrix} \begin{bmatrix} 2.031 & 1.303 & 0.479 \\ 0.404 & 1.217 & 1.189 \\ 0.526 & 1.320 & 1.949 \end{bmatrix} = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Alex} & 1.918 & 1.298 & 0.546 \\ \text{Emma} & 1.144 & 1.294 & 1.181 \\ \text{Daniel} & 0.687 & 1.304 & 1.671 \end{bmatrix} \]

Expanding each row explicitly:

\[ \begin{aligned} \text{Alex}_{\text{info-sichuan}} &= 0.929\times2.031 + 0.046\times0.404 + 0.024\times0.526 = 1.918 \\ \text{Alex}_{\text{info-budget}} &= 0.929\times1.303 + 0.046\times1.217 + 0.024\times1.320 = 1.298 \\ \text{Alex}_{\text{info-upscale}} &= 0.929\times0.479 + 0.046\times1.189 + 0.024\times1.949 = 0.546 \\[6pt] \text{Emma}_{\text{info-sichuan}} &= 0.426\times2.031 + 0.186\times0.404 + 0.388\times0.526 = 1.144 \\ \text{Emma}_{\text{info-budget}} &= 0.426\times1.303 + 0.186\times1.217 + 0.388\times1.320 = 1.294 \\ \text{Emma}_{\text{info-upscale}} &= 0.426\times0.479 + 0.186\times1.189 + 0.388\times1.949 = 1.181 \\[6pt] \text{Daniel}_{\text{info-sichuan}} &= 0.118\times2.031 + 0.138\times0.404 + 0.744\times0.526 = 0.687 \\ \text{Daniel}_{\text{info-budget}} &= 0.118\times1.303 + 0.138\times1.217 + 0.744\times1.320 = 1.304 \\ \text{Daniel}_{\text{info-upscale}} &= 0.118\times0.479 + 0.138\times1.189 + 0.744\times1.949 = 1.671 \end{aligned} \]

Alex [1.918, 1.298, 0.546] — output dominated by info-sichuan (spicy/fast). He paid 92.9% attention to himself, and his own V is strongest in info-sichuan.
Emma [1.144, 1.294, 1.181] — output spread almost evenly across all three dimensions. She split attention between Alex and Daniel, whose V vectors complement each other.
Daniel [0.687, 1.304, 1.671] — output dominated by info-upscale (cozy/vibe). He paid 74.4% attention to himself, and his own V is strongest in info-upscale.

The output is a weighted blend of everyone's real information. Nobody is ignored entirely — even the lowest-weight person still contributes a little. This is why it's called soft attention, not hard selection.
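The final blend is a single matrix product (a sketch using the rounded A and V from above):

```python
import numpy as np

A = np.array([[0.929, 0.046, 0.024],   # attention weights (rows: Alex, Emma, Daniel)
              [0.426, 0.186, 0.388],
              [0.118, 0.138, 0.744]])
V = np.array([[2.031, 1.303, 0.479],   # content (cols: info-sichuan, info-budget, info-upscale)
              [0.404, 1.217, 1.189],
              [0.526, 1.320, 1.949]])

output = A @ V   # each row: one person's weighted blend of everyone's content
```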


7. V and output — the same knowledge, twice

V is the first decoding — each person's raw content extracted from their profile. Output is the second — the same content, redistributed by attention weights.

V — raw content, before blending:

\[ V = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Alex} & 2.031 & 1.303 & 0.479 \\ \text{Emma} & 0.404 & 1.217 & 1.189 \\ \text{Daniel} & 0.526 & 1.320 & 1.949 \end{bmatrix} \]

Output — content after attention reweighting:

\[ \text{Output} = \begin{bmatrix} & \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\ \text{Alex} & 1.918 & 1.298 & 0.546 \\ \text{Emma} & 1.144 & 1.294 & 1.181 \\ \text{Daniel} & 0.687 & 1.304 & 1.671 \end{bmatrix} \]

Alex — output stays close to his own V (92.9% self-attention). His craving aligned so strongly with his own knowledge that he barely needed anyone else.
Emma — output shifts noticeably. She drew from both Alex and Daniel, so her final content is more balanced than her raw V.
Daniel — output stays close to his own V (74.4% self-attention). His upscale craving matched his own upscale content.

Attention doesn't invent anything. It just decides how to mix what's already there.


The full picture

Q
Project each token into its current intent
Q = X @ W_Q — decode my features into craving strengths
K
Project each token into what it knows about
K = X @ W_K — decode my features into knowledge strengths
V
Project each token into its actual content
V = X @ W_V — extract the actual content from my features
·
Q @ Kᵀ / √d_k → alignment scores
How well does each person's knowledge match the query?
σ
softmax → attention weights
Amplify the gaps, turn scores into a probability distribution
Weights @ V → output
Blend everyone's real content in proportion to how well they matched

Alex ends up with a recommendation dominated by Sichuan info — his craving aligned so strongly with his own knowledge that attention let him mostly listen to himself. Daniel gets the upscale recommendation for the same reason. Emma, with a balanced profile, draws from both sides and gets a more even blend.

Nobody was ignored. Nobody dominated unfairly. Attention allocated influence in proportion to alignment — three people, one shared pool of knowledge, each walking away with the recommendation that fit them best.
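Putting it all together, the entire walkthrough collapses into one small function (a minimal sketch of scaled dot-product attention; the function name and layout are ours):

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention: softmax(Q @ K.T / sqrt(d_k)) @ V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d_k)                   # alignment scores
    e = np.exp(S - S.max(axis=-1, keepdims=True))  # stable softmax...
    A = e / e.sum(axis=-1, keepdims=True)          # ...attention weights
    return A @ V                                   # weighted blend of content

X = np.array([[0.98, 0.95, 0.12, 0.97, 0.15, 0.08],
              [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],
              [0.14, 0.17, 0.92, 0.11, 0.96, 0.95]])
W_Q = np.array([[0.97, 0.08], [0.99, 0.11], [0.12, 0.96],
                [0.98, 0.09], [0.13, 0.07], [0.10, 0.98]])
W_K = np.array([[0.96, 0.09], [0.98, 0.11], [0.10, 0.97],
                [0.97, 0.08], [0.12, 0.96], [0.09, 0.99]])
W_V = np.array([[0.97, 0.11, 0.09], [0.10, 0.98, 0.08], [0.09, 0.12, 0.96],
                [0.98, 0.10, 0.11], [0.11, 0.97, 0.09], [0.08, 0.09, 0.99]])

out = attention(X, W_Q, W_K, W_V)   # (3, 3): one recommendation blend per person
```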