Position Matters
Frank and Alex have identical preferences. Without positional encoding, the model can't tell them apart.
1. The problem — attention has no sense of order
Look at how attention computes its output:
Q = XW_Q, K = XW_K, V = XW_V
Output = softmax( QKᵀ / √d ) V
Every operation here is row-wise. Q[i] is computed from X[i] alone. The alignment score between person i and person j is Q[i] · K[j] — it only depends on their feature vectors, not where they sit in the sequence.
If you shuffled the rows of X, you'd get a shuffled Q, a shuffled K, a shuffled V — and a shuffled output. The relative scores between any two people would be identical. The model has no way to know that person 1 spoke before person 4.
This is not a bug — it is permutation equivariance, a direct consequence of the matrix algebra: nothing in these formulas depends on which row a vector occupies. But for language, order is everything.
"Alex recommended the restaurant Emma doesn't like" and "Emma recommended the restaurant Alex doesn't like" use the exact same words. Swap the positions, swap the meaning. Attention, on its own, cannot make that distinction.
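The order-blindness is easy to check numerically. Below is a minimal numpy sketch with made-up random matrices (not the table's actual numbers): permuting the input rows permutes the output rows and changes nothing else.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention over the rows of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V

X = rng.normal(size=(4, 6))        # 4 tokens, 6 features (toy data)
W_Q = rng.normal(size=(6, 2))
W_K = rng.normal(size=(6, 2))
W_V = rng.normal(size=(6, 3))
perm = [3, 0, 2, 1]                # reshuffle the "speakers"

out = attention(X, W_Q, W_K, W_V)
out_perm = attention(X[perm], W_Q, W_K, W_V)

# Shuffling the input rows only shuffles the output rows: no sense of order
print(np.allclose(out[perm], out_perm))  # True
```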
2. Frank joins the table
We now have four people seated in order:
| Person | Position | Profile |
| --- | --- | --- |
| Frank | pos = 1 | spicy, cheap, fast |
| Emma | pos = 2 | cheap, cozy |
| Daniel | pos = 3 | flavor, vibe, cozy |
| Alex | pos = 4 | spicy, cheap, fast |
Frank is new. And he happens to have exactly the same preference profile as Alex — same taste, same priorities, same everything:
\[
X = \begin{bmatrix}
& \text{spicy} & \text{cheap} & \text{cozy} & \text{fast} & \text{flavor} & \text{vibe} \\
\text{Frank} & 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08 \\
\text{Emma} & 0.11 & 0.96 & 0.94 & 0.09 & 0.13 & 0.18 \\
\text{Daniel} & 0.14 & 0.17 & 0.92 & 0.11 & 0.96 & 0.95 \\
\text{Alex} & 0.98 & 0.95 & 0.12 & 0.97 & 0.15 & 0.08
\end{bmatrix}
\]
Row 1 and row 4 are identical.
Now run them through attention. Since the rows are the same, and W_Q, W_K, W_V are the same matrices applied to both:
\[
Q_{\text{Frank}} = Q_{\text{Alex}}, \quad K_{\text{Frank}} = K_{\text{Alex}}, \quad V_{\text{Frank}} = V_{\text{Alex}}
\]
Every alignment score involving Frank is identical to the corresponding score involving Alex. Their softmax weights are identical. Their outputs are identical.
The model sees two people at different positions in the sequence — one who spoke first, one who spoke last — and produces the exact same representation for both. Position 1 and position 4 are indistinguishable. Frank might as well not exist.
3. The fix — add a position label to every row
The solution is straightforward: before X enters attention, add a position-specific vector to each row.
\[
X' = X + P
\]
P is the positional encoding matrix — same shape as X, one row per position. Each row encodes where in the sequence that person sits. The addition happens element-wise: every feature dimension of every person gets a small positional nudge.
After the addition, Frank and Alex's rows are no longer identical:
\[
X'_{\text{Frank}} = X_{\text{Frank}} + P_1, \quad X'_{\text{Alex}} = X_{\text{Alex}} + P_4
\]
Since \(X_{\text{Frank}} = X_{\text{Alex}}\) but \(P_1 \neq P_4\), we get \(X'_{\text{Frank}} \neq X'_{\text{Alex}}\). From this point forward, every downstream computation — Q, K, V, alignment scores, attention weights, output — will differ between Frank and Alex.
Three requirements a good P needs to satisfy:
Unique per position — the rows P₁, P₂, P₃, … are pairwise distinct, so every position produces a distinct X'
Consistent across sequences — position 4 always means the same thing, regardless of what's in the sequence
Works for any length — the model shouldn't need to know in advance how long the sequence will be
The original Transformer paper's answer to all three: sine and cosine functions at different frequencies.
4. Designing P — sine and cosine
The positional encoding for position pos and dimension i is:
\[
P_{pos,\ 2i} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad
P_{pos,\ 2i+1} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
\]
where d = 6 is the feature dimension, and i = 0, 1, 2 indexes the three sin/cos pairs. Working out each dimension:
\[
P_{pos,0} = \sin(pos), \quad P_{pos,1} = \cos(pos)
\]
\[
P_{pos,2} = \sin\!\left(\frac{pos}{21.5}\right), \quad P_{pos,3} = \cos\!\left(\frac{pos}{21.5}\right)
\]
\[
P_{pos,4} = \sin\!\left(\frac{pos}{464}\right), \quad P_{pos,5} = \cos\!\left(\frac{pos}{464}\right)
\]
Three pairs, three frequencies. Plugging in pos = 1, 2, 3, 4:
\[
P = \begin{bmatrix}
& d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\
\text{pos=1 (Frank)} & 0.841 & 0.540 & 0.047 & 0.999 & 0.002 & 1.000 \\
\text{pos=2 (Emma)} & 0.909 & -0.416 & 0.093 & 0.996 & 0.004 & 1.000 \\
\text{pos=3 (Daniel)} & 0.141 & -0.990 & 0.139 & 0.990 & 0.006 & 1.000 \\
\text{pos=4 (Alex)} & -0.757 & -0.654 & 0.185 & 0.983 & 0.009 & 1.000
\end{bmatrix}
\]
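This table is mechanical to generate. A numpy sketch (the function name is mine; positions are numbered 1–4 to match the seating order, though common implementations index from 0):

```python
import numpy as np

def positional_encoding(positions, d=6):
    """P[pos, 2i] = sin(pos / 10000**(2i/d)), P[pos, 2i+1] = cos(same angle)."""
    P = np.zeros((len(positions), d))
    for row, pos in enumerate(positions):
        for i in range(d // 2):
            angle = pos / 10000 ** (2 * i / d)
            P[row, 2 * i] = np.sin(angle)      # even dims: sine
            P[row, 2 * i + 1] = np.cos(angle)  # odd dims: cosine
    return P

P = positional_encoding([1, 2, 3, 4])  # Frank, Emma, Daniel, Alex
print(P.round(3))  # reproduces the P matrix above up to last-digit rounding
```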
d₀, d₁ — high frequency
Changes rapidly across positions. Easy to tell pos=1 from pos=2. Good for nearby distinctions — like the seconds hand on a clock.
d₂, d₃ — medium frequency
Changes slowly. Useful for encoding relative distance across longer sequences. Good for medium-range structure.
d₄, d₅ — low frequency
Barely moves across four positions. Across thousands of tokens, these dimensions encode long-range structure — like the hour hand on a clock.
The sine/cosine choice also has a geometric property: for any fixed offset k, the positional encoding at position pos+k can be expressed as a linear transformation of the encoding at position pos. This means the model can learn to attend to "the token k positions ago" — relative position — not just absolute position.
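Concretely, this follows from the angle-addition identities. Writing \(\omega_i = 1/10000^{2i/d}\) for the frequency of pair i:

\[
\begin{bmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{bmatrix}
=
\begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}
\begin{bmatrix} \sin(\omega_i pos) \\ \cos(\omega_i pos) \end{bmatrix}
\]

The rotation matrix depends only on the offset k, never on pos, so one fixed linear map shifts every position's encoding forward by k steps at once.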
5. X' = X + P — Frank and Alex diverge
Adding P to X element-wise:
\[
X' = X + P = \begin{bmatrix}
& d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\
\text{Frank} & 0.98+0.841 & 0.95+0.540 & 0.12+0.047 & 0.97+0.999 & 0.15+0.002 & 0.08+1.000 \\
\text{Emma} & 0.11+0.909 & 0.96-0.416 & 0.94+0.093 & 0.09+0.996 & 0.13+0.004 & 0.18+1.000 \\
\text{Daniel} & 0.14+0.141 & 0.17-0.990 & 0.92+0.139 & 0.11+0.990 & 0.96+0.006 & 0.95+1.000 \\
\text{Alex} & 0.98-0.757 & 0.95-0.654 & 0.12+0.185 & 0.97+0.983 & 0.15+0.009 & 0.08+1.000
\end{bmatrix}
\]
\[
= \begin{bmatrix}
& d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \\
\text{Frank} & 1.821 & 1.490 & 0.167 & 1.969 & 0.152 & 1.080 \\
\text{Emma} & 1.019 & 0.544 & 1.033 & 1.086 & 0.134 & 1.180 \\
\text{Daniel} & 0.281 & -0.820 & 1.059 & 1.100 & 0.966 & 1.950 \\
\text{Alex} & 0.223 & 0.296 & 0.305 & 1.953 & 0.159 & 1.080
\end{bmatrix}
\]
Frank and Alex started identical. Now compare them directly:
\[
X'_{\text{Frank}} = [1.821,\ \ 1.490,\ 0.167,\ 1.969,\ 0.152,\ 1.080]
\]
\[
X'_{\text{Alex}} = [0.223,\ \ 0.296,\ 0.305,\ 1.953,\ 0.159,\ 1.080]
\]
d₀ and d₁ — the high-frequency pair — show the biggest divergence. Frank at pos=1 gets a large positive nudge (+0.841, +0.540). Alex at pos=4 gets a negative nudge (−0.757, −0.654). Their d₀ values go from identical (0.98) to completely different (1.821 vs 0.223).
d₄ and d₅ — barely move. At only four positions apart, the low-frequency pair hasn't had room to diverge yet. Across thousands of tokens, these dimensions would show meaningful separation.
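The divergence can be verified directly. A numpy sketch using the rounded values from the two matrices above:

```python
import numpy as np

X = np.array([
    [0.98, 0.95, 0.12, 0.97, 0.15, 0.08],  # Frank
    [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],  # Emma
    [0.14, 0.17, 0.92, 0.11, 0.96, 0.95],  # Daniel
    [0.98, 0.95, 0.12, 0.97, 0.15, 0.08],  # Alex (identical row to Frank)
])
P = np.array([
    [ 0.841,  0.540, 0.047, 0.999, 0.002, 1.000],  # pos = 1
    [ 0.909, -0.416, 0.093, 0.996, 0.004, 1.000],  # pos = 2
    [ 0.141, -0.990, 0.139, 0.990, 0.006, 1.000],  # pos = 3
    [-0.757, -0.654, 0.185, 0.983, 0.009, 1.000],  # pos = 4
])

X_prime = X + P
print(np.array_equal(X[0], X[3]))              # True:  identical before
print(np.array_equal(X_prime[0], X_prime[3]))  # False: distinct after
```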
6. What positional encoding changes downstream
X' now flows into attention exactly as X did in Part 1 and Part 2. Using the same W_Q, W_K, W_V, we compute all three projections for all four people.
Q' = X' @ W_Q
\[
\begin{aligned}
Q'_{\text{Frank, street-stall}} &= 1.821\times0.97 + 1.490\times0.99 + 0.167\times0.12 + 1.969\times0.98 + 0.152\times0.13 + 1.080\times0.10 = 5.319 \\
Q'_{\text{Frank, upscale}} &= 1.821\times0.08 + 1.490\times0.11 + 0.167\times0.96 + 1.969\times0.09 + 0.152\times0.07 + 1.080\times0.98 = 1.716 \\[6pt]
Q'_{\text{Emma, street-stall}} &= 1.019\times0.97 + 0.544\times0.99 + 1.033\times0.12 + 1.086\times0.98 + 0.134\times0.13 + 1.180\times0.10 = 2.850 \\
Q'_{\text{Emma, upscale}} &= 1.019\times0.08 + 0.544\times0.11 + 1.033\times0.96 + 1.086\times0.09 + 0.134\times0.07 + 1.180\times0.98 = 2.397 \\[6pt]
Q'_{\text{Daniel, street-stall}} &= 0.281\times0.97 + (-0.820)\times0.99 + 1.059\times0.12 + 1.100\times0.98 + 0.966\times0.13 + 1.950\times0.10 = 0.987 \\
Q'_{\text{Daniel, upscale}} &= 0.281\times0.08 + (-0.820)\times0.11 + 1.059\times0.96 + 1.100\times0.09 + 0.966\times0.07 + 1.950\times0.98 = 3.027 \\[6pt]
Q'_{\text{Alex, street-stall}} &= 0.223\times0.97 + 0.296\times0.99 + 0.305\times0.12 + 1.953\times0.98 + 0.159\times0.13 + 1.080\times0.10 = 2.589 \\
Q'_{\text{Alex, upscale}} &= 0.223\times0.08 + 0.296\times0.11 + 0.305\times0.96 + 1.953\times0.09 + 0.159\times0.07 + 1.080\times0.98 = 1.589
\end{aligned}
\]
\[
Q' = \begin{bmatrix}
& \text{street-stall} & \text{upscale} \\
\text{Frank} & 5.319 & 1.716 \\
\text{Emma} & 2.850 & 2.397 \\
\text{Daniel} & 0.987 & 3.027 \\
\text{Alex} & 2.589 & 1.589
\end{bmatrix}
\]
Before positional encoding, Frank and Alex would have been identical at [2.885, 0.474]. Now Frank is [5.319, 1.716] and Alex is [2.589, 1.589] — completely different. Daniel's negative d₁ nudge at pos=3 pushes his upscale signal to 3.027 — the strongest upscale query at the table.
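As a sanity check, the whole Q' table is a single matrix product. A numpy sketch, with W_Q's columns (street-stall, upscale) read off the per-term products above:

```python
import numpy as np

X_prime = np.array([
    [1.821,  1.490, 0.167, 1.969, 0.152, 1.080],  # Frank
    [1.019,  0.544, 1.033, 1.086, 0.134, 1.180],  # Emma
    [0.281, -0.820, 1.059, 1.100, 0.966, 1.950],  # Daniel
    [0.223,  0.296, 0.305, 1.953, 0.159, 1.080],  # Alex
])
# Columns: street-stall, upscale
W_Q = np.array([
    [0.97, 0.08],
    [0.99, 0.11],
    [0.12, 0.96],
    [0.98, 0.09],
    [0.13, 0.07],
    [0.10, 0.98],
])

Q_prime = X_prime @ W_Q
print(Q_prime.round(3))  # Frank ≈ [5.319, 1.716], Alex ≈ [2.589, 1.589]
```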
K' = X' @ W_K
\[
\begin{aligned}
K'_{\text{Frank, knows-street-stall}} &= 1.821\times0.96 + 1.490\times0.98 + 0.167\times0.10 + 1.969\times0.97 + 0.152\times0.12 + 1.080\times0.09 = 5.250 \\
K'_{\text{Frank, knows-upscale}} &= 1.821\times0.09 + 1.490\times0.11 + 0.167\times0.97 + 1.969\times0.08 + 0.152\times0.96 + 1.080\times0.99 = 1.862 \\[6pt]
K'_{\text{Emma, knows-street-stall}} &= 1.019\times0.96 + 0.544\times0.98 + 1.033\times0.10 + 1.086\times0.97 + 0.134\times0.12 + 1.180\times0.09 = 2.789 \\
K'_{\text{Emma, knows-upscale}} &= 1.019\times0.09 + 0.544\times0.11 + 1.033\times0.97 + 1.086\times0.08 + 0.134\times0.96 + 1.180\times0.99 = 2.538 \\[6pt]
K'_{\text{Daniel, knows-street-stall}} &= 0.281\times0.96 + (-0.820)\times0.98 + 1.059\times0.10 + 1.100\times0.97 + 0.966\times0.12 + 1.950\times0.09 = 0.931 \\
K'_{\text{Daniel, knows-upscale}} &= 0.281\times0.09 + (-0.820)\times0.11 + 1.059\times0.97 + 1.100\times0.08 + 0.966\times0.96 + 1.950\times0.99 = 3.908 \\[6pt]
K'_{\text{Alex, knows-street-stall}} &= 0.223\times0.96 + 0.296\times0.98 + 0.305\times0.10 + 1.953\times0.97 + 0.159\times0.12 + 1.080\times0.09 = 2.545 \\
K'_{\text{Alex, knows-upscale}} &= 0.223\times0.09 + 0.296\times0.11 + 0.305\times0.97 + 1.953\times0.08 + 0.159\times0.96 + 1.080\times0.99 = 1.727
\end{aligned}
\]
\[
K' = \begin{bmatrix}
& \text{knows-street-stall} & \text{knows-upscale} \\
\text{Frank} & 5.250 & 1.862 \\
\text{Emma} & 2.789 & 2.538 \\
\text{Daniel} & 0.931 & 3.908 \\
\text{Alex} & 2.545 & 1.727
\end{bmatrix}
\]
Frank — massive street-stall broadcast (5.250). His pos=1 nudge amplified everything.
Emma — balanced (2.789 vs 2.538). Her pos=2 nudge is moderate in both directions.
Daniel — strongly upscale (3.908). His pos=3 negative d₁ suppressed street-stall and lifted upscale.
Alex — street-stall dominant (2.545 vs 1.727), same profile as Frank but positionally dampened.
V' = X' @ W_V
\[
\begin{aligned}
V'_{\text{Frank, info-sichuan}} &= 1.821\times0.97 + 1.490\times0.10 + 0.167\times0.09 + 1.969\times0.98 + 0.152\times0.11 + 1.080\times0.08 = 3.963 \\
V'_{\text{Frank, info-budget}} &= 1.821\times0.11 + 1.490\times0.98 + 0.167\times0.12 + 1.969\times0.10 + 0.152\times0.97 + 1.080\times0.09 = 2.121 \\
V'_{\text{Frank, info-upscale}} &= 1.821\times0.09 + 1.490\times0.08 + 0.167\times0.96 + 1.969\times0.11 + 0.152\times0.09 + 1.080\times0.99 = 1.743 \\[6pt]
V'_{\text{Emma, info-sichuan}} &= 1.019\times0.97 + 0.544\times0.10 + 1.033\times0.09 + 1.086\times0.98 + 0.134\times0.11 + 1.180\times0.08 = 2.308 \\
V'_{\text{Emma, info-budget}} &= 1.019\times0.11 + 0.544\times0.98 + 1.033\times0.12 + 1.086\times0.10 + 0.134\times0.97 + 1.180\times0.09 = 1.114 \\
V'_{\text{Emma, info-upscale}} &= 1.019\times0.09 + 0.544\times0.08 + 1.033\times0.96 + 1.086\times0.11 + 0.134\times0.09 + 1.180\times0.99 = 2.427 \\[6pt]
V'_{\text{Daniel, info-sichuan}} &= 0.281\times0.97 + (-0.820)\times0.10 + 1.059\times0.09 + 1.100\times0.98 + 0.966\times0.11 + 1.950\times0.08 = 1.626 \\
V'_{\text{Daniel, info-budget}} &= 0.281\times0.11 + (-0.820)\times0.98 + 1.059\times0.12 + 1.100\times0.10 + 0.966\times0.97 + 1.950\times0.09 = 0.577 \\
V'_{\text{Daniel, info-upscale}} &= 0.281\times0.09 + (-0.820)\times0.08 + 1.059\times0.96 + 1.100\times0.11 + 0.966\times0.09 + 1.950\times0.99 = 3.115 \\[6pt]
V'_{\text{Alex, info-sichuan}} &= 0.223\times0.97 + 0.296\times0.10 + 0.305\times0.09 + 1.953\times0.98 + 0.159\times0.11 + 1.080\times0.08 = 2.290 \\
V'_{\text{Alex, info-budget}} &= 0.223\times0.11 + 0.296\times0.98 + 0.305\times0.12 + 1.953\times0.10 + 0.159\times0.97 + 1.080\times0.09 = 0.798 \\
V'_{\text{Alex, info-upscale}} &= 0.223\times0.09 + 0.296\times0.08 + 0.305\times0.96 + 1.953\times0.11 + 0.159\times0.09 + 1.080\times0.99 = 1.635
\end{aligned}
\]
\[
V' = \begin{bmatrix}
& \text{info-sichuan} & \text{info-budget} & \text{info-upscale} \\
\text{Frank} & 3.963 & 2.121 & 1.743 \\
\text{Emma} & 2.308 & 1.114 & 2.427 \\
\text{Daniel} & 1.626 & 0.577 & 3.115 \\
\text{Alex} & 2.290 & 0.798 & 1.635
\end{bmatrix}
\]
Frank — the large positive nudge at pos=1 amplified his info-sichuan signal to 3.963, nearly double the original 2.031 from Part 1.
Daniel — the negative d₁ nudge at pos=3 suppressed info-sichuan (1.626) but strongly amplified info-upscale (3.115).
Alex — the negative nudge at pos=4 pulled info-sichuan down to 2.290 and info-budget down to 0.798.
Frank and Alex now have completely different V' vectors — same knowledge, different positions, different contributions.
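K' and V' are the same kind of product. A numpy sketch, with W_K and W_V read off the per-term arithmetic above (column order as in the matrices):

```python
import numpy as np

X_prime = np.array([
    [1.821,  1.490, 0.167, 1.969, 0.152, 1.080],  # Frank
    [1.019,  0.544, 1.033, 1.086, 0.134, 1.180],  # Emma
    [0.281, -0.820, 1.059, 1.100, 0.966, 1.950],  # Daniel
    [0.223,  0.296, 0.305, 1.953, 0.159, 1.080],  # Alex
])
# Columns: knows-street-stall, knows-upscale
W_K = np.array([
    [0.96, 0.09],
    [0.98, 0.11],
    [0.10, 0.97],
    [0.97, 0.08],
    [0.12, 0.96],
    [0.09, 0.99],
])
# Columns: info-sichuan, info-budget, info-upscale
W_V = np.array([
    [0.97, 0.11, 0.09],
    [0.10, 0.98, 0.08],
    [0.09, 0.12, 0.96],
    [0.98, 0.10, 0.11],
    [0.11, 0.97, 0.09],
    [0.08, 0.09, 0.99],
])

K_prime = X_prime @ W_K
V_prime = X_prime @ W_V
print(K_prime.round(3))
print(V_prime.round(3))  # Daniel's row ≈ [1.626, 0.577, 3.115]
```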
Let's look at V specifically, since it carries the actual content that gets blended into the output.
Without positional encoding: Frank and Alex share the same X row, so:
\[
V_{\text{Frank}} = X_{\text{Frank}} \cdot W_V = X_{\text{Alex}} \cdot W_V = V_{\text{Alex}}
\]
Their content contributions to the output are identical. The model cannot distinguish whose knowledge is whose.
With positional encoding: Frank and Alex now have different X' rows, so:
\[
V'_{\text{Frank}} = X'_{\text{Frank}} \cdot W_V = (X_{\text{Frank}} + P_1) \cdot W_V
\]
\[
V'_{\text{Alex}} = X'_{\text{Alex}} \cdot W_V = (X_{\text{Alex}} + P_4) \cdot W_V
\]
Since \(P_1 \neq P_4\), we get \(V'_{\text{Frank}} \neq V'_{\text{Alex}}\). The same knowledge, held by two people in different positions, now contributes differently to the final blend.
This is a subtle but important point: V is supposed to carry "what I actually know." Frank and Alex know the same things — same restaurants, same dishes, same wait times. But after positional encoding, their V vectors differ, because position has been mixed into the representation.
Think of it this way: Frank speaks first, before anyone else has said anything — his recommendation lands in an empty room and anchors the conversation. Alex says the exact same thing last, after Emma and Daniel have already weighed in — the context is completely different, and so is the impact. Same knowledge, different position, different meaning. Positional encoding bakes that reality into the vectors.
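Everything in this section condenses into one end-to-end check: run the same attention twice, once on X and once on X + P, and compare Frank's and Alex's output rows. A numpy sketch, using the weight values from the worked arithmetic above:

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention over the rows of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V

X = np.array([
    [0.98, 0.95, 0.12, 0.97, 0.15, 0.08],  # Frank
    [0.11, 0.96, 0.94, 0.09, 0.13, 0.18],  # Emma
    [0.14, 0.17, 0.92, 0.11, 0.96, 0.95],  # Daniel
    [0.98, 0.95, 0.12, 0.97, 0.15, 0.08],  # Alex (same row as Frank)
])
P = np.array([
    [ 0.841,  0.540, 0.047, 0.999, 0.002, 1.000],
    [ 0.909, -0.416, 0.093, 0.996, 0.004, 1.000],
    [ 0.141, -0.990, 0.139, 0.990, 0.006, 1.000],
    [-0.757, -0.654, 0.185, 0.983, 0.009, 1.000],
])
W_Q = np.array([[0.97, 0.08], [0.99, 0.11], [0.12, 0.96],
                [0.98, 0.09], [0.13, 0.07], [0.10, 0.98]])
W_K = np.array([[0.96, 0.09], [0.98, 0.11], [0.10, 0.97],
                [0.97, 0.08], [0.12, 0.96], [0.09, 0.99]])
W_V = np.array([[0.97, 0.11, 0.09], [0.10, 0.98, 0.08],
                [0.09, 0.12, 0.96], [0.98, 0.10, 0.11],
                [0.11, 0.97, 0.09], [0.08, 0.09, 0.99]])

out_plain = attention(X, W_Q, W_K, W_V)    # no position information
out_pos = attention(X + P, W_Q, W_K, W_V)  # with positional encoding

print(np.array_equal(out_plain[0], out_plain[3]))  # True:  Frank == Alex
print(np.array_equal(out_pos[0], out_pos[3]))      # False: they diverge
```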
7. The full picture
Positional encoding is not part of attention itself — it happens before attention, as a one-line preprocessing step. Everything in Part 1 and Part 2 still applies. The only difference is that the input is now X' instead of X.
X — raw feature matrix. Each row is one token's embedding — no position information yet.
+P — add positional encoding: X' = X + P. Each row gets a position-specific nudge. Frank and Alex diverge here.
Q = X'W_Q — decode cravings. Now position-aware: Frank and Alex produce different queries.
K = X'W_K — decode knowledge. Position affects what knowledge signal each person broadcasts.
V = X'W_V — extract content. Position affects what content each person contributes.
∑ — softmax(QKᵀ/√d) @ V → output. Attention now operates on position-aware representations.
Two things worth noting before we move on:
Positional encoding is added, not concatenated. It would be natural to think "just append the position as an extra feature." But addition keeps the dimensionality the same — the rest of the architecture doesn't need to change. And because the model learns W_Q, W_K, W_V jointly with the positional signal already mixed in, it learns to disentangle content from position as needed.
The sine/cosine encoding is fixed, not learned. The original Transformer paper used the formula above — no training required. Later models like BERT learned positional embeddings instead, treating each position as a trainable vector. Both approaches work. The sine/cosine version has the advantage of generalizing to sequence lengths longer than anything seen during training.
Attention is powerful but blind to order. Positional encoding gives it eyes. X' = X + P is a single line — but it's the line that turns a bag of features into a sequence.
Frank and Alex walk in with identical profiles. They walk out of attention with different representations. The only thing that changed was where they sat.