Attention and the transformer

Date

Monday, March 2, 2026

Notes

In class, we introduced the attention mechanism — the core innovation behind the transformer — starting from embeddings and cosine similarity, then identifying the bottleneck in RNNs that attention resolves.

Below are key concepts from this lecture:

Embeddings (recap)

An embedding is a vector representation of an object (a word, an image, a point on Earth, etc.). Objects that appear in similar contexts are placed near each other in embedding space.

To create embeddings you train a model — Word2Vec for words, a CNN for images. The result is a continuous, high-dimensional space where geometric relationships reflect semantic ones.

Cosine similarity

To compare two embeddings, we measure the angle $\theta$ between their vectors, regardless of magnitude:

$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$$

$\cos(\theta) \approx 1$: vectors nearly parallel — semantically similar.
$\cos(\theta) \approx 0$: vectors orthogonal — unrelated.

Example: “earthquake damage on the coast” vs. “seismic destruction near shore” → $\cos(\theta) \approx 1$. The first phrase vs. “prolonged drought and water shortage” → $\cos(\theta) \approx 0$.
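The formula can be sketched in a few lines of plain Python. The three-dimensional vectors below are invented for illustration and are not real embeddings of the example phrases; they only show that the measure ignores magnitude and returns 0 for orthogonal vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumed values, not produced by any model).
quake = [0.9, 0.1, 0.0]
seismic = [1.8, 0.2, 0.0]   # same direction as quake, twice the length
drought = [0.0, 0.0, 1.0]   # orthogonal to both

print(cosine_similarity(quake, seismic))  # approximately 1
print(cosine_similarity(quake, drought))  # 0.0
```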

The information bottleneck

Standard RNNs maintain a fixed-size hidden state updated at every step:

$$\mathbf{h}_t = f\!\left(\mathbf{h}_{t-1},\, \mathbf{x}_t\right)$$

Every new observation must be squeezed into the same vector, so early observations are progressively diluted. The entire history of a long sequence is compressed into a single vector before a prediction is made.

Analogy: a geologist reading a rock core one centimeter at a time, with room for only one summary note, cannot preserve the full stratigraphy.
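A minimal sketch of the dilution effect, assuming $f$ is a tanh of a linear combination (one common choice; the notes do not fix $f$). The weights `W_h` and `W_x` are hypothetical hand-picked values, not trained ones. The hidden state keeps the same size however long the sequence is, and the trace of the first observation fades as further steps are processed.

```python
import math

def rnn_step(h_prev, x_t, W_h, W_x):
    """One update h_t = tanh(W_h h_{t-1} + W_x x_t); tanh is an assumed f."""
    h_t = []
    for row_h, row_x in zip(W_h, W_x):
        pre = sum(w * h for w, h in zip(row_h, h_prev))
        pre += sum(w * x for w, x in zip(row_x, x_t))
        h_t.append(math.tanh(pre))
    return h_t

# Hypothetical weights: hidden size 2, input size 1.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]

def encode(sequence):
    """Fold a whole sequence into one fixed-size hidden state."""
    h = [0.0, 0.0]
    for x_t in sequence:
        h = rnn_step(h, x_t, W_h, W_x)
    return h

short_h = encode([[1.0]])                     # one observation
long_h = encode([[1.0]] + [[0.0]] * 20)       # same start, 20 more steps

print(len(short_h), len(long_h))  # both 2: the state never grows
print(short_h[0], long_h[0])      # the first observation has almost vanished
```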

Attention: query, key, and value

Rather than summarizing into a bottleneck, attention looks at the whole sequence and decides what to focus on. The input vector $\mathbf{x}_i$ at every position is projected into three representations by learned weight matrices $W_Q$, $W_K$, $W_V$:

Query (Q): “What am I looking for?”
Key (K): “What do I contain?”
Value (V): “What do I contribute?”
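
Written out for the input vector $\mathbf{x}_i$ at position $i$, and assuming the same matrix-times-vector convention these notes use for the value vectors, the three projections would be:

$$\mathbf{q}_i = W_Q\,\mathbf{x}_i, \qquad \mathbf{k}_i = W_K\,\mathbf{x}_i, \qquad \mathbf{v}_i = W_V\,\mathbf{x}_i$$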

Computing attention

Four steps:

Score: for a given query position, the compatibility with position $i$ is the dot product of the query with that position's key $\mathbf{k}_i$. A higher dot product leads to a higher weight.
Normalize: apply softmax to the scores to obtain weights $\alpha_i$ that are positive and sum to 1 (an attention distribution).
Retrieve: compute value vectors $\mathbf{v}_i = W_V\, \mathbf{x}_i$ for every position.
Output: weighted sum of all value vectors using the attention weights $\alpha_i$: $$\text{output} = \sum_i \alpha_i\, \mathbf{v}_i$$
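
The four steps above can be sketched for a single query position in plain Python. The inputs `xs` and the identity weight matrices are hypothetical values chosen so the arithmetic is easy to follow; in a trained model the matrices would be learned. The sketch follows the steps as stated in these notes; the original transformer additionally divides the scores by $\sqrt{d_k}$ before the softmax, which is omitted here.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query_index, xs, W_Q, W_K, W_V):
    """Attention output for one query position over the whole sequence."""
    q = matvec(W_Q, xs[query_index])
    keys = [matvec(W_K, x) for x in xs]
    # 1. Score: dot product of the query with every key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # 2. Normalize: softmax turns scores into weights that sum to 1.
    alphas = softmax(scores)
    # 3. Retrieve: value vector for every position.
    values = [matvec(W_V, x) for x in xs]
    # 4. Output: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]
    return alphas, output

# Hypothetical toy sequence: positions 0 and 2 carry the same vector.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed stand-in for learned weights

alphas, output = attend(0, xs, identity, identity, identity)
print(alphas)  # positions 0 and 2 receive equal, larger weights than position 1
print(output)
```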

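What is learned: the weight matrices $W_Q$, $W_K$, $W_V$, which encode a general strategy for looking, containing, and contributing. The attention weights themselves are not stored; they are computed fresh from each new input, including at inference time.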

Multi-head attention

A single attention head asks one kind of question. Multi-head attention runs several heads in parallel, each with its own $W_Q$, $W_K$, $W_V$.

Each head can learn to attend to a different type of relationship (e.g., proximity, semantic similarity, syntactic structure). Outputs are concatenated and projected back to the original dimension.
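
A rough sketch of the concatenate-and-project step, under stated assumptions: model dimension 4, two heads of dimension 2, and hand-picked matrices (`first_two`, `last_two`, and an identity output projection `W_O`) that are hypothetical stand-ins for learned weights. The name `W_O` for the output projection is also an assumption, since the notes do not name it.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(query_index, xs, W_Q, W_K, W_V):
    """Single-head attention output for one query position."""
    q = matvec(W_Q, xs[query_index])
    scores = [sum(a * b for a, b in zip(q, matvec(W_K, x))) for x in xs]
    alphas = softmax(scores)
    values = [matvec(W_V, x) for x in xs]
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]

def multi_head(query_index, xs, heads, W_O):
    """Run every head, concatenate the outputs, project with W_O."""
    concat = []
    for W_Q, W_K, W_V in heads:
        concat.extend(head_output(query_index, xs, W_Q, W_K, W_V))
    return matvec(W_O, concat)

# Hypothetical projections: each head looks at a different half of the input.
first_two = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
last_two = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
heads = [(first_two, first_two, first_two), (last_two, last_two, last_two)]
W_O = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

xs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
out = multi_head(0, xs, heads, W_O)
print(len(out))  # 4: back to the model dimension
print(out)
```

Here the two heads produce different attention patterns from the same input because their projections differ, which is the mechanism that lets trained heads specialize.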

Specialization — different heads attending to different relationships — emerges from training, not from explicit design.