Attention and the transformer
Date
Monday, March 2, 2026
Notes
In class, we introduced the attention mechanism — the core innovation behind the transformer — starting from embeddings and cosine similarity, then identifying the bottleneck in RNNs that attention resolves.
Below are key concepts from this lecture:
Embeddings (recap)
An embedding is a vector representation of an object (a word, an image, a point on Earth, etc.). Objects that appear in similar contexts are placed near each other in embedding space.
To create embeddings you train a model — Word2Vec for words, a CNN for images. The result is a continuous, high-dimensional space where geometric relationships reflect semantic ones.
Cosine similarity
To compare two embeddings, we measure the angle $\theta$ between their vectors, regardless of magnitude:
$$\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$$
- $\cos(\theta) \approx 1$: vectors nearly parallel — semantically similar.
- $\cos(\theta) \approx 0$: vectors orthogonal — unrelated.
Example: “earthquake damage on the coast” vs. “seismic destruction near shore” → $\cos(\theta) \approx 1$. The first phrase vs. “prolonged drought and water shortage” → $\cos(\theta) \approx 0$.
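To make the formula concrete, here is a minimal NumPy sketch. The three-dimensional vectors are made up for illustration and are not real embeddings of the phrases above; the function name `cosine_similarity` is also just a label chosen here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (assumed values, chosen only to show the two extremes).
quake = np.array([0.9, 0.1, 0.0])
seismic = np.array([1.8, 0.2, 0.0])   # same direction as quake, twice the length
drought = np.array([0.0, 0.0, 1.0])   # orthogonal to quake

sim_same = cosine_similarity(quake, seismic)   # close to 1: magnitude is ignored
sim_diff = cosine_similarity(quake, drought)   # 0: the vectors are orthogonal
```

Because the dot product is divided by both norms, doubling the length of a vector leaves the similarity unchanged.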
The information bottleneck
Standard RNNs maintain a fixed-size hidden state updated at every step:
$$\mathbf{h}_t = f\!\left(\mathbf{h}_{t-1},\, \mathbf{x}_t\right)$$
Every new observation must be squeezed into the same vector, so early observations are progressively diluted. The entire history of a long sequence is compressed into a single vector before a prediction is made.
Analogy: a geologist reading a rock core one centimeter at a time, with room for only one summary note, cannot preserve the full stratigraphy.
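The bottleneck can be seen in a small sketch of the recurrence. The notes only specify a generic update function $f$; the choice $f = \tanh(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b})$ below is an assumption (a common textbook form), and the parameters are random stand-ins rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, seq_len = 4, 3, 50   # illustrative sizes

# Hypothetical parameters; in a real RNN these would be learned.
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

def rnn_step(h_prev, x_t):
    """One update h_t = f(h_{t-1}, x_t), with f assumed to be tanh of an affine map."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

xs = rng.normal(size=(seq_len, input_size))   # a random input sequence
h = np.zeros(hidden_size)
for x_t in xs:
    h = rnn_step(h, x_t)   # the previous state is overwritten at every step

# h is all that is carried forward: 50 inputs summarized in 4 numbers.
```

However long the sequence is, the state passed to the prediction step has the same fixed size.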
Attention: query, key, and value
Rather than summarizing into a bottleneck, attention looks at the whole sequence and decides what to focus on. Every position in the sequence receives three representations, each computed from its embedding with a learned weight matrix ($W_Q$, $W_K$, $W_V$):
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What do I contribute?”
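As a rough sketch of how the three representations are produced, the code below projects every position's embedding with three matrices. The sizes are illustrative and the matrices are random stand-ins for learned weights; storing one embedding per row of `X` is a layout choice made here, not something the notes prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4   # illustrative sizes (assumed)

X = rng.normal(size=(seq_len, d_model))   # one embedding x_i per row

# Random stand-ins for the learned weight matrices.
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))
W_V = rng.normal(size=(d_k, d_model))

# Row i of each result equals W @ x_i, so every position gets its own q, k, v.
Q = X @ W_Q.T
K = X @ W_K.T
V = X @ W_V.T
```

The same three matrices are applied at every position; only the input embedding differs.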
Computing attention
Four steps:
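- Score: the compatibility of the current position with position $i$ is the dot product of the current position's query with position $i$'s key, $s_i = \mathbf{q} \cdot \mathbf{k}_i$. High dot product → high weight.
- Normalize: apply softmax to the scores to get attention weights $\alpha_i$ that sum to 1 (an attention distribution).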
- Retrieve: compute value vectors $\mathbf{v}_i = W_V\, \mathbf{x}_i$ for every position.
- Output: weighted sum of all value vectors using the attention weights: $$\text{output} = \sum_i \alpha_i\, \mathbf{v}_i$$
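What is learned: the weight matrices $W_Q$, $W_K$, $W_V$, which encode an abstract strategy for looking, containing, and contributing. The attention weights themselves are not stored; they are computed fresh from each new input at inference time.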
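The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `Q`, `K`, and `V` are random placeholders, and the `scale` flag is an addition made here. Transformers conventionally divide the scores by $\sqrt{d_k}$; passing `scale=False` gives the plain dot-product version described in the steps.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V, scale=True):
    """Return (output, weights) for dot-product attention over all positions."""
    scores = Q @ K.T                      # Score: scores[i, j] = q_i . k_j
    if scale:
        scores = scores / np.sqrt(K.shape[-1])
    weights = softmax(scores)             # Normalize: each row sums to 1
    output = weights @ V                  # Output: row i = sum_j weights[i, j] * v_j
    return output, weights

# Random placeholders for the projected queries, keys, and values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
output, weights = attention(Q, K, V)
```

Row `i` of `weights` is the attention distribution for position `i`. It depends on the input through `Q` and `K`, which is why it changes with every new sequence even though the projection matrices stay fixed.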
Multi-head attention
A single attention head asks one kind of question. Multi-head attention runs several heads in parallel, each with its own $W_Q$, $W_K$, $W_V$.
Each head can learn to attend to a different type of relationship (e.g., proximity, semantic similarity, syntactic structure). Outputs are concatenated and projected back to the original dimension.
Specialization — different heads attending to different relationships — emerges from training, not from explicit design.
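A compact sketch of the idea follows, under stated assumptions: each head is given a dimension of `d_model // n_heads`, scores are scaled by the square root of the head dimension, and the output projection is called `W_O` (a name introduced here). All weights are random stand-ins, so no specialization is visible; that would only emerge from training.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outputs.append(weights @ V)             # (seq_len, d_head) per head
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, n_heads * d_head)
    return concat @ W_O.T                       # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2   # illustrative sizes (assumed)
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))
heads = [
    tuple(rng.normal(size=(d_head, d_model)) for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(size=(d_model, n_heads * d_head))

Y = multi_head_attention(X, heads, W_O)   # same shape as X
```

The loop over heads keeps the sketch readable; the heads do not depend on each other, so they can be computed in parallel.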