Assembling the transformer

Date

Wednesday, March 4, 2026

Notes

In class, we assembled the full transformer architecture (Vaswani et al., 2017) component by component, then joined the encoder and decoder.

Below are key concepts from this lecture:

Tokenization and embedding

Two steps prepare any input for the transformer:

Tokenize: break the input into discrete units at fixed positions. For text: words or subwords. For other domains: image patches, audio windows, time series measurements.
Embed: map each token to a fixed-length vector of dimension $d_\text{model}$, producing a matrix of shape $\text{sequence\_length} \times d_\text{model}$.

Each row encodes what a token is. At this stage the rows carry no information about where in the sequence they appear.
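The two steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production tokenizer: the whitespace `tokenize` function, the four-word `vocab`, the random `embedding_table`, and `d_model = 8` are all assumptions made for the example (a trained model learns its embedding table).

```python
import numpy as np

def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for a real subword tokenizer)."""
    return text.lower().split()

# Hypothetical toy vocabulary mapping each token to an integer id.
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}

d_model = 8  # embedding dimension (illustrative value)
rng = np.random.default_rng(0)
# Embedding table: one row of length d_model per vocabulary entry.
# In a trained model these values are learned; here they are random.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up each token's row in the embedding table."""
    ids = np.array([vocab[t] for t in tokens])
    return embedding_table[ids]  # shape: (sequence_length, d_model)

tokens = tokenize("The dog bit the man")
X = embed(tokens)
print(tokens)   # ['the', 'dog', 'bit', 'the', 'man']
print(X.shape)  # (5, 8)
```

Rows 0 and 3 of `X` are identical because both hold the token "the"; nothing in the matrix yet says which one came first.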

Positional encoding

To inject order, a positional vector of length $d_\text{model}$ is added to each embedding row:

$$\text{row}'_i = \text{embedding}_i + \text{position}_i$$

Now each row carries both the identity of its token and its position in the sequence. The exact form of positional encoding varies across architectures.

Without positional encoding, the model could not tell “the dog bit the man” from “the man bit the dog”: both contain the same tokens, only in a different order.
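As one concrete form, the sketch below uses the sinusoidal encoding from Vaswani et al. (2017); other architectures use learned or relative positions instead. The random `embeddings` matrix and the sizes are placeholder assumptions, and the function assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(sequence_length, d_model):
    """Sinusoidal positional vectors (Vaswani et al., 2017); assumes d_model is even."""
    positions = np.arange(sequence_length)[:, None]         # (sequence_length, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)  # (sequence_length, d_model / 2)
    pe = np.zeros((sequence_length, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

sequence_length, d_model = 5, 8
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(sequence_length, d_model))  # placeholder embeddings
embeddings[3] = embeddings[0]  # positions 0 and 3 hold the same token

rows = embeddings + sinusoidal_positions(sequence_length, d_model)
print(np.allclose(embeddings[0], embeddings[3]))  # True
print(np.allclose(rows[0], rows[3]))              # False
```

After the addition, two occurrences of the same token no longer share a row, so later layers can distinguish them by position.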

One encoder layer: two sub-layers

Every encoder layer applies two operations in sequence:

Multi-head attention: the full embedding matrix goes in; a new matrix of the same shape comes out. Each row is now a blend of information drawn from all rows of the sequence, its own included.
Feed-forward network (FFN): each row is passed independently through the same small two-layer MLP. Given what attention has assembled for a position, the FFN asks: what should our representation of this token become?

Attention communicates across rows; the FFN transforms within each row. Together: first mix, then transform.
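The mix-then-transform pattern can be sketched as follows. To stay short, the example uses a single attention head rather than the multi-head version from the lecture, and omits the residual connections and normalization covered in later sections. All weight matrices are random placeholders, and `d_ff = 32` is an illustrative choice.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row of weights sums to 1
    return weights @ V                       # each output row blends all value rows

def feed_forward(X, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, applied to every row with the same weights."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

mixed = self_attention(X, Wq, Wk, Wv)      # mix across rows
out = feed_forward(mixed, W1, b1, W2, b2)  # transform within each row
print(out.shape)  # (5, 8)
```

Passing a single row of `mixed` through `feed_forward` gives the same result as that row of `out`, which is what "transforms within each row" means in practice.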

Residual connections

Deep networks can suffer from vanishing gradients: error signals shrink as they propagate backward through many layers, so early layers learn very slowly.

A residual connection bypasses this by adding the layer’s input directly to its output:

$$\mathbf{y} = \mathbf{x} + \text{Sublayer}(\mathbf{x})$$

This creates a direct path for gradients to flow backward, so early layers keep receiving a usable learning signal even in a deep stack.
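Differentiating the residual formula shows where the direct path comes from. In this short derivation, $\mathcal{L}$ (the training loss) and $\mathbf{I}$ (the identity matrix) are notation introduced here for illustration, with gradients treated as row vectors:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{I} + \frac{\partial\,\text{Sublayer}(\mathbf{x})}{\partial \mathbf{x}} \quad\Longrightarrow\quad \frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} + \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\,\frac{\partial\,\text{Sublayer}(\mathbf{x})}{\partial \mathbf{x}}$$

The first term carries the upstream gradient through unchanged. Even when the sublayer's own Jacobian is small, the gradient reaching $\mathbf{x}$ is not forced toward zero by this layer.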

Layer normalization and the Add & Norm pattern

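Layer normalization stabilizes training by rescaling each row to zero mean and unit variance, then applying a learned elementwise scale $\boldsymbol{\gamma}$ and shift $\boldsymbol{\beta}$. With $\mu$ and $\sigma^2$ the mean and variance of the $d_\text{model}$ entries of a row $\mathbf{x}$, and $\epsilon$ a small constant for numerical stability:

$$\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}$$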

Both sub-layers (attention and FFN) are wrapped in the Add & Norm pattern:

$$\mathbf{x} \leftarrow \text{LayerNorm}\!\left(\mathbf{x} + \text{Sublayer}(\mathbf{x})\right)$$
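A minimal NumPy sketch of the Add & Norm pattern follows. It uses the common $\sqrt{\sigma^2 + \epsilon}$ form of the denominator. The `tanh` sublayer, the weight matrix `W`, and `eps = 1e-5` are assumptions for the example; any shape-preserving function could stand in for the sublayer.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize each row of X to zero mean and unit variance, then scale and shift."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps) * gamma + beta

def add_and_norm(X, sublayer, gamma, beta):
    """Residual connection followed by layer normalization."""
    return layer_norm(X + sublayer(X), gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(loc=3.0, scale=2.0, size=(seq_len, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)  # typical initial values before training

W = rng.normal(size=(d_model, d_model))  # hypothetical stand-in sublayer weights
Y = add_and_norm(X, lambda x: np.tanh(x @ W), gamma, beta)

print(Y.shape)                                      # (5, 8)
print(np.allclose(Y.mean(axis=-1), 0.0))            # True
print(np.allclose(Y.std(axis=-1), 1.0, atol=1e-3))  # True
```

The statistics are computed along the last axis, so each token's row is normalized on its own, independent of the other rows and of the sequence length.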

The encoder

The encoder runs the full input matrix through its $N$ layers one after another. Within each layer all positions are processed in parallel; there is no recurrence over the sequence.

Output: a matrix of the same shape as the input — one contextualized vector per input token, encoding each token’s meaning in the context of the whole sequence.

This matrix is passed to the decoder, whose layers attend to it through cross-attention (encoder-decoder attention).

The decoder

Training: the decoder receives the target sequence shifted right (prepended with <BOS>). A causal mask prevents each position from attending to future positions — position 3 can only see positions 1, 2, 3.
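One common way to implement the causal mask is to set the disallowed attention scores to $-\infty$ before the softmax, so they receive zero weight. The sketch below assumes this approach and uses random placeholder scores. Note that the code indexes positions from 0, while the text above counts from 1.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean matrix: entry (i, j) is True when position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over each row, with disallowed entries forced to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)                                    # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # placeholder attention scores
weights = masked_softmax(scores, causal_mask(seq_len))

print(causal_mask(seq_len).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
print(np.allclose(np.triu(weights, k=1), 0.0))  # True
```

Because every row keeps its diagonal entry, no row is fully masked and each row of `weights` still sums to 1.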

Inference (autoregressive generation):

Feed <BOS> → produce a probability distribution → sample token $t_1$.
Feed <BOS>, $t_1$ → sample token $t_2$.
Repeat until <EOS> is generated.

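A final linear layer followed by a softmax turns each row of the decoder’s output matrix into a probability distribution over the vocabulary; during generation, the next token is sampled from the last row’s distribution.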
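The generation loop can be sketched as below. Everything model-specific is a stand-in: `toy_decoder` is a hypothetical placeholder (a running mean of embeddings) for the real decoder stack, the vocabulary and weights are made up, and `max_new_tokens` is an added safeguard so the loop ends even if `<EOS>` is never sampled.

```python
import numpy as np

vocab = ["<BOS>", "<EOS>", "the", "dog", "bit", "man"]  # hypothetical toy vocabulary
BOS, EOS = 0, 1
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
W_out = rng.normal(size=(d_model, len(vocab)))  # final linear layer (random placeholder)

def toy_decoder(token_ids):
    """Stand-in for the decoder stack: returns one d_model vector per position.
    A real decoder would apply masked self-attention, cross-attention, and FFN layers."""
    X = embedding_table[np.array(token_ids)]
    counts = np.arange(1, len(token_ids) + 1)[:, None]
    return np.cumsum(X, axis=0) / counts  # running mean: row i depends only on rows 0..i

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(max_new_tokens=10):
    token_ids = [BOS]
    for _ in range(max_new_tokens):
        hidden = toy_decoder(token_ids)      # (current_length, d_model)
        probs = softmax(hidden[-1] @ W_out)  # distribution over the vocabulary
        next_id = int(rng.choice(len(vocab), p=probs))
        token_ids.append(next_id)            # the sampled token becomes part of the next input
        if next_id == EOS:
            break
    return token_ids

ids = generate()
print([vocab[i] for i in ids])
```

Only the last row of `hidden` is used at each step, since earlier rows correspond to tokens that have already been chosen.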