Neural networks: worked examples

Date

Wednesday, February 25, 2026

Notes

In class, we worked through two concrete examples — one RNN and one CNN — to make the mechanics of these architectures tangible.


Below are the key steps from each example:

RNN example: the hello task

Task: given one letter of the word “hello” at a time, predict the next letter.

The network processes one letter per time step and uses its hidden state to carry information from previous steps.

At each step: $$h^{(t)} = \tanh\!\left(W_{xh}\, x^{(t)} + W_{hh}\, h^{(t-1)}\right), \qquad \hat{y}^{(t)} = W_{hy}\, h^{(t)}$$
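A single time step of this update can be sketched in plain Python. The weight values below are made up and untrained, and the choice of 3 hidden units is an assumption for illustration; only the two formulas come from the notes.

```python
import math

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x, h_prev, W_xh, W_hh, W_hy):
    """One step: h = tanh(W_xh x + W_hh h_prev), y_hat = W_hy h."""
    pre = [p + q for p, q in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    h = [math.tanh(v) for v in pre]
    y_hat = matvec(W_hy, h)
    return h, y_hat

# Hypothetical, untrained weights: 4 inputs, 3 hidden units, 4 outputs.
W_xh = [[0.5, -0.3, 0.1, 0.2],
        [0.0, 0.4, -0.2, 0.3],
        [-0.1, 0.2, 0.6, -0.4]]
W_hh = [[0.1, 0.0, -0.2],
        [0.3, 0.1, 0.0],
        [0.0, -0.1, 0.2]]
W_hy = [[0.2, -0.1, 0.0],
        [0.0, 0.3, 0.1],
        [-0.2, 0.0, 0.4],
        [0.1, 0.1, -0.3]]

h0 = [0.0, 0.0, 0.0]   # initial hidden state, assumed to be all zeros
x_h = [1, 0, 0, 0]     # one-hot vector for the letter "h"
h, y_hat = rnn_step(x_h, h0, W_xh, W_hh, W_hy)
```

Feeding the returned `h` back in as `h_prev` on the next call is what lets the network carry information between steps.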

One-hot encoding

The four unique letters {h, e, l, o} are each represented as a vector with a single 1 and all other entries 0:

$$x^{\mathtt{h}} = [1,0,0,0]^T, \quad x^{\mathtt{e}} = [0,1,0,0]^T, \quad x^{\mathtt{l}} = [0,0,1,0]^T, \quad x^{\mathtt{o}} = [0,0,0,1]^T$$
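The encoding above can be written as a small helper. This is a minimal sketch; the names `VOCAB` and `one_hot` are chosen here for illustration, and pairing the inputs "hell" with the targets "ello" is one natural reading of the next-letter task.

```python
VOCAB = ["h", "e", "l", "o"]  # index order matches the vectors above

def one_hot(letter):
    """Return a list with a single 1 at the letter's index in VOCAB."""
    vec = [0] * len(VOCAB)
    vec[VOCAB.index(letter)] = 1
    return vec

# Each input letter is paired with the letter that follows it in "hello".
inputs = [one_hot(c) for c in "hell"]
targets = list("ello")
```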

The three weight matrices serve distinct roles: $W_{xh}$ maps the current input to the hidden layer, $W_{hh}$ carries the previous hidden state forward, and $W_{hy}$ maps the hidden state to the output.

The hidden-state problem

If the hidden layer has as many neurons as there are letters, each neuron can simply learn to “be” a letter.

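After processing “hel,” the hidden state $[0,0,1,0]$ is identical to the state produced by a lone “l,” and it is the same again after “hell.” The network encodes which letter it just saw, not where that letter sits in the sequence, so it cannot tell the first “l” in “hello” (next letter “l”) from the second (next letter “o”) and must predict at least one of them incorrectly.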
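The collision can be made concrete with a toy model of this degenerate case. The function below is not an RNN; it is a hypothetical stand-in that assumes the hidden state is simply a one-hot copy of the most recent letter.

```python
VOCAB = ["h", "e", "l", "o"]

def degenerate_state(prefix):
    """Hidden state when each neuron just 'is' a letter:
    a one-hot copy of the most recent input, with no memory of earlier ones."""
    last = prefix[-1]
    return [1 if c == last else 0 for c in VOCAB]

state_lone_l = degenerate_state("l")
state_hel = degenerate_state("hel")    # the correct next letter is "l"
state_hell = degenerate_state("hell")  # the correct next letter is "o"

# All three states are [0, 0, 1, 0], so any output computed from the state
# alone must be the same prediction after the first and the second "l".
states_collide = state_lone_l == state_hel == state_hell
```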

Solution: distributed representations

Using fewer hidden neurons than vocabulary items (e.g., 3 instead of 4) forces the network to distribute information across neurons. No single unit represents a single letter.

After “hel,” the hidden state differs from the state after a lone “l,” because sequential context is encoded collectively. This distributed representation is learned through training.
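The dependence on context can be checked numerically. In the sketch below the weights are hypothetical and untrained, and the zero initial state is an assumption, so it only shows that the recurrent term $W_{hh}\, h^{(t-1)}$ makes the 3-unit state depend on earlier letters. It does not show what a trained network would predict.

```python
import math

VOCAB = ["h", "e", "l", "o"]

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def final_state(letters, W_xh, W_hh):
    """Run h = tanh(W_xh x + W_hh h_prev) over the letters; return the last h."""
    h = [0.0] * len(W_hh)  # assumed all-zero initial state
    for c in letters:
        x = [1 if c == v else 0 for v in VOCAB]  # one-hot input
        pre = [p + q for p, q in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(v) for v in pre]
    return h

# Hypothetical, untrained weights with 3 hidden units (fewer than 4 letters).
W_xh = [[0.5, -0.3, 0.1, 0.2],
        [0.0, 0.4, -0.2, 0.3],
        [-0.1, 0.2, 0.6, -0.4]]
W_hh = [[0.1, 0.0, -0.2],
        [0.3, 0.1, 0.0],
        [0.0, -0.1, 0.2]]

h_lone_l = final_state("l", W_xh, W_hh)
h_hel = final_state("hel", W_xh, W_hh)

# Largest per-unit gap between the two states; nonzero means the state
# reflects the letters seen before "l", not just "l" itself.
gap = max(abs(a - b) for a, b in zip(h_lone_l, h_hel))
```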

CNN example: classifying a handwritten digit

Input: an $8 \times 8$ pixel grid where each pixel is 0 (background) or 1 (digit).

Filter (kernel): a small $3 \times 3$ grid of learned weights. A vertical edge detector kernel has the form $\begin{bmatrix} -1 & 0 & +1 \\ -1 & 0 & +1 \\ -1 & 0 & +1 \end{bmatrix}$.

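Convolution: slide the filter over every $3 \times 3$ patch of the image, compute the dot product between the filter and the patch, and collect the outputs into a feature map. With stride 1 and no padding, an $8 \times 8$ input and a $3 \times 3$ filter yield a $(8 - 3 + 1) \times (8 - 3 + 1) = 6 \times 6$ feature map.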
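The sliding dot product can be sketched in plain Python, assuming stride 1 and no padding. The test image (a vertical stroke two pixels wide) is made up for illustration. As is common in deep learning code, the kernel is applied without flipping, which matches the "dot product with each patch" description.

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over an n x n image (stride 1, no padding)."""
    n, k = len(image), len(kernel)
    out = n - k + 1  # 8 - 3 + 1 = 6 for the sizes used here
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out)]
            for i in range(out)]

# Vertical edge detector kernel.
KERNEL = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

# Hypothetical 8x8 binary image: a vertical stroke in columns 3 and 4.
image = [[1 if col in (3, 4) else 0 for col in range(8)] for _ in range(8)]

feature_map = conv2d_valid(image, KERNEL)
# Every row of feature_map is [0, 3, 3, -3, -3, 0]: positive where the patch
# goes from background to stroke (left edge), negative at the right edge.
```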

Activation and pooling

ReLU: $\mathrm{ReLU}(x) = \max(0, x)$, applied element-wise after convolution, sets negative values to zero.
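As a quick illustration, here is ReLU applied element-wise to a small feature map. The input values are hypothetical and chosen to contain both signs.

```python
def relu(feature_map):
    """Apply max(0, x) to every entry of a 2D list."""
    return [[max(0, v) for v in row] for row in feature_map]

# Hypothetical 6x6 feature map with positive and negative responses.
feature_map = [[0, 3, 3, -3, -3, 0] for _ in range(6)]
activated = relu(feature_map)
# Each row becomes [0, 3, 3, 0, 0, 0]: the negative responses are zeroed out.
```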

Max pooling: a $2 \times 2$ window takes the largest value in each non-overlapping region, compressing the $6 \times 6$ feature map to $3 \times 3$. This reduces spatial dimensions and makes the output less sensitive to small shifts in the input (a limited form of translation invariance).
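A minimal sketch of $2 \times 2$ max pooling over non-overlapping regions follows. The input is a hypothetical $6 \times 6$ map of non-negative values, as one might see after ReLU, and the sketch assumes the side length is even.

```python
def max_pool_2x2(feature_map):
    """Take the max of each non-overlapping 2x2 block (even side length assumed)."""
    n = len(feature_map)
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, n - 1, 2)]
            for i in range(0, n - 1, 2)]

# Hypothetical 6x6 map of non-negative activations.
activated = [[0, 3, 3, 0, 0, 0] for _ in range(6)]
pooled = max_pool_2x2(activated)
# pooled is 3x3, with every row equal to [3, 3, 0].
```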

Full CNN pipeline

Stacking convolution, activation, and pooling gives a complete classifier:

$$\text{Input} \to \text{conv} + \text{ReLU} \to \text{pool} \to \text{conv} + \text{ReLU} \to \text{pool} \to \text{FC layers} \to \text{digit scores}$$

Each filter produces one feature map. A network learns many filters in parallel, each attending to a different spatial pattern.
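To see how spatial size shrinks through the pipeline, the sketch below traces side lengths through two conv + pool stages. It assumes $3 \times 3$ filters with stride 1 and no padding, and $2 \times 2$ pooling that drops any leftover row or column. The $28 \times 28$ input is a hypothetical size for comparison and does not come from the class example.

```python
def conv_out(n, k=3):
    """Side length after a k x k convolution with stride 1 and no padding."""
    return n - k + 1

def pool_out(n, window=2):
    """Side length after non-overlapping pooling (leftover rows/columns dropped)."""
    return n // window

def trace(n, stages=2):
    """List the side length after each conv and each pool."""
    sizes = [n]
    for _ in range(stages):
        n = conv_out(n)
        sizes.append(n)
        n = pool_out(n)
        sizes.append(n)
    return sizes

sizes_28 = trace(28)  # [28, 26, 13, 11, 5]
sizes_8 = trace(8)    # [8, 6, 3, 1, 0]
```

Under these assumptions the $8 \times 8$ input is used up during the second stage, so a two-stage network on such a small image would likely need padding or a different layout. That is an inference from the arithmetic, not something stated in the notes.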