Neural networks: worked examples
Date
Wednesday, February 25, 2026
Notes
In class, we worked through two concrete examples — one RNN and one CNN — to make the mechanics of these architectures tangible.
Below are the key steps from each example:
RNN example: the hello task
Task: given one letter of the word “hello” at a time, predict the next letter.
The network processes one letter per time step and uses its hidden state to carry information from previous steps.
At each step: $$h^{(t)} = \tanh\!\left(W_{xh}\, x^{(t)} + W_{hh}\, h^{(t-1)}\right), \qquad \hat{y}^{(t)} = W_{hy}\, h^{(t)}$$
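The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in class: the function name `rnn_step`, the random initial weights, and the sizes (4 input symbols, 3 hidden units) are assumptions chosen to match the rest of these notes.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One time step: returns the new hidden state and the output scores."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)  # combine current input with previous state
    y_t = W_hy @ h_t                           # unnormalized scores for the next letter
    return h_t, y_t

# Assumed sizes and untrained random weights, for illustration only
rng = np.random.default_rng(0)
vocab_size, hidden_size = 4, 3
W_xh = rng.normal(scale=0.5, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(vocab_size, hidden_size))

x = np.array([1.0, 0.0, 0.0, 0.0])  # first of the four input symbols
h0 = np.zeros(hidden_size)          # initial hidden state
h1, y1 = rnn_step(x, h0, W_xh, W_hh, W_hy)
print(h1.shape, y1.shape)  # (3,) (4,)
```

Processing a whole word means calling `rnn_step` once per letter and feeding each returned hidden state into the next call.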
One-hot encoding
The four unique letters {h, e, l, o} are each represented as a vector with a single 1 and all other entries 0:
$$x^{\mathtt{h}} = [1,0,0,0]^T, \quad x^{\mathtt{e}} = [0,1,0,0]^T, \quad x^{\mathtt{l}} = [0,0,1,0]^T, \quad x^{\mathtt{o}} = [0,0,0,1]^T$$
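As a small sketch of how these vectors might be built in practice, the snippet below maps each letter to its index and fills in a single 1. The helper name `one_hot` and the variable names are illustrative assumptions.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(ch):
    """Return a vector of zeros with a single 1 at the letter's index."""
    v = np.zeros(len(vocab))
    v[char_to_idx[ch]] = 1.0
    return v

# For the next-letter task, inputs are "h","e","l","l" and targets are "e","l","l","o"
inputs = [one_hot(c) for c in "hell"]
targets = list("ello")
print(one_hot("l"))  # [0. 0. 1. 0.]
```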
The three weight matrices serve distinct roles: $W_{xh}$ maps the current input to the hidden layer, $W_{hh}$ carries the previous hidden state forward, and $W_{hy}$ maps the hidden state to the output.
The hidden-state problem
If the hidden layer has as many neurons as there are letters, each neuron can simply learn to “be” a letter.
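After processing “hel,” the hidden state $[0,0,1,0]$ is identical to the state produced by a lone “l,” and also to the state after “hell.” The network encodes which letter it just saw, not where that letter sits in the sequence, so it cannot distinguish the first “l” in “hello” (which should be followed by “l”) from the second (which should be followed by “o”), and it must get at least one of those predictions wrong.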
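To make the problem concrete, here is a sketch with hand-set (not learned) weights in which each of four hidden units simply copies one letter. The specific values, a scaled identity for $W_{xh}$ and zeros for $W_{hh}$, are assumptions chosen to mimic the "one neuron per letter" situation as directly as possible.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
eye = np.eye(len(vocab))
one_hot = {c: eye[i] for i, c in enumerate(vocab)}

# Hypothetical hand-set weights: each hidden unit copies one letter, and the
# recurrent weights are zero, so nothing is carried over from earlier steps.
W_xh = 5.0 * np.eye(4)
W_hh = np.zeros((4, 4))

def final_state(text):
    """Run the recurrence over text and return the last hidden state."""
    h = np.zeros(4)
    for ch in text:
        h = np.tanh(W_xh @ one_hot[ch] + W_hh @ h)
    return h

h_first_l = final_state("hel")    # state at the first "l"
h_second_l = final_state("hell")  # state at the second "l"
h_lone_l = final_state("l")       # state after "l" with no context

print(np.round(h_first_l, 3))  # [0. 0. 1. 0.]
print(np.array_equal(h_first_l, h_lone_l), np.array_equal(h_first_l, h_second_l))  # True True
```

Because all three states are identical, any output matrix $W_{hy}$ produces the same scores at both “l” positions, so the network cannot predict “l” after the first and “o” after the second.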
Solution: distributed representations
Using fewer hidden neurons than vocabulary items (e.g., 3 instead of 4) forces the network to distribute information across neurons. No single unit represents a single letter.
After “hel,” the hidden state differs from the state after a lone “l,” because sequential context is encoded collectively. This distributed representation is learned through training.
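The sketch below illustrates the context dependence only; it does not train anything. It uses 3 hidden units with random, untrained weights (an assumption made for illustration) and shows that once $W_{hh}$ is nonzero, the state after “hel” differs from the state after a lone “l.” A trained network would additionally shape these states so that the correct next letter can be read out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["h", "e", "l", "o"]
eye = np.eye(len(vocab))
one_hot = {c: eye[i] for i, c in enumerate(vocab)}

# Assumed setup: 3 hidden units, random untrained weights
hidden_size = 3
W_xh = rng.normal(scale=0.8, size=(hidden_size, len(vocab)))
W_hh = rng.normal(scale=0.8, size=(hidden_size, hidden_size))

def final_state(text):
    """Run the recurrence over text and return the last hidden state."""
    h = np.zeros(hidden_size)
    for ch in text:
        h = np.tanh(W_xh @ one_hot[ch] + W_hh @ h)
    return h

h_hel = final_state("hel")
h_l = final_state("l")
print(np.round(h_hel, 3))
print(np.round(h_l, 3))
print(np.allclose(h_hel, h_l))  # False: the history changes the state
```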
CNN example: classifying a handwritten digit
Input: an $8 \times 8$ pixel grid where each pixel is 0 (background) or 1 (digit).
Filter (kernel): a small $3 \times 3$ grid of learned weights. A vertical edge detector kernel has the form $[-1, 0, +1;\; -1, 0, +1;\; -1, 0, +1]$.
Convolution: place the filter at each patch of the image, compute the dot product between the filter and the patch, and collect the outputs into a feature map. With stride 1 and no padding, an $8 \times 8$ input with a $3 \times 3$ filter yields a $6 \times 6$ feature map, since $8 - 3 + 1 = 6$.
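The sliding-window computation can be written out explicitly. The sketch below assumes stride 1 and no padding, and uses a made-up input (a vertical stroke two pixels wide) rather than the digit from class; `conv2d_valid` is an illustrative name.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take dot products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and filter
    return out

# Hypothetical 8x8 binary image: a vertical stroke in columns 3 and 4
image = np.zeros((8, 8))
image[:, 3:5] = 1.0

# Vertical edge detector from the notes
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (6, 6)
print(feature_map[0])
```

For this input, each row of the feature map is $+3$ where the window straddles the left edge of the stroke (background to digit) and $-3$ where it straddles the right edge (digit to background), with zeros elsewhere.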
Activation and pooling
ReLU: $\mathrm{ReLU}(x) = \max(0, x)$, applied element-wise after convolution, sets negative values to zero.
Max pooling: a $2 \times 2$ window takes the largest value in each non-overlapping region, compressing the $6 \times 6$ feature map to $3 \times 3$. This reduces spatial dimensions and makes the output less sensitive to small shifts in the input (a limited form of translation invariance).
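Both operations are short in NumPy. The following is a sketch under the assumption that the feature map has even height and width; the $6 \times 6$ values are made up for illustration, and the function names are not from any particular library.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)  # split into 2x2 blocks
    return blocks.max(axis=(1, 3))               # largest value per block

# Hypothetical 6x6 feature map with positive and negative responses
row = np.array([0.0, 3.0, 3.0, -3.0, -3.0, 0.0])
feature_map = np.tile(row, (6, 1))

activated = relu(feature_map)   # negatives become 0
pooled = max_pool_2x2(activated)
print(pooled.shape)  # (3, 3)
print(pooled[0])     # [3. 3. 0.]

# A one-pixel shift that stays inside the same 2x2 window leaves the output unchanged
a = np.zeros((6, 6)); a[0, 2] = 1.0
b = np.zeros((6, 6)); b[0, 3] = 1.0
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

The last check only holds because both pixels fall in the same pooling window; a shift that crosses a window boundary does change the pooled output.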
Full CNN pipeline
Stacking convolution, activation, and pooling gives a complete classifier:
$$\text{Input} \to \text{conv} + \text{ReLU} \to \text{pool} \to \text{conv} + \text{ReLU} \to \text{pool} \to \text{FC layers} \to \text{digit scores}$$
Each filter produces one feature map. A network learns many filters in parallel, each attending to a different spatial pattern.
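To see how sizes evolve through such a pipeline, the sketch below traces the spatial size and number of feature maps stage by stage. Several choices here are assumptions, not taken from the class example: the convolutions use padding 1 so that two conv and pool stages fit on an $8 \times 8$ input (without padding, the second stage would shrink a $3 \times 3$ map to $1 \times 1$, leaving nothing to pool), and the filter counts (4 and 8) are arbitrary.

```python
def conv_out(n, k=3, stride=1, pad=0):
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * pad - k) // stride + 1

def pool_out(n, window=2):
    """Output size of non-overlapping pooling along one spatial dimension."""
    return n // window

print(conv_out(8))  # 6, matching the unpadded 8x8 -> 6x6 case in the notes

size, channels = 8, 1
trace = [("input", size, channels)]
# Hypothetical filter counts; each filter contributes one feature map
for name, n_filters in [("conv1", 4), ("conv2", 8)]:
    size = conv_out(size, k=3, pad=1)  # padding 1 keeps the spatial size
    channels = n_filters
    trace.append((name + " + ReLU", size, channels))
    size = pool_out(size)              # 2x2 pooling halves each dimension
    trace.append(("pool", size, channels))

flat = size * size * channels  # length of the vector fed to the FC layers
for name, s, c in trace:
    print(f"{name:14s} {s}x{s}x{c}")
print("flattened:", flat)  # flattened: 32
```

Under these assumptions the sizes run 8, 8, 4, 4, 2, and the fully connected layers would map the 32 flattened values to 10 digit scores.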