Neural networks: continued

Date

Monday, February 23, 2026

Notes

In class, we covered CNN feature hierarchies and the encoder-decoder architecture, then worked through the LSTM in detail as a solution to the vanishing gradient problem in RNNs.

Below are key concepts from this lecture:

Feature hierarchies in CNNs

As a CNN deepens, its representations become progressively more abstract:

Shallow layers detect low-level, spatially local patterns: edges, oriented gradients, colour blobs, simple textures.
Mid layers combine these into part-level features: corners, junctions, repeating textures, object parts.
Deep layers encode high-level semantic concepts with large receptive fields — for example, object classes.
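
The growth of the receptive field with depth can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `receptive_field` and the layer stack are assumptions, not something from the lecture, and the recurrence ignores dilation and padding effects at the image border.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # distance, in input pixels, between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                  # strides make later layers widen it faster
    return rf


# Hypothetical stack: pairs of 3x3 convolutions separated by 2x2 pooling (stride 2).
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
for depth in range(1, len(stack) + 1):
    print(depth, receptive_field(stack[:depth]))
```

A unit in the first layer sees a 3-pixel-wide window, while a unit in the last layer of this stack sees 32 pixels, which is why deeper layers can respond to whole objects rather than edges.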

Encoder-decoder architecture

By encoding an input into a compact representation and then decoding it, a network learns which features best describe the data.

The encoder compresses the input.
The decoder reconstructs or transforms it.

This architecture underlies tasks such as image segmentation.
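
As a toy illustration of the compress-then-reconstruct idea, the sketch below uses fixed operations on a 1D signal: averaging as the "encoder" and sample repetition as the "decoder". This is an assumption made for clarity; in a real network both mappings are learned, and the names `encode` and `decode` are hypothetical.

```python
def encode(signal, factor=2):
    """Compress by averaging non-overlapping windows of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]


def decode(code, factor=2):
    """Expand by repeating each code value `factor` times."""
    return [value for value in code for _ in range(factor)]


signal = [1.0, 1.0, 4.0, 6.0, 2.0, 2.0, 0.0, 8.0]
code = encode(signal)            # half as many values as the input
reconstruction = decode(code)    # same length as the input, fine detail lost
print(code)
print(reconstruction)
```

The reconstruction has the right length and the right coarse shape, but the sample-to-sample variation inside each window is gone. That loss of spatial detail is what the bottleneck costs.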

U-Net

U-Net is an encoder-decoder designed for image-to-mask tasks. Its key innovation is skip connections: feature maps from each encoder stage are concatenated directly into the matching decoder stage.

This lets the decoder use both high-level semantic information from the bottleneck and low-level spatial detail from earlier encoder layers.
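
The effect of a skip connection can be sketched on a 1D feature map with a single channel. Everything here is a simplified assumption: the helper names are hypothetical, averaging and repetition stand in for learned down- and up-sampling, and the convolutions that a real U-Net applies at every stage are omitted.

```python
def downsample(channel, factor=2):
    """Average non-overlapping windows of `factor` samples."""
    return [sum(channel[i:i + factor]) / factor
            for i in range(0, len(channel) - factor + 1, factor)]


def upsample(channel, factor=2):
    """Repeat each sample `factor` times."""
    return [value for value in channel for _ in range(factor)]


def concat_channels(a, b):
    """Concatenate two feature maps (lists of channels) along the channel axis."""
    if any(len(ch) != len(a[0]) for ch in a + b):
        raise ValueError("all channels must have the same spatial length")
    return a + b


encoder_features = [[0.0, 9.0, 0.0, 9.0]]                  # sharp, full-resolution detail
bottleneck = [downsample(ch) for ch in encoder_features]   # coarse summary
upsampled = [upsample(ch) for ch in bottleneck]            # full length, detail lost
decoder_input = concat_channels(upsampled, encoder_features)
print(decoder_input)
```

After concatenation the decoder stage sees two channels: the smooth upsampled one and the original encoder channel that still carries the sharp transitions.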

U-Net was developed for biomedical segmentation and is widely used in the geosciences (e.g., seismic facies segmentation).

The vanishing gradient problem

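Standard RNNs struggle with long sequences. During backpropagation through time, the gradient is multiplied at every step by the Jacobian of the hidden-state update, which contains the same recurrent weight matrix each time. When the norm of that Jacobian stays below 1 (roughly, when the largest singular value of the recurrent weights is below 1), the gradient shrinks exponentially with the number of steps. This is the vanishing gradient problem.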

In practice, the network cannot learn dependencies separated by many time steps.
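
A scalar RNN makes the exponential decay easy to see. The sketch below assumes the update $h_t = \tanh(w\,h_{t-1} + x_t)$ with a single recurrent weight `w`; the function name and the chosen values are illustrative, not taken from the lecture.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule factor: d h_t / d h_{t-1}
    return abs(grad)


for steps in (1, 10, 50):
    # With zero inputs and h0 = 0 the tanh derivative is 1, so the result is w ** steps.
    print(steps, gradient_through_time(0.9, [0.0] * steps))
```

With `w = 0.9` the gradient after 50 steps is already well below 1 percent of its one-step value, and non-zero inputs shrink it further because the tanh derivative is then less than 1.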

Long short-term memory (LSTM)

The LSTM addresses the vanishing gradient problem by maintaining a cell state $c_t$, a vector that is carried across time steps by element-wise operations only. Because no weight matrix multiplies the cell state directly along this path, gradients can propagate through long sequences without shrinking, provided the forget gate stays close to 1.

The cell state acts as a conveyor belt of memory.

LSTM gates

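Three learned gates control how the cell state is updated at each time step. In the equations, $\sigma$ is the logistic sigmoid, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $\otimes$ denotes element-wise multiplication: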

Forget gate: how much of the old cell state to keep. $$f_t = \sigma\!\left(W_f[h_{t-1}, x_t] + b_f\right)$$
Input gate + candidate: how much new information to write. $$i_t = \sigma\!\left(W_i[h_{t-1}, x_t] + b_i\right), \quad \tilde{c}_t = \tanh\!\left(W_c[h_{t-1}, x_t] + b_c\right)$$
Output gate: which part of the updated cell state to expose. $$o_t = \sigma\!\left(W_o[h_{t-1}, x_t] + b_o\right), \quad h_t = o_t \otimes \tanh(c_t)$$

Cell state update

The cell state combines a forgotten version of the past with new candidate information:

$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t$$
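
The gate equations and the cell state update above can be put together in a single time step. The sketch below is a minimal version for scalar input, hidden state and cell state, so each weight matrix reduces to a pair of numbers. The function names and the layout of `params` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step for scalar x_t, h_prev and c_prev.

    `params` maps each of "f", "i", "c", "o" to a tuple (w_h, w_x, b),
    the scalar analogue of W[h_{t-1}, x_t] + b.
    """
    def pre_activation(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))        # forget gate
    i_t = sigmoid(pre_activation("i"))        # input gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell state
    o_t = sigmoid(pre_activation("o"))        # output gate

    c_t = f_t * c_prev + i_t * c_tilde        # cell state update
    h_t = o_t * math.tanh(c_t)                # exposed hidden state
    return h_t, c_t


# Hypothetical parameters: forget gate held open, input gate held shut.
remember = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
            "c": (0.0, 0.0, 0.0), "o": (0.0, 0.0, 0.0)}
h, c = 0.0, 2.0
for x in [1.0, -3.0, 0.5]:
    h, c = lstm_step(x, h, c, remember)
print(c)
```

With the forget gate near 1 and the input gate near 0, the cell state stays close to its initial value of 2.0 regardless of the inputs, which is the conveyor-belt behaviour in miniature.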

This additive, element-wise path is what lets gradients flow across many time steps without vanishing, and it is the core advantage of the LSTM over a standard RNN.
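
To see why, differentiate the cell state update with respect to the previous cell state. As a simplifying assumption, the gates are treated as constants here, which ignores their indirect dependence on $c_{t-1}$ through $h_{t-1}$:

$$\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t), \qquad \frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)$$

Each factor is a diagonal matrix of forget-gate values between 0 and 1, not a repeated weight matrix. When the network learns to keep $f_t$ close to 1 for a given component, the product stays close to 1 and the gradient survives. If $f_t$ is well below 1, that component is still forgotten and its gradient still decays.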