Neural networks: more architectures
Date: Wednesday, February 18, 2026
Notes
In class, we briefly recapped the perceptron, then spent the bulk of the lecture on two new architectures: convolutional neural networks (CNNs) for gridded/image data and recurrent neural networks (RNNs) for sequential data.
Convolutional Neural Networks
Image segmentation.
Before introducing CNNs, we discussed three types of image segmentation — a family of tasks CNNs are often applied to:
- Semantic segmentation: Assign a class label to every pixel in an image.
- Instance segmentation: Identify and delineate each individual object in an image.
- Panoptic segmentation: A combination of both — every pixel receives a class label, and pixels belonging to countable objects are additionally assigned to individual instances.
The naive approach: flattening.
One way to apply a neural network to an image is to flatten all pixels into a 1-D vector and feed it into a fully connected (FC) network. The problems with this approach:
- It destroys spatial structure — nearby pixels, which tend to carry related information, are no longer treated as neighbors.
- It scales poorly: a $100 \times 100$ image produces 10,000 inputs, each connected to every neuron in the next layer.
The convolution (cross-correlation).
A better approach is to exploit the spatial structure of images directly. We define a small kernel (or filter) and slide it across the image, computing a dot product at each position:
$$(\mathbf{I} * \mathbf{K})_{i,j} = \sum_{m}\sum_{n} K_{m,n}\, I_{i+m,\, j+n}$$
- The kernel has learned weights (not pre-defined).
- Each kernel position produces one output value; sliding the kernel across the entire input produces a feature map.
- An activation function (typically ReLU) is applied to the feature map.
- A single layer can learn multiple kernels, each detecting a different spatial pattern, producing multiple feature maps.
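The sliding dot product above can be sketched in a few lines of NumPy. This is an illustrative toy (no padding, stride 1, a hand-picked $2 \times 2$ kernel standing in for learned weights; the function name `cross_correlate` is our own):

```python
import numpy as np

def cross_correlate(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and take a
    dot product at each position, producing a feature map."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * image[i:i + kh, j:j + kw])
    return out

image = np.arange(16.0).reshape(4, 4)            # toy 4x4 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                 # stand-in for learned weights
fmap = cross_correlate(image, kernel)
print(fmap.shape)        # (3, 3): a 4x4 image with a 2x2 kernel gives a 3x3 map
relu_fmap = np.maximum(fmap, 0.0)                # ReLU applied to the feature map
```

Note how the output shrinks: with no padding, a $k \times k$ kernel turns an $H \times W$ input into an $(H-k+1) \times (W-k+1)$ feature map.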
Pooling (subsampling).
After convolution, we apply pooling to reduce the spatial dimensions of a feature map while retaining the most salient information. The most common variant is max pooling: a small window (e.g., $2 \times 2$ with stride 2) slides over the feature map and keeps only the maximum value in each region.
- This halves the spatial dimensions at each pooling step.
- It introduces a degree of translation invariance.
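A minimal NumPy sketch of $2 \times 2$, stride-2 max pooling (the helper name `max_pool` is our own):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Max pooling: keep only the largest value in each window."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 8., 2.],
                 [2., 3., 1., 1.]])
print(max_pool(fmap))  # [[4., 5.], [3., 8.]] — each 2x2 block reduced to its max
```

Shifting a feature by a pixel often leaves the window maxima unchanged, which is the source of the (partial) translation invariance noted above.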
Putting it all together: the CNN architecture.
A typical CNN stacks these operations in sequence, followed by a classification head:
$$\text{Input} \xrightarrow{\text{conv + ReLU}} \text{feature maps} \xrightarrow{\text{pool}} \cdots \xrightarrow{\text{flatten or GAP}} \text{FC layers} \to \text{output}$$
The conv + pool blocks handle feature extraction — learning what patterns are present and where. The FC layers handle classification (or regression) — mapping features to predictions.
Global average pooling (GAP).
As an alternative to flattening, global average pooling compresses each feature map to a single scalar by averaging all of its spatial values. If there are $K$ feature maps, GAP produces a vector of $K$ values, which is then passed to the output layer. GAP dramatically reduces the number of parameters compared to a fully connected head.
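In NumPy, GAP is a single reduction over the spatial axes (shapes here are illustrative):

```python
import numpy as np

# K feature maps of size H x W, stacked as (K, H, W): e.g. K=16 maps of 8x8
feature_maps = np.random.rand(16, 8, 8)

# GAP: average each map over its spatial dimensions -> one scalar per map
gap = feature_maps.mean(axis=(1, 2))
print(gap.shape)  # (16,): one value per feature map, ready for the output layer
```

Compare parameter counts: flattening these maps gives $16 \times 8 \times 8 = 1024$ inputs to the next layer, while GAP gives just 16.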
Recurrent Neural Networks
Motivation: sequential data.
A feed-forward network treats each input independently — it has no notion of order or temporal context. But many Earth science problems involve sequences where the history matters (e.g., a time series of temperature, a seismic waveform). One solution: give the network a memory.
The hidden state.
In an RNN, the network maintains a hidden state $\mathbf{h}_t$ that is updated at every time step. The hidden state receives two inputs: the current observation $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$:
$$\mathbf{h}_t = f\!\left(\mathbf{W}_h\,\mathbf{h}_{t-1} + \mathbf{W}_x\,\mathbf{x}_t + \mathbf{b}\right)$$
The weight matrices $\mathbf{W}_h$ and $\mathbf{W}_x$ are shared across all time steps — the same parameters are applied at every step.
Unrolling the RNN.
The recurrent loop can be unrolled over time, making the information flow explicit:
$$\cdots \to h_{t-1} \to h_t \to h_{t+1} \to \cdots$$
At each step, the hidden state is updated and an output $\hat{y}_t$ can be produced.
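The unrolled update can be sketched in NumPy. This is a toy forward pass only (random weights, $f = \tanh$, dimensions chosen for illustration); note that the same `W_h` and `W_x` are reused at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 5                            # input and hidden sizes (illustrative)
W_h = rng.normal(size=(d_h, d_h)) * 0.1     # hidden-to-hidden weights
W_x = rng.normal(size=(d_h, d_in)) * 0.1    # input-to-hidden weights
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One update: h_t = f(W_h h_{t-1} + W_x x_t + b), with f = tanh."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

xs = rng.normal(size=(10, d_in))            # a sequence of T = 10 observations
h = np.zeros(d_h)                           # initial hidden state
hidden_states = []
for x_t in xs:                              # unroll: same weights at every step
    h = rnn_step(h, x_t)
    hidden_states.append(h)
print(len(hidden_states), hidden_states[-1].shape)  # 10 (5,)
```

An output $\hat{y}_t$ would typically be read out from each $\mathbf{h}_t$ by a further learned mapping.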
What RNNs can do.
RNNs support several input-output configurations, each useful in different Earth science contexts:
- One-to-many: Single input → generate a sequence (e.g., initial ocean state → multi-year climate simulation).
- Many-to-one: Sequence → single output (e.g., seismic waveform → earthquake/noise classification).
- Many-to-many: Sequence in, sequence out of the same length (e.g., hourly meteorological inputs → hourly precipitation predictions).
- Seq2seq: Read $N$ steps, forecast $M$ steps ahead (e.g., past month of sea surface temperatures → next week’s forecast).
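As a sketch of the many-to-one configuration (the seismic-classification example above): the RNN consumes the whole sequence, and a single readout is taken from the final hidden state. Everything here is hypothetical — random weights, a logistic readout `w_out` of our own choosing, and an untrained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
T, d_in, d_h = 50, 1, 8                     # e.g. a 50-sample waveform snippet
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_in)) * 0.1
w_out = rng.normal(size=d_h)                # readout weights (hypothetical)

waveform = rng.normal(size=(T, d_in))       # stand-in seismic trace
h = np.zeros(d_h)
for x_t in waveform:                        # consume the entire sequence...
    h = np.tanh(W_h @ h + W_x @ x_t)
p_quake = sigmoid(w_out @ h)                # ...then emit one output at the end
print(float(p_quake))  # a single probability, e.g. earthquake vs. noise
```

The other configurations differ only in where outputs are read out: many-to-many reads one per step, one-to-many feeds each output back in as the next input.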