Neural networks: more architectures

Date

Wednesday, February 18, 2026

Notes

In class, we briefly recapped the perceptron, then spent the bulk of the lecture on two new architectures: convolutional neural networks (CNNs) for gridded/image data and recurrent neural networks (RNNs) for sequential data.


Convolutional Neural Networks

  1. Image segmentation.

    Before introducing CNNs, we discussed three types of image segmentation — a family of tasks CNNs are often applied to:

    • Semantic segmentation: Assign a class label to every pixel in an image.
    • Instance segmentation: Identify and delineate each individual object in an image.
    • Panoptic segmentation: A combination of the two: every pixel receives a class label, and pixels belonging to countable objects are also assigned to an individual instance.
  2. The naive approach: flattening.

    One way to apply a neural network to an image is to flatten all pixels into a 1-D vector and feed it into a fully connected (FC) network. The problems with this approach:

    • It destroys spatial structure — nearby pixels, which tend to carry related information, are no longer treated as neighbors.
    • It scales poorly: a $100 \times 100$ image produces 10,000 inputs, each connected to every neuron in the next layer.
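To make the scaling concrete, here is a quick back-of-the-envelope count (the hidden-layer width is a hypothetical choice for illustration, not a figure from the lecture):

```python
# Flattening a 100x100 grayscale image gives 10,000 inputs.
pixels = 100 * 100
# Suppose the first fully connected layer has 1,000 neurons (hypothetical size).
hidden_units = 1000
# Every input connects to every neuron, so the weight count multiplies:
weights = pixels * hidden_units
print(weights)  # 10,000,000 weights in the first layer alone
```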
  3. The convolution (cross-correlation).

    A better approach is to exploit the spatial structure of images directly. We define a small kernel (or filter) and slide it across the image, computing a dot product at each position:

    $$(\mathbf{I} * \mathbf{K})_{i,j} = \sum_{m}\sum_{n} K_{m,n}\, I_{i+m,\, j+n}$$

    • The kernel has learned weights (not pre-defined).
    • Each kernel position produces one output value; sliding the kernel across the entire input produces a feature map.
    • An activation function (typically ReLU) is applied to the feature map.
    • A single layer can learn multiple kernels, each detecting a different spatial pattern, producing multiple feature maps.
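The sliding dot product above can be sketched in a few lines of NumPy. The example kernel is hand-picked here to respond to vertical edges; in a CNN its entries would be learned weights:

```python
import numpy as np

def cross_correlate(image, kernel):
    """Valid-mode 2-D cross-correlation: slide the kernel over the image
    and take an elementwise product-and-sum at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * image[i:i + kh, j:j + kw])
    return out

# An image with a vertical edge between columns 1 and 2.
image = np.array([[0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.]])
# A vertical-edge detector (hypothetical, hand-picked for illustration).
kernel = np.array([[-1., 1.],
                   [-1., 1.]])
feature_map = cross_correlate(image, kernel)
# Each row of the feature map is [0., 2., 0.]: the response peaks at the edge.
```

Note that deep-learning "convolutions" are implemented exactly like this, as cross-correlations (the kernel is not flipped); since the weights are learned, the distinction does not matter in practice.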
  4. Pooling (subsampling).

    After convolution, we apply pooling to reduce the spatial dimensions of a feature map while retaining the most salient information. The most common variant is max pooling: a small window (e.g., $2 \times 2$ with stride 2) slides over the feature map and keeps only the maximum value in each region.

    • This halves the spatial dimensions at each pooling step.
    • It introduces a degree of translation invariance.
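Max pooling with a $2 \times 2$ window and stride 2, as described above, can be sketched as:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Keep only the maximum in each window, halving each spatial dimension."""
    H, W = fmap.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

# A hypothetical 4x4 feature map.
fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 0., 5., 6.],
                 [1., 2., 7., 8.]])
pooled = max_pool(fmap)  # -> [[4., 2.], [2., 8.]], a 2x2 output
```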
  5. Putting it all together: the CNN architecture.

    A typical CNN stacks these operations in sequence, followed by a classification head:

    $$\text{Input} \xrightarrow{\text{conv + ReLU}} \text{feature maps} \xrightarrow{\text{pool}} \cdots \xrightarrow{\text{flatten or GAP}} \text{FC layers} \to \text{output}$$

    The conv + pool blocks handle feature extraction — learning what patterns are present and where. The FC layers handle classification (or regression) — mapping features to predictions.

  6. Global average pooling (GAP).

    As an alternative to flattening, global average pooling compresses each feature map to a single scalar by averaging all of its spatial values. If there are $K$ feature maps, GAP produces a vector of $K$ values, which is then passed to the output layer. GAP dramatically reduces the number of parameters compared to a fully connected head.
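A minimal sketch of GAP, assuming a channels-first stack of $K = 3$ hypothetical feature maps:

```python
import numpy as np

# Hypothetical stack of K=3 feature maps, each 4x4 (channels-first layout).
feature_maps = np.arange(48, dtype=float).reshape(3, 4, 4)

# Global average pooling: average over the spatial axes, one scalar per map.
gap = feature_maps.mean(axis=(1, 2))  # shape (K,) = (3,)

# Contrast with flattening, which would hand 3*4*4 = 48 values to the FC head:
flattened = feature_maps.reshape(-1)  # shape (48,)
```

With realistic sizes (e.g., 512 feature maps of $7 \times 7$), flattening yields 25,088 inputs to the FC head versus 512 for GAP, which is where the parameter savings come from.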


Recurrent Neural Networks

  1. Motivation: sequential data.

    A feed-forward network treats each input independently — it has no notion of order or temporal context. But many Earth science problems involve sequences where the history matters (e.g., a time series of temperature, a seismic waveform). One solution: give the network a memory.

  2. The hidden state.

    In an RNN, the network maintains a hidden state $\mathbf{h}_t$ that is updated at every time step. The hidden state receives two inputs: the current observation $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$:

    $$\mathbf{h}_t = f\!\left(\mathbf{W}_h\,\mathbf{h}_{t-1} + \mathbf{W}_x\,\mathbf{x}_t + \mathbf{b}\right)$$

    The weight matrices $\mathbf{W}_h$ and $\mathbf{W}_x$ are shared across all time steps — the same parameters are applied at every step.

  3. Unrolling the RNN.

    The recurrent loop can be unrolled over time, making the information flow explicit:

    $$\cdots \to h_{t-1} \to h_t \to h_{t+1} \to \cdots$$

    At each step, the hidden state is updated and an output $\hat{y}_t$ can be produced.
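The recurrence and its unrolling are just a loop that reapplies the same weights at every step. A minimal NumPy sketch, with hypothetical dimensions and $f = \tanh$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2-D observations, 3-D hidden state.
n_in, n_hidden = 2, 3
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights
W_x = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
b = np.zeros(n_hidden)

def step(h_prev, x_t):
    """One update of the recurrence h_t = f(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(n_hidden)              # initial hidden state
x_seq = rng.normal(size=(5, n_in))  # a 5-step input sequence
for x_t in x_seq:                   # unrolling: the SAME W_h, W_x at every step
    h = step(h, x_t)
```

Because tanh squashes its input, every component of the hidden state stays in $(-1, 1)$ regardless of sequence length.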

  4. What RNNs can do.

    RNNs support several input-output configurations, each useful in different Earth science contexts:

    • One-to-many: Single input → generate a sequence (e.g., initial ocean state → multi-year climate simulation).
    • Many-to-one: Sequence → single output (e.g., seismic waveform → earthquake/noise classification).
    • Many-to-many: Sequence in, sequence out of the same length (e.g., hourly meteorological inputs → hourly precipitation predictions).
    • Seq2seq: Read $N$ steps, forecast $M$ steps ahead (e.g., past month of sea surface temperatures → next week’s forecast).
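The many-to-one and many-to-many configurations differ only in which of the unrolled outputs are read out, as this sketch (hypothetical sizes and weights) illustrates:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, T = 2, 3, 1, 6  # hypothetical dimensions, 6 time steps

# Shared weights, reused at every step.
W_h = rng.normal(size=(n_hidden, n_hidden))
W_x = rng.normal(size=(n_hidden, n_in))
W_y = rng.normal(size=(n_out, n_hidden))  # hidden-to-output readout

x_seq = rng.normal(size=(T, n_in))
h = np.zeros(n_hidden)
outputs = []
for x_t in x_seq:
    h = np.tanh(W_h @ h + W_x @ x_t)
    outputs.append(W_y @ h)          # a prediction y_hat_t at every step

many_to_many = np.stack(outputs)     # keep all T outputs (sequence -> sequence)
many_to_one = outputs[-1]            # keep only the final output (sequence -> one value)
```

A seq2seq setup extends this idea: after consuming $N$ input steps, the network keeps stepping (often feeding its own predictions back in) to emit the next $M$ outputs.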