Neural networks (beginning with the perceptron)
Date
Friday, February 13, 2026
Links of interest
Notes
In class, we introduced the perceptron — the simplest building block of a neural network — and extended it to a multi-layer architecture. We discussed:
The perceptron: structure.
A perceptron takes one or more inputs, applies weights and a bias, and passes the result through an activation function to produce an output:
$$z = \sum_i w_i x_i + b, \qquad \hat{y} = f(z)$$
Components:
- Inputs ($x_0, x_1, \ldots$): the features fed into the neuron.
- Weights ($w_0, w_1, \ldots$): learned parameters that scale each input.
- Bias ($b$): a learned offset that shifts the activation threshold.
- Activation function ($f$): transforms the weighted sum into an output.
The Heaviside (step) activation function.
The original perceptron uses a hard threshold:
$$f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$
The bias $b$ shifts the location of this threshold along the $z$-axis.
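As a concrete sketch (not from the lecture itself), the weighted sum and the step activation can be written in a few lines; the weights and bias below are illustrative, hand-picked values:

```python
import numpy as np

def heaviside(z):
    """Hard-threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Weighted sum plus bias, passed through the step function."""
    z = np.dot(w, x) + b
    return heaviside(z)

# Illustrative parameters (chosen by hand, not learned)
w = np.array([0.5, -0.5])
b = 0.0
print(perceptron(np.array([1.0, 0.0]), w, b))  # z = 0.5  -> 1
print(perceptron(np.array([0.0, 1.0]), w, b))  # z = -0.5 -> 0
```

Note how changing $b$ shifts where the output flips from 0 to 1, exactly as described above.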
What a perceptron can do.
A perceptron with a step activation function is a binary classifier: it separates data into two classes. This works only when the two classes are linearly separable — that is, a single hyperplane (line in 2D) can divide them.
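A classic example of a linearly separable problem is logical AND: in 2D, the line $x_1 + x_2 = 1.5$ separates the single positive point $(1, 1)$ from the three negative ones. The weights below are hand-picked for illustration (XOR, by contrast, has no such separating line):

```python
def and_perceptron(x1, x2):
    """AND is linearly separable: the line x1 + x2 = 1.5 divides the classes.
    Weights w = (1, 1) and bias b = -1.5 are hand-picked, illustrative values."""
    z = 1.0 * x1 + 1.0 * x2 - 1.5
    return 1 if z >= 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_perceptron(a, b))  # fires only for (1, 1)
```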
Historical context.
The perceptron was invented by Frank Rosenblatt in 1958. The Mark I Perceptron was a physical machine built at the Cornell Aeronautical Laboratory — one of the first hardware implementations of a learning algorithm.
Training a perceptron: the perceptron training rule.
Gradient descent cannot be applied directly to the perceptron because the Heaviside step function is non-differentiable. Instead, we use the perceptron training rule, an online (sequential) algorithm that updates weights after each individual training example:
- Pick a training example $(\mathbf{x}, y)$.
- Compute the prediction: $\hat{y} = f\!\left(\sum_i w_i x_i + b\right)$.
- If $\hat{y} \neq y$, update: $$w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i, \qquad b \leftarrow b + \eta\,(y - \hat{y})$$ where $\eta$ is the learning rate.
- Move to the next training example and repeat until all examples are correctly classified (or a stopping criterion is met).
This is an online (or sequential) algorithm — weights are updated one example at a time — in contrast to a batch approach, which computes updates over an entire dataset or mini-batch.
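The steps above can be sketched as follows; the learning rate, epoch limit, and the OR dataset are illustrative choices, not part of the lecture:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Online perceptron training rule (hyperparameters are illustrative)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):              # one example at a time (online)
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi  # w_i <- w_i + eta (y - y_hat) x_i
                b += eta * (yi - y_hat)       # b   <- b   + eta (y - y_hat)
                errors += 1
        if errors == 0:                       # all examples correctly classified
            break
    return w, b

# Learn the (linearly separable) OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y)
preds = [1 if np.dot(w, xi) + b >= 0 else 0 for xi in X]
print(preds)  # [0, 1, 1, 1]
```

Because OR is linearly separable, the rule is guaranteed to converge; on a non-separable problem like XOR it would loop until `max_epochs` is exhausted.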
The perceptron as an artificial neuron.
The perceptron is the simplest example of an artificial neuron, a computational unit loosely inspired by biological neurons. All artificial neurons share the same five components: inputs, weights, a bias, an activation function, and an output.
Activation functions and non-linearity.
More powerful activation functions introduce non-linearity, enabling networks of neurons to solve complex, non-linear problems. Three common choices:
- Sigmoid: $\sigma(z) = \dfrac{1}{1+e^{-z}}$, outputs in $(0, 1)$. Useful for probabilities.
- Tanh: $\tanh(z)$, outputs in $(-1, 1)$. Zero-centered, often faster to converge than sigmoid.
- ReLU (Rectified Linear Unit): $f(z) = \max(0, z)$. Computationally cheap, and because its gradient is 1 for all positive $z$ (rather than saturating like sigmoid or tanh), it avoids the vanishing-gradient problem for positive inputs.
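All three functions are one-liners; a quick sketch to compare their output ranges:

```python
import numpy as np

def sigmoid(z):
    """Squashes z into (0, 1); useful for probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Passes positive z through unchanged, zeroes out negatives."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # values strictly between 0 and 1
print(np.tanh(z))   # values between -1 and 1, zero-centered
print(relu(z))      # [0. 0. 2.]
```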
Extending to a Multi-Layer Perceptron (MLP).
A single perceptron can only solve linearly separable problems. By stacking layers of neurons, we build a multi-layer perceptron (MLP):
- Input layer: receives the raw features.
- Hidden layer(s): intermediate layers of neurons that learn increasingly abstract representations.
- Output layer: produces the final prediction.
Every neuron in one layer connects to every neuron in the next (fully connected). The hidden and output neurons each apply their own activation function.
Feed-forward networks.
An MLP is the simplest example of a feed-forward neural network: information flows in one direction only — from inputs, through the hidden layers, to the output. There are no cycles or feedback loops.
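A minimal forward-pass sketch of such a network (layer sizes and random parameters below are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params):
    """Feed-forward pass: data flows input -> hidden -> output, no cycles.
    Each layer is fully connected: W has one row per neuron in that layer."""
    a = x
    for W, b in params:
        a = sigmoid(W @ a + b)  # every neuron applies its activation function
    return a

rng = np.random.default_rng(0)  # random illustrative parameters
params = [
    (rng.normal(size=(3, 2)), np.zeros(3)),  # hidden layer: 2 inputs -> 3 neurons
    (rng.normal(size=(1, 3)), np.zeros(1)),  # output layer: 3 -> 1
]
y_hat = mlp_forward(np.array([0.5, -1.0]), params)
print(y_hat.shape)  # (1,)
```

The loop runs strictly forward through the list of layers, which is exactly the "no cycles or feedback loops" property described above.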
Training an MLP: backpropagation.
Unlike the perceptron, an MLP with differentiable activation functions (sigmoid, tanh, ReLU) can be trained with gradient descent. The key tool is backpropagation:
- Compute the loss (e.g., cross-entropy or mean squared error) at the output.
- Use the chain rule to propagate gradients of the loss backward through each layer.
- Update all weights and biases simultaneously using the gradient descent update rule.
This combination — differentiable activations + backpropagation + gradient descent — is the foundation of modern deep learning.
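As a closing sketch, the three steps above (loss, chain rule, simultaneous updates) fit in a few lines of NumPy for a tiny 2-3-1 sigmoid network. The network size, learning rate, squared-error loss, and XOR dataset are all illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
# Tiny 2-3-1 network; sizes and learning rate are illustrative
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
eta = 1.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])  # XOR: unsolvable for a single perceptron

for _ in range(5000):
    for xi, yi in zip(X, y):
        # 1. Forward pass and loss gradient at the output
        h = sigmoid(W1 @ xi + b1)
        y_hat = sigmoid(W2 @ h + b2)
        # 2. Chain rule: propagate the error backward, layer by layer
        delta2 = (y_hat - yi) * y_hat * (1 - y_hat)  # output-layer error
        delta1 = (W2.T @ delta2) * h * (1 - h)       # propagated to hidden layer
        # 3. Gradient-descent updates on all weights and biases
        W2 -= eta * np.outer(delta2, h); b2 -= eta * delta2
        W1 -= eta * np.outer(delta1, xi); b1 -= eta * delta1

preds = [round(float(sigmoid(W2 @ sigmoid(W1 @ xi + b1) + b2)[0])) for xi in X]
print(preds)
```

The deltas here use the derivative of the sigmoid, $\sigma'(z) = \sigma(z)(1-\sigma(z))$, which is why a differentiable activation is required; the Heaviside step would give a gradient of zero everywhere.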