Neural networks (beginning with the perceptron)

Date

Friday, February 13, 2026

Notes

In class, we introduced the perceptron — the simplest building block of a neural network — and extended it to a multi-layer architecture. We discussed:

  1. The perceptron: structure.

    A perceptron takes one or more inputs, applies weights and a bias, and passes the result through an activation function to produce an output:

    $$z = \sum_i w_i x_i + b, \qquad \hat{y} = f(z)$$

    Components:

    • Inputs ($x_0, x_1, \ldots$): the features fed into the neuron.
    • Weights ($w_0, w_1, \ldots$): learned parameters that scale each input.
    • Bias ($b$): a learned offset that shifts the activation threshold.
    • Activation function ($f$): transforms the weighted sum into an output.
  2. The Heaviside (step) activation function.

    The original perceptron uses a hard threshold:

    $$f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

    The bias $b$ shifts the location of this threshold along the $z$-axis.
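The structure and activation above can be sketched in a few lines of Python; the weights, bias, and inputs here are illustrative values chosen for the example, not anything from class:

```python
def heaviside(z):
    """Hard-threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Weighted sum of inputs plus bias, passed through the step function."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return heaviside(z + b)

# Example with two inputs and hand-picked weights:
# z = 0.4*1.0 + (-0.2)*0.5 + (-0.1) = 0.2 >= 0, so the output is 1.
print(perceptron([1.0, 0.5], w=[0.4, -0.2], b=-0.1))
```

Note how a more negative bias $b$ pushes $z + b$ below zero for more inputs, shifting the effective threshold.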

  3. What a perceptron can do.

    A perceptron with a step activation function is a binary classifier: it separates data into two classes. This works only when the two classes are linearly separable — that is, a single hyperplane (line in 2D) can divide them.
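As a concrete linearly separable case, logical AND can be computed by a single perceptron. The weights and bias below are one hand-picked separating choice (they are not unique):

```python
def and_perceptron(x1, x2):
    # Weights (1, 1) and bias -1.5 place the separating line between
    # the point (1, 1) and the other three corners of the unit square.
    z = 1.0 * x1 + 1.0 * x2 - 1.5
    return 1 if z >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", and_perceptron(x1, x2))  # 1 only for (1, 1)
```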

  4. Historical context.

    The perceptron was invented by Frank Rosenblatt in 1958. The Mark I Perceptron was a physical machine built at the Cornell Aeronautical Laboratory — one of the first hardware implementations of a learning algorithm.

  5. Training a perceptron: the perceptron training rule.

    Gradient descent cannot be applied directly to the perceptron because the Heaviside step function is non-differentiable. Instead, we use the perceptron training rule, which updates the weights after each individual training example:

    1. Pick a training example $(\mathbf{x}, y)$.
    2. Compute the prediction: $\hat{y} = f\left(\sum_i w_i x_i + b\right)$.
    3. If $\hat{y} \neq y$, update: $$w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i, \qquad b \leftarrow b + \eta\,(y - \hat{y})$$ where $\eta$ is the learning rate.
    4. Move to the next training example and repeat until all examples are correctly classified (or a stopping criterion is met).

    This is an online (or sequential) algorithm — weights are updated one example at a time — in contrast to a batch approach, which computes updates over an entire dataset or mini-batch.
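The four steps above can be sketched as follows; the dataset (logical AND) and learning rate are illustrative assumptions:

```python
def train_perceptron(data, eta=0.1, max_epochs=100):
    """data: list of (x, y) pairs with x a list of features and y in {0, 1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:                       # online: one example at a time
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            y_hat = 1 if z >= 0 else 0
            if y_hat != y:                      # update only on mistakes
                for i in range(n):
                    w[i] += eta * (y - y_hat) * x[i]
                b += eta * (y - y_hat)
                errors += 1
        if errors == 0:                         # stopping criterion:
            break                               # everything classified correctly
    return w, b

and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_data)
```

Because AND is linearly separable, this loop reaches zero errors and stops; on non-separable data it would run until `max_epochs` instead.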

  6. The perceptron as an artificial neuron.

    The perceptron is the simplest example of an artificial neuron, a computational unit loosely inspired by biological neurons. All artificial neurons share the same five components: inputs, weights, a bias, an activation function, and an output.

  7. Activation functions and non-linearity.

    More powerful activation functions introduce non-linearity, enabling networks of neurons to solve complex, non-linear problems. Three common choices:

    • Sigmoid: $\sigma(z) = \dfrac{1}{1+e^{-z}}$, outputs in $(0, 1)$. Useful for probabilities.
    • Tanh: $\tanh(z)$, outputs in $(-1, 1)$. Zero-centered, often faster to converge than sigmoid.
    • ReLU (Rectified Linear Unit): $f(z) = \max(0, z)$. Computationally cheap, and its gradient does not saturate for positive $z$, which mitigates the vanishing-gradient problem.
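The three activations can be written directly from their definitions using only the standard library:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # output in (0, 1)

def tanh(z):
    return math.tanh(z)                # output in (-1, 1), zero-centered

def relu(z):
    return max(0.0, z)                 # 0 for negative z, identity otherwise

# At z = 0: sigmoid gives 0.5, tanh gives 0.0, and relu(-2) = 0.0, relu(3) = 3.0.
print(sigmoid(0.0), tanh(0.0), relu(-2.0), relu(3.0))
```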

  8. Extending to a Multi-Layer Perceptron (MLP).

    A single perceptron can only solve linearly separable problems. By stacking layers of neurons, we build a multi-layer perceptron (MLP):

    • Input layer: receives the raw features.
    • Hidden layer(s): intermediate layers of neurons that learn increasingly abstract representations.
    • Output layer: produces the final prediction.

    Every neuron in one layer connects to every neuron in the next (fully connected). The hidden and output neurons each apply their own activation function.
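A fully connected forward pass can be sketched as repeated application of one layer function. The layer sizes (2 inputs, 2 hidden neurons, 1 output) and all weights below are arbitrary illustrative values, not trained parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W, b):
    """Fully connected layer: every input feeds every neuron in this layer."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

x = [0.5, -1.0]                                            # input layer: raw features
h = layer(x, W=[[0.1, 0.4], [-0.3, 0.2]], b=[0.0, 0.1])    # hidden layer
y = layer(h, W=[[0.7, -0.5]], b=[0.2])                     # output layer
```

Information moves strictly from `x` through `h` to `y`, which is exactly the feed-forward property described in the next item.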

  9. Feed-forward networks.

    An MLP is the simplest example of a feed-forward neural network: information flows in one direction only — from inputs, through the hidden layers, to the output. There are no cycles or feedback loops.

  10. Training an MLP: backpropagation.

    Unlike the perceptron, an MLP with differentiable activation functions (sigmoid, tanh, ReLU) can be trained with gradient descent. The key tool is backpropagation:

    • Compute the loss (e.g., cross-entropy or mean squared error) at the output.
    • Use the chain rule to propagate gradients of the loss backward through each layer.
    • Update all weights and biases simultaneously using the gradient descent update rule.

    This combination — differentiable activations + backpropagation + gradient descent — is the foundation of modern deep learning.
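The three bullets above can be sketched for a tiny 2-2-1 network with sigmoid activations and squared-error loss $L = \tfrac{1}{2}(\hat{y} - y)^2$; the network size, initial weights, and learning rate are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    # hidden layer (2 neurons), then a single output neuron
    h = [sigmoid(w1[i][0] * x[0] + w1[i][1] * x[1] + b1[i]) for i in range(2)]
    y_hat = sigmoid(w2[0] * h[0] + w2[1] * h[1] + b2)
    return h, y_hat

def backprop_step(x, y, w1, b1, w2, b2, eta=0.5):
    h, y_hat = forward(x, w1, b1, w2, b2)
    # error term at the output: dL/dz_out, using sigmoid'(z) = s(z)(1 - s(z))
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)
    # chain rule: propagate the error back to each hidden neuron
    delta_h = [delta_out * w2[i] * h[i] * (1 - h[i]) for i in range(2)]
    # gradient descent update on every weight and bias at once
    w2 = [w2[i] - eta * delta_out * h[i] for i in range(2)]
    b2 = b2 - eta * delta_out
    w1 = [[w1[i][j] - eta * delta_h[i] * x[j] for j in range(2)] for i in range(2)]
    b1 = [b1[i] - eta * delta_h[i] for i in range(2)]
    return w1, b1, w2, b2
```

Each call to `backprop_step` performs one complete forward pass, one backward pass, and one gradient descent update, so the squared error on that example shrinks step by step.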