Gradient descent

Date

Monday, February 9, 2026

Notes

In class, we:

  1. Reviewed the previous week’s presentations. Some feedback I provided:

    • Motivate first: your motivation should be concise and clear.
    • Introduce everything: do not assume that your audience has the same level of knowledge as you.
    • Provide context: tell us what has already been done and what you are going to do differently.
    • Use buildups and callouts: if you are sharing many things on a single slide, have each item show up separately.
  2. Discussed how we ensure that a model gives us a good enough estimate of $y$ for a given $x$: we train the model by finding optimal parameters (weights) that minimize the difference between predictions and ground truths.

  3. Reviewed loss (applied to a single data point) and cost (applied to an entire dataset or subset) functions as the tools we use to determine if our model is doing well or poorly.


The remainder of the lecture focused on gradient descent, which is a general-purpose optimization algorithm for finding model parameters. We:

  1. Applied gradient descent to a simple case: linear regression.

    • Hypothesis: For $m$ training examples, where $x^{(i)}$ is the $i$-th input and $y^{(i)}$ is its known output, the linear model is:

      $$h_\theta(x) = \theta_0 + \theta_1 x$$

      where $\theta_0$ is the intercept and $\theta_1$ is the slope.

    • Loss function (squared error for a single data point):

      $$L^{(i)}(\theta_0, \theta_1) = \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

    • Cost function (half the average squared error over $m$ data points; the factor of $\frac{1}{2}$ is conventional and cancels the 2 produced by differentiating the square):

      $$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

    • Goal: Find $\theta_0$ and $\theta_1$ that minimize $J(\theta_0, \theta_1)$.

  2. Discussed how the negative gradient of the cost function gives us the “direction of steepest descent”:

    $$-\nabla J(\theta_0, \theta_1) = -\left(\frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}\right)$$

  3. Learned about the update rule, which we repeat until convergence (i.e., until the parameters change very little between iterations):

    $$\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

    $$\theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

    where $\alpha$ is the learning rate. Note that for $\theta_1$, the error is weighted by $x^{(i)}$: by the chain rule, $\partial h_\theta(x^{(i)}) / \partial \theta_1 = x^{(i)}$, so a change in the slope shifts the prediction in proportion to the input. It is important to update $\theta_0$ and $\theta_1$ simultaneously (compute both partial derivatives before updating either).

  4. Discussed the role of the learning rate $\alpha$:

    • If $\alpha$ is too small: convergence is very slow (many tiny steps).
    • If $\alpha$ is too large: we may overshoot the minimum and diverge.
    • In practice, start with a reasonable $\alpha$ (e.g., 0.01). As gradient descent approaches a minimum, the gradient naturally gets smaller, so steps get smaller too — even with a fixed $\alpha$.
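The pieces above fit together in a few lines of code. Here is a minimal Python sketch of batch gradient descent for the linear model; the toy dataset, $\alpha$, and iteration count are made-up values for illustration, not from the lecture.

```python
# Batch gradient descent for the linear model h(x) = theta0 + theta1 * x.
# A minimal sketch; the toy data, alpha, and iteration count are
# illustrative assumptions.

def cost(xs, ys, theta0, theta1):
    # J(theta0, theta1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, alpha=0.05, iters=10_000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        # Compute both partial derivatives BEFORE updating either
        # parameter (simultaneous update).
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data generated from y = 2x + 1, so gradient descent should
# recover theta0 ≈ 1 and theta1 ≈ 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
t0, t1 = gradient_descent(xs, ys)
```

Since the toy data are exactly linear, the cost can be driven essentially to zero, and the recovered parameters match the generating line.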
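The effect of the learning rate can also be seen numerically: on a small made-up dataset, compare the cost after a fixed number of iterations for a too-small, a reasonable, and a too-large $\alpha$. All values here are illustrative assumptions.

```python
# One gradient-descent step for the linear model h(x) = t0 + t1 * x,
# used to compare different learning rates. Dataset and alphas are
# made up for illustration.

def step(xs, ys, t0, t1, alpha):
    m = len(xs)
    errors = [(t0 + t1 * x) - y for x, y in zip(xs, ys)]
    t0_new = t0 - alpha * sum(errors) / m                             # theta_0 update
    t1_new = t1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m  # theta_1 update
    return t0_new, t1_new

def cost(xs, ys, t0, t1):
    # J(theta_0, theta_1) = (1 / 2m) * sum of squared errors
    m = len(xs)
    return sum(((t0 + t1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def final_cost(alpha, iters=100):
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        t0, t1 = step(xs, ys, t0, t1, alpha)
    return cost(xs, ys, t0, t1)

too_small = final_cost(0.001)  # converges, but slowly: cost is still large
reasonable = final_cost(0.05)  # converges quickly: cost is near zero
too_large = final_cost(0.35)   # overshoots every step: cost grows (diverges)
```

After the same 100 iterations, the tiny $\alpha$ has barely made progress, the moderate $\alpha$ is essentially converged, and the large $\alpha$ has diverged, matching the three cases above.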