Assessing ML workflow performance

Date

Monday, March 9, 2026

Notes

In class, we stepped back from individual architectures to consider how to evaluate and refine an entire ML workflow — from data through model choice through hyperparameter selection.

Below are key concepts from this lecture:

The ML workflow

A complete pipeline:

$$\text{Acquire data} \to \text{Filter} \to \text{Reduce dimensions} \xrightarrow{\text{(optional)}} \text{Resample} \to \text{Train model} \to \text{Outputs} \to \text{Downstream tasks}$$

Each stage introduces choices that affect the final result and can be a source of error.

Evaluating model performance

Going beyond a single accuracy number:

Evaluate on a held-out test set to get an unbiased estimate of performance on new data; this only holds if the test set played no part in training or tuning.
Apply to a benchmark dataset with known ground truth; compare against published baselines.
Test on entirely unseen data from a new collection, time period, or geographic region to probe true generalization.
Examine residuals for systematic bias or structure that aggregate metrics hide.
Assess performance on edge cases and rare classes, not just overall accuracy.
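As a minimal sketch of the residual check above: the toy data, the group labels, and the helper name `residual_report` are all made up for illustration. The overall error looks moderate, while the per-group mean residual shows that one group is consistently under-predicted.

```python
from collections import defaultdict
from statistics import mean

def residual_report(y_true, y_pred, groups):
    """Return the overall MAE and the mean signed residual per group."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    overall_mae = mean(abs(r) for r in residuals)
    by_group = defaultdict(list)
    for g, r in zip(groups, residuals):
        by_group[g].append(r)
    mean_residual = {g: mean(rs) for g, rs in by_group.items()}
    return overall_mae, mean_residual

# Hypothetical toy data: the model under-predicts everything in group "B".
y_true = [10.0, 12.0, 11.0, 20.0, 22.0, 21.0]
y_pred = [10.5, 11.5, 11.0, 18.0, 20.0, 19.0]
groups = ["A", "A", "A", "B", "B", "B"]

overall_mae, mean_residual = residual_report(y_true, y_pred, groups)
print(f"overall MAE: {overall_mae:.2f}")
for g, r in sorted(mean_residual.items()):
    print(f"group {g}: mean residual {r:+.2f}")
```

A mean residual near zero in one group and clearly positive in another is the kind of structure a single aggregate metric hides.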

Evaluating model choice

Choosing between candidates fairly:

Train multiple models and compare on the same test set.
Use cross-validation for a more robust generalization estimate — a single split may be lucky or unlucky.
Compare every candidate against a baseline (e.g., a simple mean predictor) to check whether added complexity helps.
Consider interpretability and computational cost alongside raw accuracy.
Check whether performance differences are statistically significant — similar scores on a small test set may reflect noise.
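The comparison above can be sketched in plain Python, assuming a toy 1-D regression problem. The synthetic data, the helper names, and the choice of mean squared error are illustrative assumptions, not something from the lecture. Both candidates are scored on the same five folds, and the mean predictor serves as the baseline.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

def cv_mse(fit, xs, ys, k=5):
    """Return the held-out MSE of each fold for a given fitting function."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean((ys[i] - model(xs[i])) ** 2 for i in fold))
    return scores

# Synthetic data (assumption): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

baseline_scores = cv_mse(fit_mean, xs, ys)
line_scores = cv_mse(fit_line, xs, ys)
print(f"baseline MSE: {mean(baseline_scores):.3f} +/- {stdev(baseline_scores):.3f}")
print(f"line MSE:     {mean(line_scores):.3f} +/- {stdev(line_scores):.3f}")
```

Reporting the spread across folds next to the mean gives a rough sense of whether a gap between candidates is larger than the split-to-split noise.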

Parameters vs. hyperparameters

Parameters are learned via training (e.g., neural network weights and biases).

Hyperparameters are set before training begins: number of trees in a random forest, learning rate, network depth, regularization strength.

The hyperparameter space can be explored systematically, analogously to how parameters are optimized.

Random forest hyperparameters: number of trees, maximum depth, number of features considered at each split, minimum samples to split a node, minimum samples in a leaf, whether to use bootstrap resampling.
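To make the parameter/hyperparameter distinction concrete, here is a small sketch using 1-D ridge regression without an intercept (a model chosen for illustration, not one from the lecture). The regularization strength `alpha` is fixed before fitting; the weight `w` is what training produces.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y = w*x by minimizing sum((y - w*x)**2) + alpha * w**2.

    alpha is a hyperparameter (chosen before training);
    the returned w is a parameter (learned from the data).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)  # closed-form minimizer

# Toy data (assumption): exactly y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 10.0):
    w = fit_ridge_1d(xs, ys, alpha)
    print(f"alpha={alpha:5.1f} -> learned w={w:.4f}")
```

The same data yield a different learned weight for each `alpha`, which is why hyperparameters need their own selection procedure.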

Grid search

Exhaustively evaluate all combinations of a specified hyperparameter grid.

With a $3 \times 3 \times 2$ grid and 5-fold cross-validation: $$3 \times 3 \times 2 \times 5 = 90 \text{ model fits}$$

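Grid search guarantees finding the best combination within the grid (as measured by the cross-validation score), but the number of fits is the product of the grid sizes, so it becomes prohibitively expensive as the number of hyperparameters grows.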
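The fit count above can be reproduced with a bare-bones grid search. This is a sketch only: the grid values are hypothetical (the names follow scikit-learn's random forest conventions), and `toy_cv_score` is a made-up stand-in for "train on the other folds, score on this one" so that the example runs without data. In practice, a library routine such as scikit-learn's `GridSearchCV` would handle the fitting and scoring.

```python
from itertools import product

# Hypothetical 3 x 3 x 2 grid.
grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, None],
    "bootstrap": [True, False],
}
n_folds = 5

def toy_cv_score(params, fold):
    """Made-up deterministic score standing in for one real model fit."""
    depth = params["max_depth"] if params["max_depth"] is not None else 20
    score = 0.8 + 0.0001 * params["n_estimators"] - 0.002 * abs(depth - 10)
    if params["bootstrap"]:
        score += 0.01
    return score - 0.001 * fold  # small fold-to-fold variation

def grid_search(grid, n_folds, score_fn):
    """Evaluate every combination on every fold; return best params, score, fit count."""
    names = list(grid)
    best_params, best_score, n_fits = None, float("-inf"), 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [score_fn(params, fold) for fold in range(n_folds)]
        n_fits += n_folds
        avg = sum(fold_scores) / n_folds
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score, n_fits

best_params, best_score, n_fits = grid_search(grid, n_folds, toy_cv_score)
print("model fits:", n_fits)
print("best combination:", best_params)
```

With 18 combinations and 5 folds the loop performs 90 evaluations; adding a fourth hyperparameter with 3 values would triple that to 270.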

Other hyperparameter search methods

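Random search: sample combinations at random from specified ranges or distributions. Often finds good solutions faster than grid search in high-dimensional spaces, because every trial uses a new value of each hyperparameter, so the few hyperparameters that matter most are covered more densely than on a grid with the same budget.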
Bayesian optimization: build a cheap surrogate model of the objective function and use it to propose the most promising configurations to evaluate next. Typically needs fewer evaluations than random search because each proposal is informed by previous results.
Gradient-based optimization: differentiate the validation loss with respect to the hyperparameters, through the training procedure. Only applicable when the hyperparameters are continuous and training is differentiable.
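The case for random search can be illustrated with a toy comparison at an equal budget of 9 trials. Everything here is an assumption for illustration: the two hyperparameter names, their ranges, and the made-up `toy_validation_loss`, which is built to depend almost entirely on the learning rate.

```python
import math
import random

def toy_validation_loss(learning_rate, dropout):
    """Made-up loss: dominated by learning_rate, barely affected by dropout."""
    return (math.log10(learning_rate) + 2.5) ** 2 + 0.01 * dropout

# Grid search: 3 x 3 = 9 trials, but only 3 distinct learning rates.
grid_points = [(lr, d) for lr in (1e-4, 1e-3, 1e-2) for d in (0.0, 0.25, 0.5)]

# Random search: 9 trials, each with its own learning rate (log-uniform sampling).
rng = random.Random(0)
random_points = [
    (10 ** rng.uniform(-4, -2), rng.uniform(0.0, 0.5)) for _ in range(9)
]

best_grid = min(grid_points, key=lambda p: toy_validation_loss(*p))
best_random = min(random_points, key=lambda p: toy_validation_loss(*p))

print("distinct learning rates (grid):  ", len({lr for lr, _ in grid_points}))
print("distinct learning rates (random):", len({lr for lr, _ in random_points}))
print(f"best grid loss:   {toy_validation_loss(*best_grid):.4f}")
print(f"best random loss: {toy_validation_loss(*best_random):.4f}")
```

For the same number of trials, the grid probes the important axis at only three values, while random sampling probes it at nine. Whether that translates into a lower loss on a real problem depends on how unevenly the hyperparameters matter.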