Assessing ML workflow performance

Date

Monday, March 9, 2026

Notes

In class, we stepped back from individual architectures to consider how to evaluate and refine an entire ML workflow — from data through model choice through hyperparameter selection.


Below are key concepts from this lecture:

The ML workflow

A complete pipeline:

$$\text{Acquire data} \to \text{Filter} \to \text{Reduce dimensions} \xrightarrow{\text{(optional)}} \text{Resample} \to \text{Train model} \to \text{Outputs} \to \text{Downstream tasks}$$

Each stage introduces choices that affect the final result and can be a source of error.

Evaluating model performance

Going beyond a single accuracy number:

  • Evaluate on a held-out test set, used for neither training nor tuning, to get an unbiased estimate of performance on new data from the same distribution.
  • Apply to a benchmark dataset with known ground truth; compare against published baselines.
  • Test on entirely unseen data from a new collection, time period, or geographic region to probe true generalization.
  • Examine residuals for systematic bias or structure that aggregate metrics hide.
  • Assess performance on edge cases and rare classes, not just overall accuracy.
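
To make the last point concrete, here is a minimal sketch of how a high overall accuracy can hide complete failure on a rare class. The labels, the always-predict-majority classifier, and the helper name `per_class_recall` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {class: fraction of that class's examples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / n for cls, n in totals.items()}

# Hypothetical labels: 18 "common" examples and 2 "rare" ones.
y_true = ["common"] * 18 + ["rare"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["common"] * 20

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print(f"overall accuracy: {accuracy:.2f}")
for cls, r in recall.items():
    print(f"recall[{cls}]: {r:.2f}")
```

The aggregate number looks respectable, while the per-class breakdown shows that no rare example is ever recovered.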

Evaluating model choice

Choosing between candidates fairly:

  • Train multiple models and compare on the same test set.
  • Use cross-validation for a more robust generalization estimate — a single split may be lucky or unlucky.
  • Compare every candidate against a baseline (e.g., a simple mean predictor) to check whether added complexity helps.
  • Consider interpretability and computational cost alongside raw accuracy.
  • Check whether performance differences are statistically significant — similar scores on a small test set may reflect noise.
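
The cross-validation and baseline points above can be sketched together. The example below is an illustration under stated assumptions, not the method used in class: the data are synthetic, the candidate model is a one-variable least-squares line, and the helper names (`k_fold_indices`, `fit_mean`, `fit_line`, `cv_mse`) are made up for this sketch. Both models are scored on identical folds so the comparison is fair.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline: ignore x and always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=5):
    """Return the held-out mean squared error of `fit` on each of k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((model(xs[i]) - ys[i]) ** 2 for i in held_out) / len(held_out)
        scores.append(mse)
    return scores

# Synthetic data (an assumption for this sketch): a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

baseline_scores = cv_mse(fit_mean, xs, ys)
line_scores = cv_mse(fit_line, xs, ys)

print("baseline MSE per fold:", [round(s, 2) for s in baseline_scores])
print("line MSE per fold:    ", [round(s, 2) for s in line_scores])
```

The spread of the per-fold scores gives a rough sense of how much a single split could mislead; a formal significance test would need more than this sketch provides.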

Parameters vs. hyperparameters

Parameters are learned via training (e.g., neural network weights and biases).

Hyperparameters are set before training begins: number of trees in a random forest, learning rate, network depth, regularization strength.

The hyperparameter space can be explored systematically, analogously to how parameters are optimized.
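
As a small illustration of the distinction, consider a one-parameter ridge regression. In the sketch below (toy data and the function name `fit_ridge_slope` are assumptions for illustration), the slope `w` is a parameter learned from the data, while the regularization strength `lam` is a hyperparameter fixed before fitting.

```python
def fit_ridge_slope(xs, ys, lam):
    """Learn w minimising sum((y - w*x)**2) + lam * w**2 (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam is chosen by us before training; w is whatever training returns.
slopes = {lam: fit_ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0)}
for lam, w in slopes.items():
    print(f"lam={lam:>4}: learned w={w:.3f}")
```

Changing `lam` changes which `w` training produces, which is why hyperparameters have to be selected by an outer search rather than by the training procedure itself.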

Random forest hyperparameters: number of trees, maximum depth, number of features at each split, minimum samples to split a node, minimum samples in a leaf, whether to use bootstrap resampling.
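
The notes do not name a library. Assuming scikit-learn's `RandomForestClassifier`, the six hyperparameters listed above would correspond to the keyword arguments below. The values are illustrative only, not recommendations.

```python
# Assumed mapping to scikit-learn keyword names; values are placeholders.
rf_hyperparameters = {
    "n_estimators": 100,       # number of trees
    "max_depth": None,         # maximum depth (None = grow until leaves are pure)
    "max_features": "sqrt",    # number of features considered at each split
    "min_samples_split": 2,    # minimum samples to split a node
    "min_samples_leaf": 1,     # minimum samples in a leaf
    "bootstrap": True,         # whether to use bootstrap resampling
}

for name, value in rf_hyperparameters.items():
    print(f"{name} = {value!r}")
```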

Grid search exhaustively evaluates every combination of values in a specified hyperparameter grid.

With a $3 \times 3 \times 2$ grid and 5-fold cross-validation: $$3 \times 3 \times 2 \times 5 = 90 \text{ model fits}$$

Grid search guarantees finding the best-scoring combination within the grid, but the number of fits is the product of the grid sizes, so it becomes prohibitively expensive as the number of hyperparameters grows.
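
The fit count can be checked by enumerating a grid directly. This is a minimal sketch: the grid values are arbitrary, and `cv_score` is a hypothetical stand-in for "fit on the training folds, score on the held-out fold" so that the example runs without any ML library.

```python
from itertools import product

# A 3 x 3 x 2 grid; names and values are illustrative assumptions.
grid = {
    "n_trees": [50, 100, 200],
    "max_depth": [4, 8, 16],
    "bootstrap": [True, False],
}
n_folds = 5

def cv_score(config, fold):
    """Hypothetical stand-in for fitting a model and scoring one fold."""
    penalty = abs(config["n_trees"] - 100) / 100 + abs(config["max_depth"] - 8) / 8
    bonus = 0.05 if config["bootstrap"] else 0.0
    return 1.0 - penalty + bonus - 0.001 * fold

names = list(grid)
n_fits = 0
results = []
for values in product(*(grid[name] for name in names)):
    config = dict(zip(names, values))
    fold_scores = []
    for fold in range(n_folds):
        fold_scores.append(cv_score(config, fold))
        n_fits += 1  # one model fit per (combination, fold) pair
    results.append((sum(fold_scores) / n_folds, config))

best_score, best_config = max(results, key=lambda r: r[0])
print("combinations:", len(results))
print("model fits:  ", n_fits)
print("best config: ", best_config)
```

Adding a fourth hyperparameter with three candidate values would triple both counts, which is the cost growth described above.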

Other hyperparameter search methods

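  • Random search: sample combinations at random from specified ranges or distributions. Often finds good solutions faster than grid search in high-dimensional spaces, because every trial tries a new value of each hyperparameter, so the few hyperparameters that matter most are covered more densely for the same budget.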

  • Bayesian optimization: build a cheap surrogate model of the objective function (e.g., validation score as a function of the hyperparameters) and use it to propose the most promising configurations to evaluate next. Typically needs fewer evaluations than random search because it learns from previous ones.

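  • Gradient-based optimization: differentiate the validation loss with respect to the hyperparameters, through the training procedure, and update them by gradient steps. Only applicable when the hyperparameters are continuous and training is differentiable, which restricts it to certain model families.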
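
The random search item can be illustrated by comparing how a fixed budget is spent. In the sketch below, the `objective` function is a made-up validation score that depends strongly on one hyperparameter (`lr`) and only weakly on the other (`reg`). The names and ranges are assumptions for illustration.

```python
import random

def objective(lr, reg):
    """Hypothetical validation score: sensitive to lr, nearly flat in reg."""
    return -(lr - 0.37) ** 2 - 0.001 * reg

budget = 9

# Grid search: 3 values per axis uses the whole budget of 9 trials.
grid_axis = [0.0, 0.5, 1.0]
grid_trials = [(lr, reg) for lr in grid_axis for reg in grid_axis]

# Random search: 9 trials, each drawing both hyperparameters afresh.
rng = random.Random(0)
random_trials = [(rng.random(), rng.random()) for _ in range(budget)]

best_grid = max(grid_trials, key=lambda t: objective(*t))
best_random = max(random_trials, key=lambda t: objective(*t))

# How many different values of the important hyperparameter were tried?
distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})

print("distinct lr values, grid:  ", distinct_lr_grid)
print("distinct lr values, random:", distinct_lr_random)
print("best grid trial:  ", best_grid, round(objective(*best_grid), 4))
print("best random trial:", best_random, round(objective(*best_random), 4))
```

With the same budget, the grid probes only three values of `lr`, while random search probes a different one on every trial. Which method ends up with the better score in a given run depends on the seed and on the objective.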