Classification
Date: Monday, February 2, 2026
Links of interest
Notes
In class, we discussed:
Hierarchical clustering analysis (HCA), in which we group objects together into nested clusters. Results are visualized in a dendrogram. Like K-means, HCA is an example of unsupervised learning. There are two kinds of HCA:
- Agglomerative (bottom-up): Start with each observation as its own cluster and merge until one all-encompassing cluster remains.
- Divisive (top-down): Start with one large cluster and split into smaller clusters until each observation is its own cluster (or a stopping condition is met).
HCA relies on a distance metric and a linkage method (a clustering criterion that measures dissimilarity between two sets as a function of pairwise distances). Common linkage methods include maximum (complete), minimum (single), average, centroid, and Ward’s (which minimizes within-cluster variance).
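A minimal sketch of agglomerative HCA with Ward’s linkage, assuming SciPy is available; the toy data (two well-separated point groups) is invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (assumed): two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)),
               rng.normal(5, 0.1, (5, 2))])

# Agglomerative (bottom-up) HCA with Ward's linkage,
# which minimizes within-cluster variance at each merge.
Z = linkage(X, method="ward")

# "Cut" the dendrogram to obtain two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the corresponding dendrogram.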
Issues with HCA:
- HCA is computationally expensive.
- Sensitive to noise and outliers.
- Different distance metrics and linkage methods produce different results.
- One must make an arbitrary choice about where to “cut” the dendrogram.
- HCA is “locally greedy” — it picks the best solution at each step, not the globally optimal one.
A formal definition of classification: predicting a qualitative (categorical) response by finding a boundary that separates the data into classes. Classification is an example of supervised learning.
Metrics for evaluating the performance of a binary classifier (where data are labeled positive, $P$, or negative, $N$):
True positives ($TP$) plus false negatives ($FN$) equal the total number of actually positive entries.
True negatives ($TN$) plus false positives ($FP$) equal the total number of actually negative entries.
The confusion matrix:
$$\begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}$$
Error: $err = \frac{FP + FN}{TP + FP + TN + FN}$ (fraction misclassified; ideally 0).
Accuracy: $acc = 1 - err$ (fraction correctly classified; ideally 1).
Recall (sensitivity, TP-rate): $TPR = \frac{TP}{TP + FN}$ (ideally 1).
Specificity (TN-rate): $TNR = \frac{TN}{TN + FP}$ (ideally 1).
Precision: $pr = \frac{TP}{TP + FP}$ (ideally 1). Note that precision and recall typically trade off: tuning a classifier to raise one tends to lower the other.
F1 score: $F_1 = \frac{TP}{TP + (FN + FP)/2}$, the harmonic mean of precision and recall (ideally 1).
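The metrics above can be computed directly from the four confusion-matrix counts; the counts here are toy values assumed for illustration:

```python
# Toy confusion-matrix counts (assumed values).
TP, FN, FP, TN = 40, 10, 5, 45
total = TP + FP + TN + FN

err = (FP + FN) / total           # fraction misclassified
acc = 1 - err                     # fraction correctly classified
recall = TP / (TP + FN)           # sensitivity, TP-rate
specificity = TN / (TN + FP)      # TN-rate
precision = TP / (TP + FP)
f1 = TP / (TP + (FN + FP) / 2)    # harmonic mean of precision and recall

print(acc, recall, specificity, precision, f1)
```

Note that $F_1 = \frac{TP}{TP + (FN + FP)/2} = \frac{2 \cdot pr \cdot TPR}{pr + TPR}$, which the assertion-friendly identity in the code reflects.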
The Receiver Operating Characteristic (ROC) curve, which plots the true positive rate vs. the false positive rate for different threshold values.
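A minimal sketch of how ROC points are generated, using invented scores and labels; each threshold yields one (FPR, TPR) point on the curve:

```python
import numpy as np

# Toy classifier scores and true labels (assumed for illustration).
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
y_true = np.array([1,   1,   0,   1,   0,   0])

# Sweep thresholds; at each, predict positive when score >= threshold.
for t in [0.0, 0.5, 1.0]:
    pred = scores >= t
    tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

At threshold 0 everything is labeled positive (TPR = FPR = 1); at a threshold above every score, nothing is (TPR = FPR = 0); intermediate thresholds trace out the curve between those corners.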
Splitting data to train and evaluate a classifier:
- Training set (60–80%): Used to fit the model.
- Validation set (10–20%): Used to evaluate the model fit during training.
- Test set (10–20%): Used to evaluate final model performance. This set should be set aside and only used at the very end.
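A sketch of a random three-way split, assuming a 70/15/15 division (one choice within the ranges above) over 100 observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
indices = rng.permutation(n)   # shuffle before splitting

# 70% train / 15% validation / 15% test.
train_idx = indices[:70]
val_idx   = indices[70:85]
test_idx  = indices[85:]       # set aside; only used at the very end

print(len(train_idx), len(val_idx), len(test_idx))
```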
K-fold cross-validation, which is useful when data are limited:
- Split the data into $k$ (typically 5–10) equally sized “folds.”
- Train on $k-1$ folds, using the remaining fold as a validation set.
- Repeat $k$ times, rotating which fold is held out.
- Average performance across all $k$ runs.
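The fold rotation above can be sketched directly; this minimal version assumes $n = 20$ observations and $k = 5$, and only prints split sizes where a model fit and evaluation would go:

```python
import numpy as np

n, k = 20, 5
indices = np.arange(n)
folds = np.array_split(indices, k)   # k roughly equal-sized folds

for i, val_fold in enumerate(folds):
    # Train on the other k-1 folds; hold out fold i for validation.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # (Fit the model on train_idx and score it on val_fold here,
    # then average the k scores at the end.)
    print(f"fold {i}: train={len(train_idx)}, val={len(val_fold)}")
```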