Classification

Date

Monday, February 2, 2026

Notes

In class, we discussed:

  1. Hierarchical clustering analysis (HCA), in which we group objects into nested clusters. Results are visualized in a dendrogram. Like K-means, HCA is an example of unsupervised learning. There are two kinds of HCA:

    • Agglomerative (bottom-up): Start with each observation as its own cluster and merge until one all-encompassing cluster remains.
    • Divisive (top-down): Start with one large cluster and split into smaller clusters until each observation is its own cluster (or a stopping condition is met).
  2. HCA relies on distance and a linkage method (a clustering criterion that measures dissimilarity between two sets as a function of pairwise distances). Common linkage methods include maximum, minimum, average, centroid, and Ward’s (which minimizes within-cluster variance).

  3. Issues with HCA:

    • HCA is computationally expensive.
    • Sensitive to noise and outliers.
    • Different distance metrics and linkage methods produce different results.
    • One must make an arbitrary choice about where to “cut” the dendrogram.
    • HCA is “locally greedy” — it picks the best solution at each step, not the globally optimal one.
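The agglomerative procedure and the dendrogram "cut" above can be sketched with SciPy. This is a minimal illustration, not class code: the two-blob data, the choice of Ward's linkage, and the cut into two clusters are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated blobs in 2-D (synthetic data for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 2)),  # blob near (0, 0)
    rng.normal(loc=5.0, scale=0.3, size=(10, 2)),  # blob near (5, 5)
])

# Agglomerative (bottom-up) HCA with Ward's linkage.
# Z records the n-1 greedy merge steps: which two clusters were merged,
# at what distance, and the size of the resulting cluster.
Z = linkage(X, method="ward")
print(Z.shape)  # (19, 4)

# "Cut" the dendrogram to get 2 flat clusters -- the arbitrary choice
# the notes warn about.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Swapping `method="ward"` for `"single"`, `"complete"`, `"average"`, or `"centroid"` demonstrates the point above that different linkage methods can produce different clusterings.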
  4. A formal definition of classification: predicting a qualitative (categorical) response by finding a boundary that separates the data into classes. Classification is an example of supervised learning.

  5. Metrics for evaluating the performance of a binary classifier (where data are labeled positive, $P$, or negative, $N$):

    • True positives ($TP$) and false negatives ($FN$) together account for the actual positives: $P = TP + FN$.

    • True negatives ($TN$) and false positives ($FP$) together account for the actual negatives: $N = TN + FP$.

    • The confusion matrix:

      $$\begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}$$

    • Error: $err = \frac{FP + FN}{TP + FP + TN + FN}$ (fraction misclassified; ideally 0).

    • Accuracy: $acc = 1 - err$ (fraction correctly classified; ideally 1).

    • Recall (sensitivity, TP-rate): $TPR = \frac{TP}{TP + FN}$ (ideally 1).

    • Specificity (TN-rate): $TNR = \frac{TN}{TN + FP}$ (ideally 1).

    • Precision: $pr = \frac{TP}{TP + FP}$ (ideally 1). Note that precision and recall typically trade off against each other: tuning a classifier to increase one tends to decrease the other.

    • F1 score: $F_1 = \frac{TP}{TP + (FN + FP)/2}$, the harmonic mean of precision and recall (ideally 1).
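The metrics above can be checked on a small example. This is a sketch assuming scikit-learn is available; the toy labels are made up for illustration. Note that `confusion_matrix` orders entries as $[[TN, FP], [FN, TP]]$ for labels $(0, 1)$, which differs from the $[[TP, FN], [FP, TN]]$ convention in the notes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels: 1 = positive, 0 = negative (made-up data for illustration).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

err = (fp + fn) / (tp + fp + tn + fn)   # 0.2  (fraction misclassified)
acc = 1 - err                           # 0.8
tpr = tp / (tp + fn)                    # recall = 0.75
tnr = tn / (tn + fp)                    # specificity = 5/6
pr  = tp / (tp + fp)                    # precision = 0.75
f1  = tp / (tp + (fn + fp) / 2)         # 0.75

# Cross-check the hand formulas against sklearn's implementations.
assert pr == precision_score(y_true, y_pred)
assert tpr == recall_score(y_true, y_pred)
assert f1 == f1_score(y_true, y_pred)
```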

  6. The Receiver Operating Characteristic (ROC) curve, which plots the true positive rate vs. the false positive rate for different threshold values.
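A ROC curve can be computed from classifier scores with scikit-learn (assumed available); the scores below are hypothetical. Each threshold yields one (FPR, TPR) point on the curve.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical classifier scores (higher = more likely positive).
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# Sweeping the decision threshold traces out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"FPR = {f:.2f}, TPR = {t:.2f}")
```

The curve runs from (0, 0) at the strictest threshold (predict everything negative) to (1, 1) at the loosest (predict everything positive); a perfect classifier hugs the top-left corner.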

  7. Splitting data to train and evaluate a classifier:

    • Training set (60–80%): Used to fit the model.
    • Validation set (10–20%): Used to tune hyperparameters and evaluate the model fit during training.
    • Test set (10–20%): Used to evaluate final model performance. This set should be set aside and only used at the very end.
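A 60/20/20 split can be obtained with two successive calls to scikit-learn's `train_test_split` (a sketch on dummy data; the exact percentages are one choice within the ranges above): first carve off the test set, then split the remainder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy dataset: 50 observations with 2 features each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Hold out 20% as the test set, touched only at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remaining 80% into 75/25, i.e. 60/20 of the original.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10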
  8. K-fold cross-validation, which is useful when data are limited:

    1. Split the data into $k$ (typically 5–10) equally sized “folds.”
    2. Train on $k-1$ folds, using the remaining fold as a validation set.
    3. Repeat $k$ times, rotating which fold is held out.
    4. Average performance across all $k$ runs.
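The four steps above can be sketched with scikit-learn's `KFold` (assumed available; the 10-observation array is a stand-in for real data, and the fold count $k = 5$ is one choice from the typical range):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # stand-in for 10 observations

# Step 1: split into k = 5 equally sized folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

held_out = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Steps 2-3: train on k-1 = 4 folds, validate on the held-out fold,
    # rotating which fold is held out each time.
    print(f"fold {fold}: train on {len(train_idx)}, validate on {len(val_idx)}")
    # ... fit on X[train_idx], score on X[val_idx], collect the metric ...
    held_out.extend(val_idx)

# Every observation is held out exactly once across the k runs;
# step 4 would average the k validation scores.
assert sorted(held_out) == list(range(10))
```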