Classification, continued

Date

Wednesday, February 4, 2026

Notes

In class, we discussed:

  1. Moving from binary to multiclass classification. When we have more than two classes, there are two general strategies for adapting a binary classifier:

    • One-vs-Rest (OvR): Train $n$ binary classifiers, one for each class. For class $k$, the classifier distinguishes class $k$ (positive) from all other classes (negative). At prediction time, pick the class with the highest confidence score.
    • One-vs-One (OvO): Train a classifier for every pair of classes. For $n$ classes, this requires $\binom{n}{2} = \frac{n(n-1)}{2}$ classifiers. Each classifier “votes” for one class and the class with the most votes wins. Note that this can require many classifiers (e.g., 45 for 10 classes).
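The OvO voting scheme above can be sketched in a few lines of plain Python. The pairwise classifiers are faked with a fixed lookup table (`pairwise_winner` is hypothetical, not a library API), so only the vote-counting logic is on display:

```python
from itertools import combinations

classes = ["cat", "dog", "bird"]

# For 3 classes we need C(3, 2) = 3 pairwise classifiers.
pairs = list(combinations(classes, 2))
assert len(pairs) == len(classes) * (len(classes) - 1) // 2

# Pretend each pairwise classifier has already predicted a winner for one input.
pairwise_winner = {("cat", "dog"): "dog",
                   ("cat", "bird"): "cat",
                   ("dog", "bird"): "dog"}

# Each classifier votes for one class; the class with the most votes wins.
votes = {c: 0 for c in classes}
for pair in pairs:
    votes[pairwise_winner[pair]] += 1

prediction = max(votes, key=votes.get)
print(prediction)  # "dog" wins with 2 of the 3 votes
```

The same counting argument gives the $\binom{n}{2}$ growth: with 10 classes, `len(list(combinations(range(10), 2)))` is 45.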
  2. Some algorithms natively handle multiple classes without needing OvR or OvO, including:

    • Decision trees
    • Random forests (ensembles of decision trees)
    • k-nearest neighbors (majority vote among neighbors)
    • Naive Bayes (calculates $P(y = k \mid \mathbf{x})$ for all classes directly)
    • Neural networks (output probabilities for all classes using softmax)
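As one concrete case from the list above, k-nearest neighbors handles any number of classes for free, since prediction is just a majority vote among the k closest training points. A toy sketch (the data and `knn_predict` are illustrative, not a library API):

```python
from collections import Counter
import math

# Tiny labeled training set: (point, class) pairs with three classes.
train = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((5.0, 5.0), "b"),
         ((5.1, 5.0), "b"), ((9.0, 0.0), "c")]

def knn_predict(x, k=3):
    # Sort training points by Euclidean distance to x, then vote among the top k.
    nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

print(knn_predict((0.2, 0.1)))  # the two "a" points are closest, so "a" wins
```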
  3. The softmax function, which converts a vector of raw scores (logits) $\mathbf{z} = [z_1, z_2, \ldots, z_n]$ into probabilities:

    $$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{n} e^{z_j}}$$

    Properties: all outputs are in $(0, 1)$, outputs sum to 1, and larger logits produce higher probabilities. Compared to argmax (“winner takes all”), softmax assigns the highest probability to the winner while still giving smaller probabilities to the rest.
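The formula above translates directly into code. A minimal implementation (subtracting the maximum logit first is a standard numerical-stability trick; it cancels in the numerator and denominator, so the result is unchanged):

```python
import math

def softmax(z):
    # Shift by the max logit so exp() cannot overflow on large scores.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
# Every output lies in (0, 1) and they sum to 1; the largest logit gets the
# largest probability, but the rest are not forced to zero as with argmax.
```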

  4. Ensemble methods and the motivation for Random Forests. Issues with a single decision tree include overfitting, sensitivity to small changes in the training data, and high variance. The basic idea of an ensemble is that combining many models and aggregating their predictions reduces overall error, provided the individual models are each better than chance and make errors independently of one another.
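The independence condition can be made concrete with an idealized back-of-the-envelope calculation: if each of $n$ models is wrong independently with probability $p$, a majority vote fails only when more than half the models err at once, which for odd $n$ is a binomial tail probability.

```python
from math import comb

def majority_error(n, p):
    # P(more than n/2 of n independent models are wrong), each wrong w.p. p.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_error(1, 0.3))   # a single model: error 0.3
print(majority_error(11, 0.3))  # 11 independent models: about 0.078
```

Real trees trained on the same data are correlated, so the improvement is smaller in practice; this is exactly why Random Forests inject randomness to decorrelate the trees.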

  5. A Random Forest introduces two sources of randomness to ensure each tree sees different parts of the data:

    • Bagging (bootstrap aggregating): Train each tree on a bootstrap sample, i.e., examples drawn from the training set with replacement, typically as many as the original set contains. Some examples appear multiple times and others not at all.
    • Feature randomness: At each split, only consider a random subset of features.
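Both sources of randomness are easy to illustrate with the standard library (a toy sketch with made-up data, not a library API; the square-root default for the feature subset size is a common convention, e.g. in scikit-learn):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

data = list(range(10))                    # indices of 10 training examples
features = ["f0", "f1", "f2", "f3"]       # 4 candidate features

# Bagging: draw a bootstrap sample with replacement, same size as the original,
# so some indices repeat and others are left out entirely.
bootstrap = [random.choice(data) for _ in range(len(data))]

# Feature randomness: at each split, consider only a random subset of features
# (here 2 of 4, matching the sqrt-of-feature-count convention).
split_candidates = random.sample(features, k=2)

print(bootstrap)
print(split_candidates)
```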
  6. Advantages and disadvantages of Random Forests:

    • Advantages: Less overfitting, reduced sensitivity to outliers/noise, no need for feature scaling, and the ability to estimate feature importance from how much each feature improves splits.
    • Disadvantages: Challenging to visualize hundreds of trees, computationally expensive, and can still overfit on small training datasets.
  7. Hyperparameters that can be tuned in a Random Forest: the number of trees, the maximum depth of each tree, how many features to consider at each split, and the minimum number of samples required to split a node.
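The four knobs above can be written out as a search grid. The parameter names below follow scikit-learn's `RandomForestClassifier` (an assumption about the library in use; only the grid bookkeeping is actually executed here):

```python
from itertools import product

grid = {
    "n_estimators": [100, 300],        # number of trees
    "max_depth": [None, 10],           # maximum depth of each tree
    "max_features": ["sqrt", 0.5],     # features considered at each split
    "min_samples_split": [2, 10],      # minimum samples needed to split a node
}

# An exhaustive grid search would fit one forest per combination.
combos = list(product(*grid.values()))
print(len(combos))  # 2 * 2 * 2 * 2 = 16 combinations
```

The multiplicative growth is why tuning usually relies on cross-validated grid or randomized search rather than trying values by hand.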