Clustering
Date
Friday, January 30, 2026
Links of interest
Notes
In class, we first reviewed Wednesday’s in-class assignment together, discussing what went well and what was challenging.
The remainder of the lecture focused on clustering, an unsupervised learning technique for grouping similar data. We discussed:
What clustering is and why we do it.
When we cluster, we group together similar data on the basis of their properties — without any pre-existing labels. Reasons to cluster include:
- We want to make comparisons but lack a priori groupings.
- We want to simplify a dataset, for example by identifying or removing outliers.
- We are intentionally looking for anomalies in our data.
Clustering as unsupervised learning.
Clustering is a form of unsupervised learning: there is no training data in the supervised sense. We apply clustering algorithms to find hidden patterns or structure within a dataset.
The many flavors of clustering.
“Clustering” does not refer to a single algorithm. Common approaches include k-means, DBSCAN, hierarchical clustering, spectral clustering, and Gaussian mixture models, among others. Each has different assumptions and strengths (see the scikit-learn overview above for a visual comparison).
How clustering differs from PCA.
Both PCA and clustering simplify data, but in different ways:
- PCA reduces the dimensionality of the data — it finds a lower-dimensional representation.
- Clustering finds homogeneous groups within the observations.
The two are complementary and it is common to apply PCA first (to reduce noise and dimensionality) and then cluster in the reduced space.
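The PCA-then-cluster workflow can be sketched with NumPy, using the SVD for the projection. The data, shapes, and seed below are illustrative assumptions, not from class:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 100 observations in 5 dimensions, where only
# the first two dimensions separate the points into two groups.
X = rng.normal(size=(100, 5))
X[:50, :2] += 4.0

# PCA via SVD: center the data, then project onto the top two
# principal components (the first two rows of Vt).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T  # shape (100, 2): the space we would cluster in
```

A clustering algorithm would then be run on `X_reduced` instead of the raw, higher-dimensional `X`.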
Distance as the basis for similarity.
Most clustering algorithms rely on some measure of distance: data that are closer together are considered more similar than data that are far apart. Common distance metrics include:
Euclidean (L2): The straight-line distance between two points. $$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$ Often used when features have similar units or scales.
Manhattan (L1): The sum of absolute differences. $$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$
Geodesic: The shortest path between two points on the surface of a sphere. Critical for geographic applications (GPS, geodesy).
Correlation-based: Using Pearson or Spearman rank correlation as a similarity measure.
Covariance-based: Accounts for spatial covariance (e.g., from variogram models). Common in geostatistics.
Cosine similarity: The cosine of the angle between two data vectors. Frequently used for text or image data.
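Several of these metrics are easy to compute directly with NumPy. The two vectors below are made-up examples (note that `y` is exactly `2 * x`, so the correlation- and angle-based measures come out to 1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Euclidean (L2): straight-line distance.
euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(14)

# Manhattan (L1): sum of absolute differences.
manhattan = np.sum(np.abs(x - y))  # 6.0

# Cosine similarity: cosine of the angle between the vectors.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # 1.0

# Pearson correlation used as a similarity measure.
pearson = np.corrcoef(x, y)[0, 1]  # 1.0, since y = 2x
```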
K-means clustering.
K-means is the most widely used clustering algorithm. It partitions data into $k$ clusters by alternating between an assignment step and an update step:
1. Set initial values for the centroid (mean) of each cluster.
2. Compute the distance from each observation to each centroid.
3. Assign each observation to the nearest centroid.
4. Recompute each centroid as the mean of all points assigned to it.
5. Repeat steps 2–4 until assignments stop changing (or a stopping criterion is met).
K-means uses Euclidean distance and minimizes inertia — the total within-cluster sum of squared distances to each centroid.
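The steps above can be written out directly in NumPy. This is a minimal sketch for illustration, not a production implementation, and the toy blob data and seeds are assumptions:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following the steps above (a sketch only)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: squared Euclidean distance to each centroid,
        # then assign each observation to the nearest one.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster goes empty).
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids (hence assignments) settle.
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignments and inertia (within-cluster sum of squares).
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    inertia = d.min(axis=1).sum()
    return labels, centroids, inertia

# Two well-separated toy blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids, inertia = kmeans(X, k=2)
```

Because the blobs are far apart relative to their spread, the two recovered clusters should coincide with the two blobs.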
Practical considerations with k-means.
Random initialization: Because initial centroids are chosen randomly, k-means can give different results on different runs. It is common to run the algorithm multiple times and keep the best result.
Choosing $k$ — the elbow method: A key challenge is deciding how many clusters to use. The elbow method involves running k-means for an increasing number of clusters and plotting the inertia (the within-cluster sum of squared distances) as a function of $k$. The “elbow” in the curve — where adding more clusters yields diminishing returns — suggests a good value of $k$.
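The elbow computation can be sketched in NumPy. The helper below is a self-contained, minimal k-means for illustration (the blob data, seeds, and restart count are assumptions); following the advice on random initialization, it keeps the best of several random restarts for each $k$:

```python
import numpy as np

def inertia_for_k(X, k, n_restarts=10, n_iter=50, seed=0):
    """Best (lowest) inertia over several random initializations.
    A minimal sketch for illustration, not a production routine."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_restarts):
        # Initialize centroids as k randomly chosen data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each point to its nearest centroid.
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Update centroids (keep the old one if a cluster empties).
            centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d.min(axis=1).sum())
    return best

# Three well-separated blobs: inertia should drop sharply up to
# k = 3 and flatten afterwards; that bend is the "elbow".
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)])
inertias = [inertia_for_k(X, k) for k in range(1, 7)]
```

Plotting `inertias` against `range(1, 7)` would show the elbow at $k = 3$ for this data.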