Dimensional Reduction

Date

Monday, January 26, 2026

Notes

In class, we:

  1. Reviewed your first team presentations. Some notes I provided:
    • Data characterizations should include both detailed metadata and visualizations, ideally ones that you make.
    • Skip intricate slide backgrounds (including photographs). Solid colors are best.
    • If you show something on a slide, you must explain it!
  2. Formally defined dimensions as “the number of parameters (variables) in a dataset”.
  3. Explored why we might want to reduce the number of dimensions in a dataset, including:
    • Improve the ability to visualize data.
    • Remove parameters that are either highly correlated with other parameters or contribute very little to the variance in your dataset.
    • Reduce computational burden (i.e., reducing the number of parameters lessens the number of calculations, amount of memory, etc. required for processing).
    • Contend with the curse of dimensionality (this week’s reading assignment!).
  4. Established two ways to reduce data dimensions:
    1. Feature selection, such as by filtering
    2. Feature extraction.
  5. Defined variance, covariance, and correlation, which are all important to understand how our data are spread and related.
  6. Learned about Principal Component Analysis (PCA), which is a feature extraction method, and applied it to a published dataset.

Below are some details about describing data and implementing PCA:

Variance

We might begin by asking a simple question: how dispersed are our data? A common answer is the sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

where $s^2$ is the sample variance, $n$ is the sample size, $x_i$ is each observation, and $\bar{x}$ is the sample mean.

By itself, variance does not tell us whether a feature is valuable (except when $s^2 = 0$, in which case the feature carries no information at all).
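As a quick sketch in NumPy (the observations here are made up for illustration), the sample variance can be computed directly from the formula above or with the built-in `ddof=1` option:

```python
import numpy as np

# Hypothetical observations (illustrative only)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample variance with the n-1 denominator, matching the formula above
s2_manual = np.sum((x - x.mean()) ** 2) / (len(x) - 1)

# ddof=1 tells NumPy to divide by n-1 instead of n
s2_numpy = x.var(ddof=1)
```

Note that NumPy's default (`ddof=0`) is the *population* variance, which divides by $n$ rather than $n-1$.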

Covariance

We might be interested in how one variable is related to another. In such a case, we can calculate covariance:

$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

where $x_i, y_i$ are paired observations and $\bar{x}, \bar{y}$ are the sample means.

Note that the units of covariance are the product of the two variables’ units and that there is no scaling (so $s_{xy}$ can range between $-\infty$ and $\infty$), which makes covariances hard to compare across datasets.
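A minimal sketch of the covariance formula above (again with made-up paired observations):

```python
import numpy as np

# Hypothetical paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance, matching the formula above
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the full 2x2 covariance matrix; the [0, 1] entry is Cov(x, y)
s_xy_np = np.cov(x, y)[0, 1]
```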

Correlation

Suppose we do want to compare the strength of association across datasets (or across variables with different units). One way to do so is to normalize covariance into a correlation coefficient (here, Pearson’s):

$$r = \frac{\text{Cov}(X, Y)}{s_x \cdot s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $s_x, s_y$ are sample standard deviations.

Coefficients range between -1 and 1.
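Continuing the covariance sketch above (same made-up data), Pearson’s $r$ is just the covariance scaled by the two standard deviations:

```python
import numpy as np

# Hypothetical paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r: covariance divided by the product of standard deviations
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# np.corrcoef returns the correlation matrix; [0, 1] is r
r_np = np.corrcoef(x, y)[0, 1]
```

Because the $n-1$ factors cancel in the ratio, $r$ is the same whether you use sample or population formulas throughout.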

PCA: A formal definition

PCA uses an orthogonal linear transformation to transform a set of observations into a new coordinate system.

The new axes are the principal components. The first principal component (PC1) accounts for the largest amount of variance in the dataset.

PCA effectively reduces data dimensionality while maximizing variance.

PCA: Implementation I

To undertake PCA, we must:

  1. Center and standardize the data. Center by subtracting each feature’s mean; standardize by dividing by each feature’s standard deviation.

  2. Calculate the covariance matrix of the standardized data. By definition, the covariance matrix is symmetric and positive semi-definite, and thus can be diagonalized.

  3. Calculate the Singular Value Decomposition (SVD) of the covariance matrix $\mathbf{C}$.

PCA: Implementation II

As the name implies, SVD decomposes the covariance matrix $\mathbf{C}$ into 3 terms:

$$\mathbf{C} = \mathbf{U} \Sigma \mathbf{V}^T,$$

where the rows of $\mathbf{V}^T$ are the eigenvectors of $\mathbf{C}$, or principal components, and the diagonal of $\Sigma$ holds the corresponding eigenvalues (for a symmetric positive semi-definite matrix, the singular values equal the eigenvalues).

Principal components are normalized (unit-length) eigenvectors, and the transformed data are centered around zero.

PC1’s eigenvector has the largest eigenvalue; it points in the direction of greatest variance in the data.
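Putting the implementation steps together, here is a minimal PCA sketch in NumPy. The data matrix is made up for illustration (it is not the published dataset we used in class):

```python
import numpy as np

# Hypothetical data: 6 samples x 3 features (illustrative only)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 0.4],
              [2.3, 2.7, 0.7]])

# 1. Center (subtract the mean) and standardize (divide by the std)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix of the standardized data (rows = samples)
C = np.cov(Z, rowvar=False)

# 3. SVD of the symmetric covariance matrix:
#    rows of Vt are the principal components (eigenvectors);
#    S holds the eigenvalues, sorted from largest to smallest
U, S, Vt = np.linalg.svd(C)

# Project the data onto the new axes; PC1 captures the most variance
scores = Z @ Vt.T
explained_ratio = S / S.sum()
```

Because the data were standardized, the diagonal of $\mathbf{C}$ is all ones, and the variance of each column of `scores` equals the corresponding eigenvalue in `S`.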