Filtering and resampling

Date

Wednesday, January 21, 2026

Links of interest

A visual exploration of randomization/permutation.

Notes

Among other points, we discussed:

That filtering a dataset means refining data to:
- Remove errors.
- Ensure that features¹ are relevant to your question.
- Reduce noise and/or outliers.
- Improve analytical efficiency.
The fact that filtering inevitably modifies the raw dataset. As such, it is important to either filter in code or store filtered data separately from raw data.
The importance of documenting all filtering operations (ideally, in a README.md).
Various methods for filtering, including removing data by attribute value, detrending, smoothing, and isolating data by frequency.
Resampling, which refers to the process of generating one or more new data points from some sample².
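Two of the filtering methods listed above (removing data by attribute value and smoothing) can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the records, the `temp_c` field, the `-999.0` missing-value sentinel, the plausible range, and both helper functions are made up for the example. Note that `raw` is never modified; filtering happens in code and produces new objects.

```python
from statistics import mean

# Hypothetical raw records; -999.0 is an assumed sentinel for a missing reading.
raw = [
    {"site": "A", "temp_c": 14.2},
    {"site": "A", "temp_c": 14.8},
    {"site": "B", "temp_c": -999.0},
    {"site": "A", "temp_c": 15.1},
    {"site": "A", "temp_c": 21.9},
    {"site": "A", "temp_c": 15.4},
]


def filter_by_value(records, key, low, high):
    """Return a new list holding only records whose `key` lies in [low, high]."""
    return [r for r in records if low <= r[key] <= high]


def moving_average(values, window=3):
    """Smooth a sequence with a simple (unweighted) moving average."""
    return [
        mean(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


# Filter in code: `raw` stays untouched, `filtered` is a separate object.
filtered = filter_by_value(raw, "temp_c", low=-50.0, high=60.0)
smoothed = moving_average([r["temp_c"] for r in filtered], window=3)
```

The range and window size are themselves filtering decisions, so they are the kind of detail worth recording in the README.md.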

Below are formal definitions of several different resampling methods:

Randomization/permutation

Given two populations, $A$ and $B$, and a statistic, $x$, how might we determine if $x_A$ differs from $x_B$?

Well, we can pool the observations, randomly reassign each one to either $A$ or $B$ (keeping the original group sizes), calculate $x_A - x_B$, and repeat to build a distribution of differences.

We then ask: what proportion of reassignments give a difference as large as (or larger than) our sample’s $x_A - x_B$? The answer to this question is a p-value.

If you use every possible rearrangement of the data, rather than a random subset of them, then you are permutation testing.
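The randomization procedure above can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions: the function name `randomization_test`, the example values, the choice of the mean as the statistic, and the two-sided comparison (using absolute differences) are all choices made for the example. It draws random reassignments rather than enumerating every one.

```python
import random
from statistics import mean


def randomization_test(a, b, statistic=mean, n_resamples=5000, seed=0):
    """Estimate a two-sided p-value for statistic(a) - statistic(b)."""
    rng = random.Random(seed)
    observed = statistic(a) - statistic(b)
    pooled = list(a) + list(b)
    n_a = len(a)

    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign observations to the two groups,
        # keeping the original group sizes.
        rng.shuffle(pooled)
        diff = statistic(pooled[:n_a]) - statistic(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Proportion of reassignments at least as extreme as the observed one.
    return observed, extreme / n_resamples


# Made-up measurements for two groups.
group_a = [12.1, 11.8, 12.6, 12.9, 12.4, 13.0]
group_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]

observed, p_value = randomization_test(group_a, group_b)
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```

For small samples, one could instead enumerate every split with `itertools.combinations`, which would correspond to the exhaustive case described above.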

Bootstrap

Let us assume that you have some data, $D$, that describe some true population well (which is not always true!).

Let us also assume that you’ve calculated some parameter, $x$, using $D$. You are interested in how $x$ might vary.

In such a case, you might decide to repeatedly generate new samples, each time drawing from $D$ independently and with replacement. For each new sample, you could calculate a new value, $x_s$, for the parameter you are interested in. The distribution of $x_s$ can provide additional information about $x$.

This process is known as bootstrapping.
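As a rough sketch of the idea, the snippet below resamples a small dataset with replacement and summarizes the resulting distribution of $x_s$. The data values, the function name `bootstrap`, the use of the mean as the parameter, and the simple percentile interval are assumptions made for illustration; each resample is assumed to have the same size as $D$, which is the common convention.

```python
import random
from statistics import mean, stdev


def bootstrap(data, statistic=mean, n_resamples=2000, seed=0):
    """Return the statistic computed on each bootstrap resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices draws independently and with replacement.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


# Made-up data standing in for D.
D = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 4.4]

x = mean(D)          # the parameter calculated from the original data
x_s = bootstrap(D)   # one value per bootstrap resample

# The spread of x_s approximates the standard error of x.
standard_error = stdev(x_s)

# A simple percentile interval (roughly the central 95% of x_s).
ordered = sorted(x_s)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]

print(f"x = {x:.2f}, SE = {standard_error:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```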

Jackknife

Given a dataset, $D$, and a target statistic/parameter, $x$, you can systematically leave out one observation from $D$ at a time and, each time, calculate a pseudovalue, $k$, of the target of interest.

As before, you can use the collection of $k$ values to better estimate $x$.

This so-called jackknife estimate predates many other resampling techniques.
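A minimal sketch of the leave-one-out procedure follows. The function name `jackknife` and the data are made up for the example. The code assumes the common convention in which each pseudovalue combines the full-data estimate with the leave-one-out estimate as $n x - (n - 1) x_{(-i)}$; if the notes intend $k$ to be the leave-one-out estimates themselves, those are returned as well.

```python
from math import sqrt
from statistics import mean, variance


def jackknife(data, statistic=mean):
    """Return the jackknife estimate, its standard error, and leave-one-out values."""
    data = list(data)
    n = len(data)
    full = statistic(data)

    # Recompute the statistic n times, each time omitting one observation.
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]

    # Pseudovalues under the assumed convention: n * full - (n - 1) * leave-one-out.
    pseudovalues = [n * full - (n - 1) * loo for loo in leave_one_out]

    estimate = mean(pseudovalues)
    std_error = sqrt(variance(pseudovalues) / n)
    return estimate, std_error, leave_one_out


# Made-up data standing in for D.
D = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9, 6.2, 4.4]

estimate, std_error, leave_one_out = jackknife(D)
print(f"jackknife estimate = {estimate:.3f}, SE = {std_error:.3f}")
```

When the statistic is the mean, the pseudovalues reduce to the original observations, so the jackknife estimate and standard error match the usual sample mean and its standard error. That makes the mean a convenient sanity check before trying other statistics.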

A note on Monte Carlo methods

We have been taking draws from samples of populations, where the true distribution of data is unknown.

But what if we have knowledge about some underlying distributions or a model in mind?

Monte Carlo methods are one class of resampling techniques. When implementing a Monte Carlo approach, you take draws from predefined distributions in order to better answer some question.

These draws can be independent or not (e.g., successive draws from a Markov chain).

Monte Carlo approaches have many applications, from approximations of complex functions to error propagation.
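To make the error-propagation use case concrete, here is a small sketch. The scenario is hypothetical: a rectangle whose length and width are each measured with normally distributed error, with the means and standard deviations below invented for the example. Drawing from those predefined distributions and computing the area each time gives an approximate distribution for the area.

```python
import random
from statistics import mean, stdev


def propagate_area_error(n_draws=20000, seed=0):
    """Monte Carlo estimate of the mean and spread of area = length * width."""
    rng = random.Random(seed)
    areas = []
    for _ in range(n_draws):
        # Independent draws from assumed measurement-error distributions.
        length = rng.gauss(10.0, 0.2)  # mean 10.0, standard deviation 0.2
        width = rng.gauss(5.0, 0.1)    # mean 5.0, standard deviation 0.1
        areas.append(length * width)
    return mean(areas), stdev(areas)


area_mean, area_sd = propagate_area_error()
print(f"area = {area_mean:.2f} +/- {area_sd:.2f}")
```

For this simple product, the result can be compared against the first-order analytical approximation $\sqrt{(5 \times 0.2)^2 + (10 \times 0.1)^2} \approx 1.41$. The Monte Carlo approach becomes more useful when no such formula is readily available.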

¹ Characteristics or attributes of data.
² Here, sample refers to a subset (i.e., statistical sample) of some population.