In the last blog post, I described two rules, the empirical rule and the IQR rule, that can be used to calculate outliers. However, those rules are only valid for normally distributed data. Spend data is rarely normally distributed and is typically right-skewed: it has many small transactions and few large ones, as in figure (1) below. In those cases, one option is to calculate outliers using the bootstrapping method. We will be using the data from figure (1) for the rest of the article, so let’s call it “transactions”.

The goal of this post is to find a way to calculate outliers on this transaction data using bootstrapping, even though the data is not normally distributed. One alternative would be to find out exactly which distribution the data follows and use that distribution’s percentile function (its inverse CDF) to calculate percentiles.

This happens to be a beta distribution, and I know this because I generated the data myself. However, we won’t often know what distribution our transaction data will follow, and to find out we would have to do a series of hypothesis tests.
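As a rough illustration of what right-skewed, beta-distributed spend data could look like, here is a minimal sketch. The shape parameters, the dollar scale, and the sample size are assumptions chosen for illustration; they are not the values used to produce figure (1).

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed fixed only so the sketch is repeatable

# Hypothetical shape parameters: a < b gives a right-skewed beta distribution.
a, b = 1.0, 8.0
max_amount = 1_000_000  # hypothetical dollar scale

# Beta draws lie in [0, 1]; scale them up to dollar amounts.
transactions = rng.beta(a, b, size=10_000) * max_amount

# In right-skewed data the mean sits above the median.
print(f"mean:   {np.mean(transactions):,.2f}")
print(f"median: {np.median(transactions):,.2f}")
```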

We don’t want to go through that trouble, so we will instead use a method that doesn’t rely on knowing which specific distribution the spend data follows. That method is called bootstrapping.

1  Some Definitions

To begin, there are a few terms we should know: statistic, sampling without replacement, sampling with replacement, and of course bootstrapping.

1.1  Statistic

A “statistic” is a characteristic of a sample, for example the mean, median, standard deviation, or nth percentile.
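To make the definition concrete, the sketch below computes a few statistics with NumPy on a small, made-up list of transaction amounts. The list and the choice of the 90th percentile are only for illustration.

```python
import numpy as np

# Small illustrative sample of transaction amounts (made up for this example).
amounts = [10, 10, 10, 10, 15, 15, 40, 100]

mean = np.mean(amounts)          # 26.25
median = np.median(amounts)      # 12.5
std = np.std(amounts)            # population standard deviation (ddof=0)
p90 = np.percentile(amounts, 90) # about 58, interpolated between 40 and 100

print(mean, median, std, p90)
```

Each of these numbers summarizes the sample in a different way, and each one would change if we drew a different sample.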

1.2  Sampling without Replacement

If we have a “population” of transaction amounts called “amounts” as follows:

amounts = [10, 10, 10, 10, 15, 15, 40, 100]

Sampling without replacement means that after an amount has been picked randomly, it is no longer available to be sampled again. Below is an example of what happens during sampling without replacement to the list we are sampling from (labeled transactions in the steps, starting as a copy of amounts) and to the samples list.

Step 1:

transactions = [10, 10, 10, 10, 15, 15, 40, 100]
samples = [ ]

Step 2:

transactions = [10, 10, 10, 10, 15, 40, 100]
samples = [15]

Step 3:

transactions = [10, 10, 10, 15, 40, 100]
samples = [10, 15]

. . .

Step 8:

transactions = [ ]
samples = [10, 10, 10, 10, 15, 15, 40, 100]

At step 8, we see that we have just transferred the original data from transactions to samples. This is because the observations don’t get replaced after sampling.
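The steps above can be sketched with NumPy by passing replace=False to the random generator’s choice method. The seed and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed is arbitrary
amounts = [10, 10, 10, 10, 15, 15, 40, 100]

# replace=False: each element can be drawn at most once.
samples = rng.choice(amounts, size=len(amounts), replace=False)

# Drawing all n elements without replacement just reorders the data.
print(sorted(samples.tolist()) == sorted(amounts))  # True
```

Because every element is used exactly once, this tells us nothing new about the data, which is why the bootstrap does not sample this way.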

1.3  Sampling with Replacement

Sampling with replacement means the list we are sampling from remains unchanged after the sample amount has been taken, and therefore a single element could be randomly chosen more than once. An example of sampling with replacement is shown below.

Step 1:

transactions = [10, 10, 10, 10, 15, 15, 40, 100]
samples = [ ]

Step 2:

transactions = [10, 10, 10, 10, 15, 15, 40, 100]
samples = [15]

Step 3:

transactions = [10, 10, 10, 10, 15, 15, 40, 100]
samples = [10, 15]

.
.
.

Step 8:

transactions = [10, 10, 10, 10, 15, 15, 40, 100]
samples = [10, 10, 10, 10, 10, 15, 15, 100]

At step 8, we see that the transactions list is still full because we replaced each observation after sampling it. It is also important to note that the samples list is not identical to transactions: we have created a new sample from transactions without actually collecting more data.
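The same idea can be sketched in NumPy by passing replace=True. The seed and variable names are assumptions; the exact values drawn depend on the seed, so none are shown here.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed is arbitrary
amounts = [10, 10, 10, 10, 15, 15, 40, 100]

# replace=True: an element stays available after it has been drawn,
# so the same amount can appear more often (or less often) than in amounts.
samples = rng.choice(amounts, size=len(amounts), replace=True)

print(samples)
print(len(samples) == len(amounts))  # True: same size as the original
```

The resample has the same length as the original list and contains only values that occur in it, but the counts of each value will usually differ.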

1.4  Bootstrapping

Bootstrapping is any method that relies on random sampling with replacement. We will use sampling with replacement repeatedly to generate many samples of transaction data and build a confidence interval around a sample statistic (e.g. the 99th percentile).

Going back to our transaction data from figure (1), we can calculate the 99th percentile, which we use as the outlier threshold, without bootstrapping, as shown below. However, we have no idea whether this is the true 99th percentile, since these transaction amounts are just a sample of possible transactions. We also have no idea what a reasonable range for this threshold would be.

import numpy as np

np.percentile(transactions, 99)

2  Applying the Bootstrap

Now for the bootstrap! First we will take n random samples with replacement from transactions, where n is the number of elements in transactions. This creates a slightly different reconstruction of the original data. We can see the original dataset in figure (2) and the recreation from random sampling with replacement in figure (3).

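def bootstrap_sample(amounts):
    """Resample amounts with replacement, keeping the original size."""
    return np.random.choice(amounts, len(amounts), replace=True)

def percentile_99(sample):
    """Return the 99th percentile of a sample."""
    return np.percentile(sample, 99)

def bootstrap_confidence_interval(data, n_replicates=10000):
    """
    Creates an array of n_replicates 99th percentile bootstrap replicates.
    """
    bs_samples = np.empty(n_replicates)

    # Each replicate is the 99th percentile of one bootstrap sample.
    for i in range(n_replicates):
        bs_samples[i] = percentile_99(bootstrap_sample(data))

    return bs_samples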

transactions_ci = bootstrap_confidence_interval(transactions)

We can find the 99th percentile of the bootstrap sample, which in this case is $578,290.19. Notice that it is different from the 99th percentile of the original data. Now, to build the confidence interval, we must take many bootstrap samples and find their 99th percentiles. Figure (5) shows the distribution of 99th percentiles obtained from the bootstrap samples.

Using the distribution from figure (5), we can find the 95% confidence interval by taking the 2.5th and 97.5th percentiles. We are therefore 95% confident that the true value of the 99th percentile of transactions is between $551,831.94 and $627,067.18.
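The interval calculation described above can be sketched end to end as follows. This is a self-contained illustration under assumptions: the data is a synthetic stand-in (hypothetical beta parameters and dollar scale), the names replicates, ci_low, and ci_high are made up for the example, and fewer replicates are used than in the article to keep it quick. The numbers it prints will not match the dollar figures quoted in this post.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is arbitrary

# Synthetic stand-in for the article's data (hypothetical parameters).
transactions = rng.beta(1.0, 8.0, size=2_000) * 1_000_000

# Bootstrap replicates of the 99th percentile.
n_replicates = 2_000
replicates = np.empty(n_replicates)
for i in range(n_replicates):
    resample = rng.choice(transactions, size=transactions.size, replace=True)
    replicates[i] = np.percentile(resample, 99)

# Percentile interval: the middle 95% of the bootstrap replicates.
ci_low, ci_high = np.percentile(replicates, [2.5, 97.5])
print(f"95% CI for the 99th percentile: ${ci_low:,.2f} to ${ci_high:,.2f}")
```

This is the simple percentile interval; other bootstrap intervals exist, but this one matches the 2.5th and 97.5th percentile description above.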

To actually label transactions as outliers, we need to pick a specific dollar amount, whereas above we just defined a confidence interval. We could pick the 95th percentile of the distribution from figure (5). This is a conservative choice: because the threshold sits near the upper end of the plausible values for the 99th percentile, we can be confident that the transactions we flag really are in the 99th percentile and not below it.

np.percentile(transactions_ci, 95)

In this case, we are confident that amounts greater than $623,974.74 are in the 99th percentile and we consider them outliers.
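Once a threshold is chosen, flagging outliers is a simple comparison. The sketch below uses the threshold quoted above with a handful of made-up amounts; the helper name label_outliers is hypothetical.

```python
import numpy as np

def label_outliers(amounts, threshold):
    """Return a boolean mask that is True where an amount exceeds threshold."""
    return np.asarray(amounts) > threshold

# Made-up amounts for illustration; the threshold is the value quoted above.
amounts = np.array([1_200.50, 48_000.00, 623_974.74, 650_000.00, 910_345.10])
threshold = 623_974.74

mask = label_outliers(amounts, threshold)
outliers = amounts[mask]
print(outliers)  # only the two amounts strictly above the threshold
```

The comparison is strict, so an amount exactly equal to the threshold is not flagged. Whether to use > or >= is a design choice the post does not specify.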

4  Conclusion

In this post, we have learned some basic terms used in statistics: statistic, sampling without replacement, sampling with replacement, and bootstrapping. We applied the bootstrap method to a dataset representing transaction amounts and found the confidence interval for the 99th percentile sample statistic. We did this because we’ve defined an outlier as an amount above the 99th percentile, and we want to be sure that a flagged amount really is in the 99th percentile.

We then found the 95th percentile of the distribution of sample statistics to use as our threshold for deciding which transactions should be considered outliers. This gives us confidence in our threshold because it sits near the upper end of the bootstrap distribution of the 99th percentile, so only about 5% of the bootstrap replicates exceed it.

Hopefully this has demonstrated the power of bootstrapping, which can be used to calculate outliers on non-normal distributions and gives us confidence in our threshold. In part 3 of Calculating Outliers in Spend Data, we will bring everything together and apply this to some examples.