
COPOD Outlier

The COPOD algorithm, or Copula-Based Outlier Detection, is a method for finding the outliers of a data set. Outliers are data points that deviate from the norm of the data and can distort results. A copula describes the dependence structure between random variables and lets you build a joint cumulative distribution function from them; its defining property is that the marginal distribution of each variable is uniform on [0, 1].
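One standard way to state this relationship (Sklar's theorem) is that any joint CDF can be written as a copula applied to its marginals:

```latex
% Sklar's theorem: a joint CDF F factors into its marginals F_1..F_d
% and a copula C, where C itself has uniform [0,1] marginals.
F(x_1, \dots, x_d) = C\big(F_1(x_1), \dots, F_d(x_d)\big)
```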

 

Motivation

There are several reasons to use this algorithm. Outliers affect the results of statistical analysis, so it becomes important to identify and remove these data points in order to preserve the quality of a data science model. More generally, outlier detection algorithms have many applications, such as credit card fraud detection and synthetic data generation. Depending on the application, these algorithms need high detection performance, fast execution, and good interpretability. COPOD is a relatively novel method that requires no parameter tuning, performs among the top outlier detection algorithms, and remains efficient on high-dimensional data sets.

Algorithm

Copulas enable us to separate the marginal distributions from the dependency structure of a given multivariate distribution. The image below describes the COPOD algorithm. The process consists of 3 main steps that together determine the overall anomaly score for the data. The first step is applied to every dimension of the data set and has 3 parts: fit the left tail empirical cumulative distribution function, fit the right tail empirical cumulative distribution function, and compute the skewness coefficient.
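As a rough illustration of step 1 (a sketch, not the paper's reference code), the per-dimension tail ECDFs and skewness can be computed with NumPy and SciPy; the function name fit_tail_ecdfs is ours:

```python
import numpy as np
from scipy.stats import rankdata, skew

def fit_tail_ecdfs(X):
    """Step 1 (sketch): per-dimension tail ECDF values and skewness.

    X is an (n_samples, n_features) array. Returns the left-tail ECDF
    evaluated at every sample, the right-tail ECDF (the ECDF of -X),
    and the skewness coefficient of each dimension.
    """
    n = X.shape[0]
    # rankdata(..., method="max") counts how many values are <= each point,
    # so dividing by n evaluates the empirical CDF at the sample itself.
    u_left = rankdata(X, method="max", axis=0) / n
    u_right = rankdata(-X, method="max", axis=0) / n
    b = skew(X, axis=0)  # skewness coefficient of each dimension
    return u_left, u_right, b
```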

The next step uses the results of step 1 to compute the empirical copula values for the left and right tails, and then the skewness-corrected empirical copula values, for each row. After this, you have 3 new values per scalar value in the data frame.
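Continuing the sketch above, step 2 simply selects, per dimension, whichever tail the skewness points to (left tail when the skewness is negative, right tail otherwise); this helper assumes the arrays returned by the hypothetical fit_tail_ecdfs:

```python
import numpy as np

def skewness_corrected_copula(u_left, u_right, b):
    """Step 2 (sketch): build the skewness-corrected empirical copula values.

    For a dimension with negative skewness the left-tail value is kept;
    otherwise the right-tail value is kept. `b` broadcasts across rows.
    """
    return np.where(b < 0, u_left, u_right)
```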

Lastly, you compute the anomaly score for each row so the rows can be ranked. Because the score uses negative logarithms, larger tail probabilities produce smaller outlier scores. The anomaly score is found by summing the negative logs of the left tail, right tail, and skewness-corrected empirical copula values for each row and then taking the maximum of the three sums.
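A minimal sketch of step 3, assuming the three arrays of copula values from the previous steps, follows that description directly: sum the negative logs per row for each tail and take the row-wise maximum.

```python
import numpy as np

def copod_scores(u_left, u_right, u_skew):
    """Step 3 (sketch): combine tail probabilities into one anomaly score.

    Small tail probabilities give large negative-log terms, so points that
    are extreme in either tail receive a high score.
    """
    p_left = -np.log(u_left).sum(axis=1)    # left-tail sum of negative logs
    p_right = -np.log(u_right).sum(axis=1)  # right-tail sum
    p_skew = -np.log(u_skew).sum(axis=1)    # skewness-corrected sum
    return np.maximum(p_left, np.maximum(p_right, p_skew))
```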

We can go one step further by setting a threshold that defines the cutoff between a normal data point and an outlier. However, for our application, we did not go into this area.
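If you did want a cutoff, one simple and entirely illustrative choice is a high quantile of the scores; the 99th-percentile default below is an arbitrary assumption, not something we tuned.

```python
import numpy as np

def flag_outliers(scores, quantile=0.99):
    """Flag rows whose anomaly score exceeds a chosen quantile cutoff."""
    threshold = np.quantile(scores, quantile)
    return scores > threshold
```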

[Image: copod.png — overview of the three steps of the COPOD algorithm]

At the base level, the COPOD algorithm attempts to model the underlying probability distribution of a dataset by approximating its cumulative distribution function. Using this approximation, we assign a score to each data point that indicates how close it is to being an outlier; a higher score means the point is more likely to be an outlier relative to the rest of the dataset.
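For reference, the PyOD library provides an off-the-shelf COPOD implementation (not necessarily the exact code we ran). A minimal sketch, assuming PyOD is installed and using made-up toy data:

```python
import numpy as np
from pyod.models.copod import COPOD

# Toy data: 200 inlier rows plus 5 rows that sit far from the bulk.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 5)),
               rng.normal(8, 1, size=(5, 5))])

clf = COPOD()
clf.fit(X)
scores = clf.decision_scores_      # higher score = more outlier-like
print(np.argsort(scores)[-5:])     # the injected rows rank at the top
```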

The important thing to note, however, is that the criteria we score the words on can change to fit the application. For the purposes of Wordle, we used the same criterion as the other machine learning algorithms: rank the words with favorable letter composition, meaning words whose letters occur more frequently overall and appear in their most frequent positions.
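As a hypothetical sketch of that criterion (the function name and weighting are ours for illustration, not the exact scoring we used), one can score each five-letter word by how often its letters appear in their positions across the word list:

```python
from collections import Counter

def positional_letter_score(words):
    """Score each 5-letter word by the positional frequency of its letters."""
    # How often each letter appears at each of the five positions.
    pos_counts = [Counter(w[i] for w in words) for i in range(5)]
    total = len(words)
    return {w: sum(pos_counts[i][w[i]] / total for i in range(5))
            for w in words}

# Example: positional_letter_score(["crane", "slate", "adieu"])
```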

Information Sources: 

Li, Zheng, et al. "COPOD: Copula-Based Outlier Detection." arXiv, 2020, https://arxiv.org/pdf/2009.09463.pdf.

Hoffman, Harrison. “Finding the Best Wordle Opener with Machine Learning.” Medium, Towards Data Science, 10 Feb. 2022, https://towardsdatascience.com/finding-the-best-wordle-opener-with-machine-learning-ce81331c5759.

Lessard, Laurent. "Solving Wordle." Laurent Lessard, 3 Feb. 2022, https://laurentlessard.com/solving-wordle/.
