Methods
- Model-based
- Graphical
- Cluster-Based
- Distance-Based
- Supervised
Model-Based Outlier Detection
We can fit a probabilistic model to the data; outliers are points with low probability under the model.
Assume the data follows a normal distribution and use the z-score:
$z_i = \frac{x_i - \mu}{\sigma}$, where $\mu = \frac{1}{n}\sum_{i=1}^n x_i$ and $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2}$
However, the mean and variance are themselves sensitive to outliers, and the data may not be concentrated around the mean.
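A minimal sketch of the z-score approach, assuming NumPy and synthetic data; the $|z| > 3$ cutoff is a common but arbitrary threshold, not something fixed by these notes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), [8.0]])  # 100 regular points plus one injected outlier

mu = x.mean()
sigma = x.std()            # note: both statistics are themselves pulled toward the outlier
z = (x - mu) / sigma

outliers = np.where(np.abs(z) > 3)[0]   # rule of thumb: flag |z| > 3
print(outliers)                          # the injected point (index 100) is flagged
```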
Graphical Outlier Detection
Plot the data (e.g., a box plot or scatterplot) and identify outliers by human inspection.
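A minimal sketch with matplotlib on synthetic data; a box plot is one common choice, and points beyond the whiskers stand out visually:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), [8.0]])  # one point far from the rest

plt.boxplot(x)   # the extreme point appears beyond the whiskers
plt.show()
```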
Cluster-Based Outlier Detection
- Cluster the data
- Find points that don't belong to any cluster
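The notes don't fix a particular clustering algorithm; a minimal sketch using scikit-learn's DBSCAN, which labels points that don't fit any cluster as noise (label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # cluster 1
               rng.normal(5, 0.3, size=(50, 2)),   # cluster 2
               [[2.5, 10.0]]])                      # point far from both clusters

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
outliers = np.where(labels == -1)[0]   # DBSCAN marks noise points with label -1
print(outliers)                         # the far-away point (index 100) is noise
```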
Distance-Based Outlier Detection
Skip the model/plot/cluster step and just measure distances between points.
Global Distance-Based Outlier Detection - KNN
- For each point, compute the average distance to its KNNs
- Choose the points with the largest values (outliers are far away from their KNNs)
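A minimal sketch with scikit-learn's NearestNeighbors on synthetic data; $k = 5$ and the number of flagged points are arbitrary choices here:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[10.0, 10.0]]])

k = 5
# k+1 neighbours because each point's nearest neighbour is itself
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
avg_knn_dist = dist[:, 1:].mean(axis=1)         # drop the self-distance column

outliers = np.argsort(avg_knn_dist)[::-1][:3]   # points with the largest average KNN distance
print(outliers)                                  # the point at (10, 10) ranks first
```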
Local Distance-Based Outlier Detection
Outlierness ratio of $x_i$ = $\frac{\text{average distance of } x_i \text{ to its KNNs}}{\text{average distance of the neighbours of } x_i \text{ to their KNNs}}$
If the outlierness ratio is > 1, $x_i$ is farther from its neighbours than expected
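A minimal sketch of this ratio with scikit-learn's NearestNeighbors on synthetic data (a dense cluster, a sparse cluster, and one point sitting just outside the dense cluster):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(50, 2)),   # dense cluster
               rng.normal(10, 2.0, size=(50, 2)),  # sparse cluster
               [[0.5, 0.5]]])                       # just outside the dense cluster

k = 5
dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
avg_knn_dist = dist[:, 1:].mean(axis=1)            # average distance of each point to its KNNs

# ratio of a point's average KNN distance to the average of its neighbours' average KNN distances
ratio = avg_knn_dist / avg_knn_dist[idx[:, 1:]].mean(axis=1)
print(ratio[100], ratio[:100].max())   # the added point stands out locally even though
                                        # its absolute distances are modest
```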
Supervised Outlier Detection
- $y_i = 1$ if $x_i$ is an outlier
- $y_i = 0$ if $x_i$ is a regular point
But we need to know what outliers look like, and we may not detect new types of outliers.
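A minimal sketch treating this as binary classification with scikit-learn; the classifier and the synthetic data are illustrative choices, not part of the notes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # regular points
               rng.normal(6, 1, size=(10, 2))])    # known outliers
y = np.concatenate([np.zeros(200), np.ones(10)])   # y_i = 1 marks an outlier

clf = RandomForestClassifier(class_weight="balanced").fit(X, y)
print(clf.predict([[6.5, 5.5], [0.2, -0.1]]))  # flags the first, not the second
# a point that is anomalous in a *new* way (e.g. [-8, -8]) may still be missed
```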