Outlier Detection --- Basic

Ma Ming

Apr 7, 2017

What the definition of Outlier

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

                    -- Hawkins  
  • abnormalities, discordants, eviants
  • Extreme value, Change point

Why outliers treatment is important?

Can you get outliers?

  • 2,3,4,4,3,4,5,2,6,3,16,3,6,5
  • 1,2,3,50,97,98,99
  • 1,2,3,4,5,6,7,8,10,10,12,13,14

Three Major Types Of Abnormalies

  • Point anomalies
  • Contextual anomalies
  • Collective anomalies

Famous applications of Anomaly Detection

  • Intrusion Detection System
  • Credit Card Fraud
  • Medical Diagnosis
  • Interesting Sensor Events

Data Model is Everything

  • Generative model
  • Regression-based model
  • Proximity-based model
  • Information theoretic

How to measure the performance

  • Precision
  • Recall
  • F1 score
  • \(Precision = \frac{tp}{tp+fp}\)
  • \(Recall = \frac{tp}{tp+fn}\)
  • \(F_1 = 2\frac{Precision*Recall}{Precision + Recall}\)

Basic Outlier Detection Algorithms

Simple Generative Model

  • IQR
  • Gaussian empirical rule

Clustered Based Outlier Detection.

  • DBSCAN
  • OPTICS

Density Based Outlier Detection.

  • LOF
  • KNN

Basic Generative Outlier Detection

IQR

IQR Means Inter Quartile Range
\(IQR = Q_3 - Q_1\)

Clustered Based Outlier Detection

How to detect outliers in this

DBSCAN

DBscan core conception

  • Core point
  • Density-Reachability
  • Density-Connectedness
  • 2 parameters : MinPts,eps

DBSCAN visualize

Method for determining the optimal minPts and eps values

  • \(MinPts >= D + 1\)
  • Knn-graph for \(eps\)

Detail about DBSCAN

  • Advantages
    1. Do not require define the number of clusters, as opposed to kmean
    2. Can find arbitrarily shape
    3. Outliers is robust
  • Disadvantages
    1. Not entirely deterministic
    2. Suffered from “Curse of dimensionality”
    3. Cannot cluster data sets well with large differences in densities
  • Complexity
    • Neighbour query with a index structure : \(O(n\log(n))\)
    • Worst case \(O(n^2)\)
    • Materialized distance matrix need \(O((n^2-n)/2)\) memory size

OPTICS algorithm

  • Ordering points to identify the clustering structure (OPTICS)
  • Basic idea is similar to DBSCAN
  • Addresses one of DBSCAN’s major weaknesses: the problem of detecting meaningful clusters in data of varying density
  • Spatially closest become neighbors in the ordering
  • 2 Params : \(\epsilon\), \(MinPts\)

OPTICS Visualization

OPTICS core conception

LOF

  • LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers.
  • With LOF, the local density of a point is compared with that of its neighbors.
  • If the former is signi.cantly lower than the latter (with an LOF value greater than one), the point is in a sparser region than its neighbors, which suggests it be an outlier.

LOF Conception

  • k-distance(A) be the distance of the object A to the k-th nearest neighbor
  • \[\mbox{reachability-distance}_k(A,B)=\max\{{\mbox{k-distance}}(B),d(A,B)\}\]
  • \[{\mbox{lrd}}(A):=1/({\frac {\sum _{B\in N_{k}(A)}{\mbox{reachability-distance}}_{k}(A,B)}{|N_{k}(A)|}})\]
  • \[{\mbox{LOF}}_{k}(A):={\frac {\sum _{B\in N_{k}(A)}{\frac {{\mbox{lrd}}(B)}{{\mbox{lrd}}(A)}}}{|N_{k}(A)|}}\]

LOF Detail

LOF Example

Iris Dataset in 3d

Pairs Plot with DBSCAN Clustering

Larger bubbles in the visualization have a larger LOF

Create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

Plot the extracted OPTICS clustering

Pairs Plot with LOF

Show outliers with a biplot of the first two principal components.