Outlier Detection

Types of outliers #

With regards to the distribution #

  • Univariate: can be found when looking at a distribution of values in a single feature space.
  • Multivariate: can be found in a n-dimensional space (of n-features).

With regards to the environment #

  • Point outliers: single data points that lay far from the rest of the distribution.
  • Contextual outliers: can be noise in data, such as punctuation symbols when realizing text analysis or background noise signal when doing speech recognition.
  • Collective outliers: Collective outliers can be subsets of novelties in data such as a signal that may indicate the discovery of new phenomena.

Most common causes of outliers on a data set #

  • Data entry errors (human errors)
  • Measurement errors (instrument errors)
  • Experimental errors (data extraction or experiment planning/executing errors)
  • Intentional (dummy outliers made to test detection methods)
  • Data processing errors (data manipulation or data set unintended mutations)
  • Sampling errors (extracting or mixing data from wrong or various sources)
  • Natural (not an error, novelties in data)

Hint

Outliers that are not a product of an error are called novelties.

Before starting the detection of outliers #

  • Which and how many features am I taking into account to detect outliers ? (univariate / multivariate)
  • Can I assume a distribution(s) of values for my selected features? (parametric / non-parametric)
  • Z-Score or Extreme Value Analysis (parametric): a metric that indicates how many standard deviations a data point is from the sample’s mean, assuming a gaussian distribution.
  • Dbscan (Density Based Spatial Clustering of Applications with Noise): applied to detect outliers in nonparametric distributions in many dimensions. it is focused on finding neighbors by density (MinPts) on an ‘n-dimensional sphere’ with radius ɛ. A cluster can be defined as the maximal set of ‘density connected points’ in the feature space.
  • Isolation forests: isolation forests are an effective method for detecting outliers or novelties in data. Isolation forest’s basic principle is that outliers are few and far from the rest of the observations.
  • Probabilistic and Statistical Modeling (parametric)
  • Linear Regression Models (PCA, LMS)
  • Proximity Based Models (non-parametric)
  • Information Theory Models
  • High Dimensional Outlier Detection Methods (high dimensional sparse data)

Anomaly detection Literature #

  • Statistical AD techniques: fit a statistical model for normal behavior
  • Density-based - ex: Local Outlier Factor (LOF) and variantes (COF ODINLOCI)
  • Support estimation - OneClassSVM
  • High-dimensional techniques: - Spectral Techniques - Random Forest - Isolation Forest