These are my personal notes taken for the Machine Learning course by Stanford. Feel free to check the assignments.
Also, if you want to read my other notes, feel free to check them at my blog.
Consider a set of points in a training example (represented by the blue points) representing the regular distribution of features $x_1$ and $x_2$. The aim of anomaly detection is to separate anomalies in the test set (represented by the red points) from the normal examples, based on the distribution of features in the training example (the blue points). For example, in the plot below, while point A is not an outlier, points B and C in the test set can be considered anomalous (or outliers).
Formally, in anomaly detection the training examples $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ are considered to be normal or non-anomalous, and the algorithm must decide if the next example, $x_{test}$, is anomalous or not. So given the training set, it must come up with a model $p(x)$ that gives the probability of a sample being normal (high probability is normal, low probability is anomaly). The resulting decision boundary is defined by,

$$\begin{cases} p(x_{test}) < \epsilon & \text{flag as anomaly} \\ p(x_{test}) \geq \epsilon & \text{flag as normal} \end{cases}$$
Some of the popular applications of anomaly detection are:

- Fraud detection (e.g., flagging unusual user behaviour based on features such as login frequency or transaction patterns)
- Manufacturing (e.g., quality testing of aircraft engines)
- Monitoring computers in a data center (flagging machines with unusual CPU load, memory use, or network traffic)
The Gaussian distribution is also called the Normal distribution.
If $x \in \mathbb{R}$ and $x$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, it is denoted as $x \sim \mathcal{N}(\mu, \sigma^2)$ (the little $\sim$ or "tilde" can be read as "distributed as").
A standard normal Gaussian distribution is a bell-shaped probability distribution curve with mean $\mu = 0$ and standard deviation $\sigma = 1$, as shown in the plot below.
The parameters $\mu$ and $\sigma$ signify the centering and spread of the Gaussian curve, as marked in the plot above. It can also be seen that the density is highest around the mean and reduces rapidly as the distance from the mean increases.
The probability of $x$ in a Gaussian distribution is given by,

$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean and $\sigma^2$ is the variance ($\sigma$ is the standard deviation).
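This formula translates directly into code. Below is a minimal NumPy sketch (the function name `gaussian_pdf` is my own, not from the course):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of x under a Gaussian with mean mu and variance sigma2."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Example: density at x = 0 under the standard normal (mu = 0, sigma^2 = 1)
print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989, the peak of the standard normal
```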
The effect of the mean and the standard deviation on a Gaussian plot can be seen clearly in the figure below.
It can be noticed that while the mean $\mu$ defines the centering of the distribution, the standard deviation $\sigma$ defines the spread of the distribution. Also, as the spread increases the height of the plot decreases, because the total area under a probability distribution must always integrate to 1.
Given a dataset as in the previous section, $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, it is possible to determine the approximate (or best-fitting) Gaussian distribution by using the following parameter estimation,

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$
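As a quick sketch of these estimates in NumPy (toy data and variable names are mine; note the $\frac{1}{m}$ maximum-likelihood estimate, not the $\frac{1}{m-1}$ unbiased one):

```python
import numpy as np

x = np.array([2.1, 1.9, 2.0, 2.3, 1.7])  # toy 1-D dataset

mu = x.mean()                    # (1/m) * sum of x^(i)
sigma2 = ((x - mu) ** 2).mean()  # (1/m) * sum of (x^(i) - mu)^2
```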
Given a training set of $m$ examples $\{x^{(1)}, \ldots, x^{(m)}\}$, where each example is a vector $x \in \mathbb{R}^n$, the density is estimated as,

$$p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$
Remark:
The features are assumed to be independent of each other.
More compactly, the above expression can be written as follows:

$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$
To summarize:
1. Choose features $x_j$ that are indicative of anomalous behaviour (general properties that define an instance).
2. Fit parameters $\mu_1, \ldots, \mu_n, \sigma_1^2, \ldots, \sigma_n^2$, given by,

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)} \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$

3. Given a new example $x$, compute $p(x)$:

$$p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$

4. Flag as an anomaly if $p(x) < \epsilon$ (a code sketch of the full procedure follows this list).
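Putting the steps together, here is a minimal NumPy sketch (function names like `fit_gaussian` and `predict_anomaly` are my own, and $\epsilon$ is assumed to be chosen on a cross-validation set, as discussed in the next section):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from an (m, n) training matrix."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # uses 1/m, matching the estimates above
    return mu, sigma2

def p(X, mu, sigma2):
    """p(x): product over features of the univariate Gaussian densities."""
    densities = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

def predict_anomaly(X, mu, sigma2, epsilon):
    """Flag x as anomalous when p(x) < epsilon."""
    return p(X, mu, sigma2) < epsilon

# Usage on toy data: a well-behaved point and an obvious outlier
X_train = np.random.randn(1000, 2)
mu, sigma2 = fit_gaussian(X_train)
X_test = np.array([[0.1, -0.2], [5.0, 5.0]])
print(predict_anomaly(X_test, mu, sigma2, epsilon=1e-3))  # [False  True]
```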
A single real-valued evaluation metric helps in accepting or rejecting a design choice when improving an anomaly detection system.
In order to evaluate an anomaly detection system, it is important to have a labeled dataset (similar to a supervised learning algorithm). This dataset would generally be skewed, with a high number of normal cases. In order to evaluate the algorithm, follow these steps ($y = 0$ is normal and $y = 1$ is anomalous):

1. Split the data so that the training set contains only normal examples, and distribute the anomalous examples across the cross-validation and test sets (e.g., a 60/20/20 split of the normal examples).
2. Fit the model $p(x)$ on the training set.
3. On a cross-validation or test example $x$, predict $y = 1$ if $p(x) < \epsilon$, else $y = 0$.
4. Because the classes are skewed, evaluate using precision/recall or the $F_1$ score rather than accuracy, and pick the $\epsilon$ that performs best on the cross-validation set (a sketch follows below).
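A sketch of choosing $\epsilon$ by maximizing $F_1$ on the cross-validation set (the helper name `select_epsilon` is my own; `p_cv` holds $p(x)$ for the CV examples and `y_cv` their labels):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds; keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = p_cv < eps                      # predicted anomalies (y = 1)
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```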
A natural question arises: "If we have labeled data, why not use a supervised learning algorithm like logistic regression or an SVM?".
Use anomaly detection when...

- there is a very small number of positive ($y = 1$) examples and a large number of negative ($y = 0$) examples;
- there are many different types of anomalies, making it hard for an algorithm to learn what they look like, and future anomalies may look nothing like the ones seen so far.
Use supervised learning when...

- there is a large number of both positive and negative examples;
- there are enough positive examples for the algorithm to get a sense of what they look like, and future positives are likely to resemble the ones in the training set.
Feature engineering (i.e., choosing which features to use) greatly affects the performance of an anomaly detection algorithm.
Indeed, since the algorithm tries to fit a Gaussian distribution through the dataset, it is always helpful if the histogram of the data fed to the density estimation looks similar to a Gaussian bell shape. If the data is not in line with the shape of a Gaussian bell curve, sometimes a transformation can help bring the feature closer to a Gaussian approximation.
Some of the popular transforms used are,

- $\log(x)$
- $\log(x + c)$, where $c$ is some constant
- $\sqrt{x}$ (i.e., $x^{1/2}$)
- $x^{1/3}$
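For instance, applying these transforms to a right-skewed feature might look like the following sketch (the toy data and the offset $c = 1$ are my own choices; $c$ simply keeps the argument of the log positive):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=10000)  # right-skewed toy feature

x_log = np.log(x + 1.0)    # log(x + c) with c = 1 guards against log(0)
x_sqrt = np.sqrt(x)        # x^(1/2)
x_cbrt = x ** (1.0 / 3.0)  # x^(1/3)

# Inspect the histograms (e.g. with plt.hist) and keep whichever
# transformed feature looks closest to a Gaussian bell shape.
```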
The density estimation seen earlier had the underlying assumption that the features are independent of each other. While the assumption simplifies the analysis, there are various downsides to the assumption as well.
Consider the data as shown in the plot below. It can be seen clearly that there is some correlation (negative correlation, to be exact) between the two features.
Univariate Gaussian distribution ("normal distribution" or "bell curve") applied to this data results in the following contour plot. While the two features are negatively correlated, the contour plot does not show any such dependency. On the contrary, if a multivariate Gaussian distribution is applied to the same data, one can point out the correlation. Seeing the difference, it is also clear that the chance of the test examples (red points) being marked as normal is lower with the multivariate Gaussian than with the univariate one.
So, the multivariate Gaussian distribution basically helps model $p(x)$ in one go, unlike the univariate Gaussian that models the individual features $p(x_j)$.
The multivariate Gaussian distribution is given by,

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

where $\mu \in \mathbb{R}^n$ is the mean vector, $\Sigma \in \mathbb{R}^{n \times n}$ is the covariance matrix, and $|\Sigma|$ is the determinant of $\Sigma$.
The density estimation for the multivariate Gaussian distribution can be done using the following two formulas:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)\left(x^{(i)} - \mu\right)^T$$
Steps in multivariate density estimation:

1. Fit the model by estimating $\mu$ and $\Sigma$ from the training set, as above.
2. Given a new example $x$, compute $p(x; \mu, \Sigma)$.
3. Flag an anomaly if $p(x) < \epsilon$.
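A minimal NumPy sketch of these steps (function names are mine; in practice `scipy.stats.multivariate_normal` could replace the manual density formula):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate mean vector mu and covariance matrix Sigma from an (m, n) matrix."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / X.shape[0]  # (1/m) * sum of outer products
    return mu, Sigma

def multivariate_p(X, mu, Sigma):
    """p(x; mu, Sigma) for each row of X."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
    # Row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return norm * np.exp(-0.5 * quad)
```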
The covariance matrix is the term that brings in the major difference between the univariate and the multivariate gaussian. The effect of covariance matrix and mean shifting can be seen in the plots below.
A covariance matrix is always symmetric about the main diagonal.
Also, the original model in univariate gaussian is a special case of the multivariate gaussian distribution where the off-diagonal elements of the covariance matrix are constrained to be zero (contours are axis aligned).
Univariate vs Multivariate Gaussian Distribution:

| Univariate Gaussian | Multivariate Gaussian |
| --- | --- |
| Need to manually create features to capture correlations between features | Automatically captures correlations between features |
| Computationally cheaper; scales to large $n$ | Computationally more expensive ($\Sigma$ is $n \times n$ and must be inverted) |
| Works even for a small training set | Requires $m > n$ (in practice $m \geq 10n$), or $\Sigma$ is non-invertible |
Remark: A matrix $\Sigma$ might be singular because of the presence of redundant features, i.e. two features are linearly dependent, or a feature is a linear combination of a set of other features. Such matrices are non-invertible.
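A tiny sketch of how a redundant feature makes $\Sigma$ singular (the toy construction is my own):

```python
import numpy as np

x1 = np.random.randn(100)
X = np.column_stack([x1, 2 * x1])  # second feature is a linear copy of the first

Sigma = np.cov(X, rowvar=False)
print(np.linalg.det(Sigma))          # ~0: Sigma is singular, so inv(Sigma) fails
print(np.linalg.matrix_rank(Sigma))  # 1 instead of 2 -> drop the redundant feature
```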