---
tags: machine-learning
---

# Anomaly Detection

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/naive-bayes-gaussian-discriminant-analysis/1.png" height="100%" width="70%">
</div>

> These are my personal notes taken for the course [Machine learning](https://www.coursera.org/learn/machine-learning#syllabus) by Stanford. Feel free to check the [assignments](https://github.com/3outeille/Coursera-Labs).
> Also, if you want to read my other notes, feel free to check them at my [blog](https://ferdinandmom.engineer/machine-learning/).

# I) Anomaly Detection System

Consider a set of points, $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$, in a training set (represented by the blue points) representing the regular distribution of the features $x_1^{(i)}$ and $x_2^{(i)}$.

The aim of anomaly detection is to separate anomalies in the test set (represented by the red points) from the normal examples, based on the distribution of the features in the training set. For example, in the plot below, while point A is not an outlier, points B and C in the test set can be considered **anomalous (or outliers)**.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/naive-bayes-gaussian-discriminant-analysis/2.png">
</div>
<br>

Formally, in anomaly detection the $m$ training examples are considered to be normal or non-anomalous, and the algorithm must decide whether the next example, $x_{test}$, is anomalous or not. So given the training set, it must come up with a model $p(x)$ that gives the probability of a sample being normal (high probability means normal, low probability means anomaly). The resulting decision rule is defined by,

$$\boxed{
\begin{array}{ll}
p(x_{test}) < \epsilon & \text{, flag as outlier or anomaly} \\
p(x_{test}) \geq \epsilon & \text{, flag as normal or non-anomalous}
\end{array}
}$$

Some of the popular applications of anomaly detection are:

- **Fraud detection**: An observation set $x^{(i)}$ would represent user $i$'s activities. The model $p(x)$ is trained on the data from various users, and unusual users are identified by checking which of them have $p(x^{(i)}) < \epsilon$.
- **Manufacturing**: Based on features of products produced on a production line, one can identify the ones with outlier characteristics for quality control and other such preventive measures.
- **Monitoring systems in a data center**: Based on characteristics of a machine's behaviour, such as CPU load or memory usage, it is possible to identify anomalous machines, prevent failure of nodes in a data center, and initiate diagnostic measures for maximum up-time.

# II) Gaussian distribution

The **Gaussian distribution** is also called the normal distribution. If $x \in \mathbb{R}$ and $x$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, we write $x \sim \mathcal{N}(\mu, \sigma^2)$ (the tilde "$\sim$" can be read as "distributed as").

A **standard normal Gaussian distribution** is a bell-shaped probability distribution curve with mean $\mu = 0$ and standard deviation $\sigma = 1$, as shown in the plot below.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/naive-bayes-gaussian-discriminant-analysis/3.png">
</div>
<br>

The parameters $\mu$ and $\sigma$ signify the centering and spread of the Gaussian curve, as marked in the plot above. It can also be seen that the density is higher around the mean and falls off rapidly as the distance from the mean increases.

The probability of $x$ under a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ is given by,

$$\boxed{p(x;\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(- \frac{(x - \mu)^2}{2\sigma^2}\right)}$$

where,

- $\mu$ is the mean.
- $\sigma$ is the standard deviation ($\sigma^2$ is the variance).

The effect of the mean and the standard deviation on a Gaussian plot can be seen clearly in the figure below.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/naive-bayes-gaussian-discriminant-analysis/4.png">
</div>
<br>

Notice that while the mean $\mu$ defines the centering of the distribution, the standard deviation $\sigma$ defines its spread. Also, as the spread increases the height of the plot decreases, because the total area under a probability density must always integrate to 1.

Given a dataset, as in the previous section, $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$, it is possible to determine the most fitting Gaussian distribution by using the following **parameter estimation**,

$$\boxed{
\begin{array}{ll}
\mu &= {1 \over m} \sum_{i=1}^m x^{(i)} \\
\sigma^2 &= {1 \over m} \sum_{i=1}^m (x^{(i)} - \mu)^2
\end{array}
}$$
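As a small illustration, here is a minimal NumPy sketch of this parameter estimation (the toy array `x` and the helper name `gaussian_pdf` are my own, not from the course):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Evaluate the univariate Gaussian density N(mu, sigma2) at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Toy 1-D training data (hypothetical values).
x = np.array([2.1, 1.9, 2.0, 2.3, 1.8, 2.2])

# Parameter estimation with the 1/m formulas above (maximum likelihood).
mu = x.mean()
sigma2 = ((x - mu) ** 2).mean()

print(gaussian_pdf(2.0, mu, sigma2))  # density at x = 2.0
```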
# III) Density Estimation Algorithm

Given a training set of examples, $\lbrace x^{(1)}, \dots, x^{(m)} \rbrace$, where each example is a vector $x \in \mathbb{R}^n$, the density is modeled as a product of univariate Gaussians, one per feature:

$$p(x) = p(x_1;\mu_1,\sigma_1^2)p(x_2;\mu_2,\sigma^2_2)\cdots p(x_n;\mu_n,\sigma^2_n)$$

<ins>**Remark:**</ins> This assumes the features $\lbrace x_1, \dots, x_n \rbrace$ are independent of each other.

More compactly, the above expression can be written as follows:

$$\boxed{p(x) = \displaystyle \prod^n_{j=1} p(x_j;\mu_j,\sigma_j^2)}$$

<ins>**To summarize:**</ins>

1. Choose features $x_j$ that are indicative of anomalous behaviour (general properties that define an instance).
2. Fit the parameters $\mu_1, \cdots, \mu_n, \sigma_1^2, \cdots, \sigma_n^2$, given by,
$$\boxed{
\begin{array}{ll}
\mu_j &= {1 \over m} \sum_{i=1}^m x_j^{(i)} \\
\sigma_j^2 &= {1 \over m} \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2
\end{array}
}$$
3. Given a new example $x$, compute $p(x)$:
$$\boxed{p(x) = \displaystyle \prod^n_{j=1} p(x_j;\mu_j,\sigma_j^2) = \prod\limits^n_{j=1} \dfrac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\dfrac{(x_j - \mu_j)^2}{2\sigma^2_j}\right)}$$
4. Flag as an **anomaly** if $p(x) < \epsilon$ (a sketch of the whole procedure follows below).
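Putting the four steps together, a minimal NumPy sketch might look as follows (the function names `fit_params` and `density`, the toy data, and the value of `epsilon` are all assumptions for this example):

```python
import numpy as np

def fit_params(X):
    """Step 2: per-feature means and variances of an (m, n) training matrix."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def density(X, mu, sigma2):
    """Step 3: p(x) as the product of the per-feature Gaussian densities."""
    p_j = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return p_j.prod(axis=1)

# Hypothetical data: m = 6 normal examples with n = 2 features.
X_train = np.array([[2.0, 3.1], [2.1, 2.9], [1.9, 3.0],
                    [2.2, 3.2], [1.8, 2.8], [2.0, 3.0]])
mu, sigma2 = fit_params(X_train)

# Step 4: flag new examples whose density falls below epsilon.
X_test = np.array([[2.0, 3.0], [5.0, 0.5]])
epsilon = 1e-3  # in practice, tuned on a labeled cross-validation set (next section)
print(density(X_test, mu, sigma2) < epsilon)  # [False  True]
```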
# IV) Evaluation of Anomaly Detection System

A single real-valued evaluation metric helps in accepting or rejecting a candidate change to an anomaly detection system. In order to evaluate an anomaly detection system, it is important to have a labeled dataset (similar to a supervised learning setting). This dataset will generally be skewed, with a large number of normal cases.

To evaluate the algorithm, follow these steps ($y=0$ is normal and $y=1$ is anomalous):

1. Split the examples with $y=0$ into 60%-20%-20% (training / cross-validation / test set).
2. Split the examples with $y=1$ into 50%-50% (cross-validation / test set).
3. Run the **density estimation algorithm** on the training set.
4. Check the performance on the cross-validation set using metrics such as true positives, true negatives, false positives, false negatives, precision/recall, and F1-score. **Accuracy would not be a valid metric because in most cases the classes are highly skewed.**
5. Following this, the value of $\epsilon$ can be tuned on the cross-validation set to improve the desired metric from the previous step (see the sketch after this list).
6. The evaluation of the final model on the held-out test set gives an unbiased picture of how the model performs.
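As an illustration of step 5, here is a minimal sketch of selecting $\epsilon$ by maximizing the F1-score on the cross-validation set (the helper name `select_epsilon`, the candidate grid of 1000 thresholds, and the inputs `p_cv`/`y_cv` are my own assumptions; `p_cv` would come from a density function like the one sketched in Section III):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds and keep the one with the best F1 on the CV set.

    p_cv: densities p(x) of the cross-validation examples
    y_cv: labels (1 = anomalous, 0 = normal)
    """
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = p_cv < eps                      # predicted anomalies
        tp = np.sum((preds == 1) & (y_cv == 1))
        fp = np.sum((preds == 1) & (y_cv == 0))
        fn = np.sum((preds == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```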
# V) Anomaly Detection vs Supervised Learning

A natural question arises: "If we have labeled data, why not use a supervised learning algorithm like logistic regression or an SVM?"

Use **anomaly detection** when...

1. We have a **very small number of positive examples** ($y=1$; 0-20 examples is common) and a **large number of negative examples** ($y=0$).
2. We have many different "types" of anomalies and it is hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.

Use **supervised learning** when...

1. We have a **large number of both positive and negative examples**. In other words, the training set is more evenly divided into classes.
2. We have enough positive examples for the algorithm to get a sense of what new positive examples look like. The future positive examples are likely to be similar to the ones in the training set.

# VI) Feature Engineering

Feature engineering (choosing which features to use) has a great deal of effect on the performance of an anomaly detection algorithm. Indeed, since the algorithm tries to fit a Gaussian distribution to the dataset, it is always helpful if the histogram of the data fed to the density estimation looks similar to a Gaussian bell shape. If the data is not in line with the shape of a Gaussian bell curve, a transformation can sometimes help bring the feature closer to a Gaussian. Some of the popular transforms are:

1. $\log(x)$
2. $\log(x + c)$
3. $\sqrt{x}$
4. $x^{\frac{1}{3}}$

Another useful technique: look at the anomalies that the algorithm fails to flag, and see if they inspire some new feature that makes it easier to distinguish the anomalies from the good examples.

# VII) Multivariate Gaussian Distribution

The density estimation seen earlier had the underlying assumption that the features are independent of each other. While this assumption simplifies the analysis, it has various downsides as well. Consider the data shown in the plot below. It can be seen clearly that there is some correlation (negative correlation, to be exact) between the two features.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/naive-bayes-gaussian-discriminant-analysis/5.png">
</div>
<br>

A univariate Gaussian distribution ("normal distribution" or "bell curve") applied to this data results in the following contour plot: although the two features are negatively correlated, the contours do not show any such dependency. On the contrary, if a multivariate Gaussian distribution is applied to the same data, the correlation becomes apparent. Seeing the difference, it is also clear that the chance of the test points (red points) being marked as normal is lower with the multivariate Gaussian than with the univariate one.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/naive-bayes-gaussian-discriminant-analysis/6.png">
</div>
<br>

So, the **multivariate Gaussian** distribution models $p(x)$ in one go, unlike the **univariate Gaussian** that models the individual features $\{x_1, x_2, \cdots, x_n\}$ of $x$ separately. The multivariate Gaussian distribution is given by,

$$\boxed{p(x; \mu, \Sigma) = \frac {1} {(2\pi)^{n/2} \, |\Sigma|^{1/2}} \exp\left(-{1 \over 2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)}$$

where,

1. $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ are the parameters of the distribution.
2. $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$.

The density estimation for the multivariate Gaussian distribution can be done using the following two formulas:

$$\boxed{
\begin{array}{ll}
\mu = {1 \over m} \sum_{i=1}^m x^{(i)} \\
\Sigma = {1 \over m} \sum_{i=1}^m (x^{(i)} - \mu) (x^{(i)} - \mu)^T
\end{array}
}$$

<ins>**Steps in multivariate density estimation:**</ins>

1. Given a training dataset, estimate the parameters $\mu$ and $\Sigma$.
2. For a new example $x$, compute $p(x)$.
3. Flag as an anomaly if $p(x) < \epsilon$ (a sketch follows at the end of this section).

The covariance matrix is the term that brings in the major difference between the univariate and the multivariate Gaussian. The effect of the covariance matrix and of mean shifting can be seen in the plots below. A covariance matrix is always symmetric about its main diagonal.

<div style="text-align:center">
<img src="https://raw.githubusercontent.com/valoxe/image-storage-1/master/blog-machine-learning/naive-bayes-gaussian-discriminant-analysis/7.png">
</div>
<br>

1. The mean shifts the center of the distribution.
2. The diagonal elements give the spread of the distribution along the corresponding features (the variances).
3. The off-diagonal elements give the correlations among the features.

Also, the original model $p(x)$ of the univariate Gaussian is a special case of the multivariate Gaussian distribution in which the off-diagonal elements of the covariance matrix are constrained to be zero (the contours are axis-aligned).
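Here is a minimal NumPy sketch of the three estimation-and-scoring steps above (the function names `fit_multivariate` and `multivariate_density` and the toy data are my own assumptions):

```python
import numpy as np

def fit_multivariate(X):
    """Estimate mu (n,) and Sigma (n, n) from an (m, n) training matrix."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]   # (1/m) * sum of outer products
    return mu, Sigma

def multivariate_density(X, mu, Sigma):
    """Evaluate p(x; mu, Sigma) for each row of X."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(Sigma)           # requires Sigma to be non-singular
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = np.sum(diff @ inv * diff, axis=1)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

# Hypothetical correlated data: the second test point breaks the correlation.
X_train = np.array([[1.0, 2.0], [2.0, 3.9], [3.0, 6.1], [4.0, 8.0], [5.0, 9.9]])
mu, Sigma = fit_multivariate(X_train)
p = multivariate_density(np.array([[3.0, 6.0], [3.0, 2.0]]), mu, Sigma)
print(p)  # the second point has a much lower density
```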
# VIII) Univariate vs Multivariate Gaussian Distribution

<ins>**Univariate vs Multivariate Gaussian Distribution:**</ins>

1. With the univariate model, one manually creates features to capture anomalies where the existing features take unusual combinations of values, whereas the multivariate Gaussian automatically captures the correlations between features.
2. The univariate model is computationally cheaper and hence scales well to a large number of features ($n = 10{,}000 \sim 100{,}000$), whereas the multivariate model is computationally expensive, mainly because of the term $\Sigma^{-1}$.
3. The univariate model works well even for small $m$. For the multivariate model, one must have $m > n$, or else $\Sigma$ is singular and hence not invertible.
4. Generally, the multivariate Gaussian is used when $m$ is much bigger than $n$, say $m > 10n$, because $\Sigma$ is a fairly large matrix with around $\frac{n^2}{2}$ parameters, which are learnt better with a larger $m$.

<ins>**Remark:**</ins> A matrix might be singular because of the presence of redundant features, i.e. two features that are linearly dependent, or a feature that is a linear combination of a set of other features. Such matrices are non-invertible.
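As a quick demonstration of this remark (with hypothetical data), duplicating a feature makes the estimated $\Sigma$ singular:

```python
import numpy as np

# Hypothetical data with a redundant feature: column 1 is exactly 2 * column 0.
x0 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x0, 2 * x0])

mu = X.mean(axis=0)
diff = X - mu
Sigma = diff.T @ diff / X.shape[0]

print(np.linalg.det(Sigma))      # 0.0 -> Sigma is singular
# np.linalg.inv(Sigma) would raise LinAlgError: Singular matrix
```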