# Chapter 12: Outlier Detection
###### tags: `Data Mining Notes`
## Outliers and Outlier Analysis
* **Outliers**: objects that deviate significantly from the rest of the data set, as if generated by a different mechanism; they are unlikely to follow the same distribution as the other objects.

* c.f. **noise**: A random error or variance in a measured variable
### Types of Outliers
* **Global Outliers**
- Deviates significantly from the rest of the data set.
* **Contextual Outliers**
- a.k.a. conditional outliers
- Deviates significantly with respect to a specific context of the object.
* Attributes of the data objects:
- Contextual attributes:
+ Define the object’s context.
        + e.g. for temperature data: date and location
- Behavioral attributes:
+ Define the object’s characteristics.
        + e.g. for temperature data: temperature, humidity, and pressure
* **Local outliers**
- Deviates from the local area in which it occurs.
* **Collective Outliers**
- The objects as a whole deviate significantly from the entire data set. Importantly, the individual data objects may not be outliers.
### Challenges of Outlier Detection
* Modeling normal objects and outliers effectively
* Application-specific outlier detection
* Handling noise in outlier detection
* Understandability
## Outlier Detection Methods
### Supervised Methods
* The task is to learn a classifier that can recognize outliers.
* Challenges:
- The two classes (i.e., normal objects versus outliers) are imbalanced
        + Solution: over-sample the outlier class (a code sketch follows this list)
- Catching as many outliers as possible
        + Recall is more important than accuracy
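
A minimal sketch of the over-sampling idea above, on synthetic 2-D data; the sizes and distributions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical labeled data: 990 normal objects, 10 outliers.
X_normal = rng.normal(0.0, 1.0, size=(990, 2))
X_outlier = rng.normal(6.0, 1.0, size=(10, 2))

# Over-sample the outlier class (with replacement) until the
# two classes are roughly balanced.
idx = rng.integers(0, len(X_outlier), size=len(X_normal))
X_outlier_up = X_outlier[idx]

X = np.vstack([X_normal, X_outlier_up])
y = np.concatenate([np.zeros(len(X_normal)), np.ones(len(X_outlier_up))])
# X, y can now be fed to any standard classifier.
```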
### Unsupervised Methods
* Normal objects are assumed to form groups; an outlier is expected to occur far away in feature space from any of these groups.
* Challenges:
    - Collective outliers (a group of outliers is hard to detect)
- A data object not belonging to any cluster may be noise or an outlier.
- Often costly to find clusters first and then find outliers.
### Semi-Supervised Methods
* Often rely on models of normal objects learned by unsupervised methods.
## Approaches
### Statistical Approaches
* A Gaussian distribution $g_D$ can be used to model the normal data
* If $g_D(y)$ is very low, $y$ is unlikely to have been generated by the Gaussian model, and thus is an outlier.
#### Parametric Methods
* Normal distribution (univariate):
* $\hat{\mu} = \bar{x} = \frac{1}{n}\displaystyle \sum_{i=1}^n x_i$
* $\hat{\sigma}^2 = \frac{1}{n}\displaystyle \sum_{i=1}^n (x_i-\bar{x})^2$
* Values outside $\mu \pm 3\sigma$ can be recognized as outliers (a code sketch follows this list)
* **Mahalanobis distance** from $o$ to $\bar{o}$ (multivariate):
* $MDist(o, \bar{o})=(o-\bar{o})^TS^{-1}(o-\bar{o})$, where $S$ is the covariance matrix
* Steps:
1. Calculate the mean vector from the multivariate data set.
2. For each object $o$, calculate $MDist(o, \bar{o})$, the Mahalanobis distance from $o$ to $\bar{o}$.
3. Detect outliers in the transformed univariate data set, $\{MDist(o, \bar{o})\mid o \in D\}$.
4. If $MDist(o, \bar{o})$ is determined to be an outlier, then $o$ is regarded as an outlier as well.
* $\chi^2$-statistic
* $\chi^2 = \displaystyle\sum_{i=1}^n\frac{(o_i-E_i)^2}{E_i}$
    * $o_i$ is the value of $o$ on the $i$th dimension
    * $E_i$ is the mean of the $i$th dimension across the data set
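
A minimal sketch of the two parametric checks above (the $3\sigma$ rule and the Mahalanobis-distance steps) on synthetic data; NumPy is an implementation choice, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(0.0, 1.0, size=(500, 2))   # multivariate data set
x = D[:, 0]                               # one attribute for the univariate case

# Univariate: flag values outside mu +/- 3*sigma.
mu, sigma = x.mean(), x.std()
univariate_outliers = x[np.abs(x - mu) > 3 * sigma]

# Multivariate, Step 1: mean vector o_bar.
o_bar = D.mean(axis=0)
# Step 2: MDist(o, o_bar) = (o - o_bar)^T S^{-1} (o - o_bar) for every o.
S_inv = np.linalg.inv(np.cov(D, rowvar=False))
diff = D - o_bar
mdist = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# Steps 3-4: treat {MDist} as univariate data and reuse the 3*sigma rule.
multivariate_outliers = D[mdist > mdist.mean() + 3 * mdist.std()]
```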
#### Non-Parametric Methods
* **Histogram**
    - e.g. in a histogram of purchase amounts, a transaction of \$7,500 falls in a nearly empty bin and can be regarded as an outlier
- Step 1: Histogram construction
+ Do not assume any a priori statistical model
        + But often need user-specified parameters (e.g., the bin width) to learn the model
- Step 2: Outlier detection
        + An object's outlier score can be the inverse of the volume of the bin in which the object falls (sketched in code after this list)
        + e.g. \$7,500: $\frac{1}{0.2\%}=500$
        + \$385: $\frac{1}{60\%}\approx 1.67$
    - Kernel density estimation is usually used to estimate the probability density of the data
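
A minimal sketch of the histogram score, using the fraction of objects per bin as in the \$7,500 example above; the amounts and bin count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic transaction amounts: mostly small purchases, a few huge ones.
amounts = np.concatenate([rng.normal(400.0, 100.0, 995), [7500.0] * 5])

counts, edges = np.histogram(amounts, bins=20)
frac = counts / counts.sum()              # fraction of objects in each bin

# Bin index of each object (clip so the maximum falls in the last bin).
bin_idx = np.clip(np.digitize(amounts, edges) - 1, 0, len(counts) - 1)

# Outlier score = 1 / (fraction of the object's bin), as in the example above.
scores = 1.0 / frac[bin_idx]
print(scores[-1] > scores[0])             # the $7,500 purchases score highest
```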
### Proximity-Based Approaches
#### **Distance-based**:
* Nested Loop Method
    - An object $o$ is a $DB(r, \pi)$-outlier if $\frac{\parallel \{o'\mid dist(o,o')\leq r\}\parallel}{\parallel D\parallel}\leq \pi$ (a code sketch follows this list)
    - $r\ (r\geq 0)$: distance threshold, $\pi\ (0<\pi<1)$: fraction threshold
* Grid-Based Method:
    - Given any possible point, $x$ of $C$, and any possible point, $y$, then
        + $y$ in a level-1 cell (immediately next to $C$): $dist(x,y) \leq r$
        + $y$ beyond the level-2 cells (two or three cells away from $C$): $dist(x,y) > r$, so any object within distance $r$ of $x$ must lie in $C$, a level-1 cell, or a level-2 cell
    - Pruning rules:
        + Let $a$ be the number of objects in cell $C$, $b_1$ the number of objects in the level-1 cells, and $b_2$ the number of objects in the level-2 cells
        + Level-1 cell pruning: if $a+b_1 > \lceil \pi n\rceil$, then no object $o$ in $C$ is a $DB(r, \pi)$-outlier
        + Level-2 cell pruning: if $a+b_1+b_2 < \lceil \pi n\rceil +1$, then all objects in $C$ are $DB(r, \pi)$-outliers
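
A minimal sketch of the nested-loop test referenced above, on synthetic data (the grid-based pruning is omitted; `r` and `pi` are illustrative):

```python
import numpy as np

def db_outliers(D, r, pi):
    """Flag o as a DB(r, pi)-outlier when the fraction of objects
    within distance r of o is at most pi."""
    n = len(D)
    flags = np.empty(n, dtype=bool)
    for i, o in enumerate(D):                            # outer loop over objects
        neighbors = np.linalg.norm(D - o, axis=1) <= r   # inner "loop", vectorized
        flags[i] = neighbors.sum() / n <= pi
    return flags

rng = np.random.default_rng(2)
D = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[8.0, 8.0]]])
print(np.nonzero(db_outliers(D, r=2.0, pi=0.05))[0])  # includes the planted far point
```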
#### **Density-based**:
* e.g. points $o_1$ and $o_2$ (outliers relative to the density of their nearby clusters) can be identified as outliers
* A distance-based method cannot identify such local outliers
* $dist_k(o)$: the distance between $o$ and its k-nearest neighbor
* **k-distance neighborhood**: $N_k(o) = \{o'|o' \in D, dist(o, o')\leq dist_k(o)\}$
* **Reachability distance** from $o'$ to $o$:
    - $reachdist_k(o \leftarrow o') = \max\{dist_k(o),\ dist(o,o')\}$
- Reachability distance is not symmetric
* **Local reachability density**:
- $lrd_k(o)=\frac{\parallel N_k(o)\parallel}{\displaystyle\sum_{o'\in N_k(o)} reachdist_k(o'\leftarrow o)}$
* **Local outlier factor**:
    - The local outlier factor is the average ratio of the local reachability densities of $o$'s k-nearest neighbors to the local reachability density of $o$ (sketched below with scikit-learn):
    - $LOF_k(o)=\frac{\displaystyle\sum_{o'\in N_k(o)}\frac{lrd_k(o')}{lrd_k(o)}}{\parallel N_k(o)\parallel}$
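
LOF is implemented in scikit-learn; a minimal sketch on synthetic data with one planted local outlier (the library and parameters are my choice, not the text's):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
# A tight cluster, a loose cluster, and one point just outside the tight one:
# globally unremarkable, but a clear local outlier.
X = np.vstack([rng.normal(0.0, 0.1, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2)),
               [[0.8, 0.8]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_   # larger score = more outlying
print(labels[-1], scores[-1])            # the planted point gets a high LOF
```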
### Clustering-Based Approaches
* Detecting outliers as objects that do not belong to any cluster.
    - e.g. use DBSCAN: objects labeled as noise belong to no cluster
* Clustering-based outlier detection using distance to the closest cluster.
    - e.g. k-means clustering: score each object by its distance to the closest cluster center (both ideas are sketched in code after this list)
* **FindCBLOF** algorithm
- Cluster-based local outlier factor (CBLOF)
- Step 1. Find clusters in a data set, and sort them according to decreasing size.
- Step 2. CBLOF is the product of the cluster’s size and the similarity between the point and the cluster.
    - The points with the lowest CBLOF scores are suspected outliers.
* **Fixed-width clustering**:
    - Finding clusters can be computationally expensive, so a linear-time algorithm is needed.
- A point is assigned to a cluster if the center of the cluster is within a predefined distance threshold from the point.
- Distance threshold may be learned from the training data.
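
A minimal sketch of the first two ideas above (DBSCAN for objects in no cluster, k-means distance as an outlier score); all parameters and data are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.3, size=(150, 2)),
               rng.normal(5.0, 0.3, size=(150, 2)),
               [[2.5, 9.0]]])                    # far from both clusters

# Idea 1: objects that belong to no cluster (DBSCAN labels them -1).
noise_mask = DBSCAN(eps=0.5, min_samples=5).fit_predict(X) == -1

# Idea 2: distance to the closest k-means centroid as an outlier score.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
score = km.transform(X).min(axis=1)              # distance to nearest center
print(noise_mask[-1], score[-1])                 # the planted point stands out
```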
### Classification-Based Approaches
* **Supervised**:
    - Train a classifier on points labeled "normal" versus points labeled "outlier" (the white and black points in the book's figure)
* **Semi-supervised**:
    - Only some objects are labeled (often just normal ones); learn a model of the normal class and flag objects that do not fit it (see the one-class sketch below)
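
One common way to realize the semi-supervised setting is a one-class model trained only on the labeled normal objects; a sketch using scikit-learn's OneClassSVM (the specific model and parameters are my choice, not the text's):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_normal = rng.normal(0.0, 1.0, size=(300, 2))   # only "normal" labels exist
X_new = np.array([[0.2, -0.5], [6.0, 6.0]])      # unseen objects to score

# Fit a boundary around the normal class; whatever falls outside
# the boundary is a candidate outlier.
oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_normal)
print(oc.predict(X_new))                          # +1 = normal, -1 = outlier
```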
## Mining Contextual and Collective Outliers
### Transforming Contextual Outlier Detection to Conventional Outlier Detection
* Step 1. Identify the context of the object using the contextual attributes.
* Step 2. Calculate the outlier score for the object in the context using a conventional outlier detection method.
* May learn a mixture model, $U$, of the data on the contextual attributes, and another mixture model, $V$, of the data on the behavioral attributes (sketched in code below).
* Outlier score: $S(o)=\displaystyle\sum_{U_j}p(o\in U_j)\displaystyle\sum_{V_i}p(o\in V_i)p(V_i|U_j)$
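
A minimal sketch of this score, assuming Gaussian mixtures for $U$ and $V$; estimating $p(V_i \mid U_j)$ from soft co-memberships is my assumption, as the text only requires that these conditional probabilities be learned somehow:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
ctx = rng.normal(size=(500, 2))                  # contextual attributes
beh = rng.normal(size=(500, 2))                  # behavioral attributes

U = GaussianMixture(n_components=3, random_state=0).fit(ctx)
V = GaussianMixture(n_components=3, random_state=0).fit(beh)
pU = U.predict_proba(ctx)                        # p(o in U_j), shape (n, 3)
pV = V.predict_proba(beh)                        # p(o in V_i), shape (n, 3)

# Assumed estimate of p(V_i | U_j): normalized soft co-occurrence counts.
P = pU.T @ pV
P /= P.sum(axis=1, keepdims=True)

# S(o) = sum_j p(o in U_j) * sum_i p(o in V_i) * p(V_i | U_j)
S = np.einsum("nj,ni,ji->n", pU, pV, P)
# Objects with low S(o) are contextual outlier candidates.
```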
### Mining Collective Outliers
* Challenge: Exploring the structures in data
* First method: Reduce the problem to conventional outlier detection
* Second method: Model the expected behavior of structural units directly.
## Outlier Detection in High-Dimensional Data
### HilOut algorithm
* For each object $o$, $HilOut$ finds the k-nearest neighbors of $o$ (approximated with Hilbert space-filling curves for efficiency).
* The weight of object $o$ (computed in the sketch below):
- $w(o)=\displaystyle\sum_{i=1}^k dist(o, nn_i(o))$
* To reduce dimensionality, use principal components analysis (PCA)
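
A minimal sketch of the weight computation with exact k-NN (the real HilOut approximates the neighbors via Hilbert curves for speed; data and $k$ are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
D = rng.normal(size=(300, 5))
k = 10

# Ask for k+1 neighbors because each point is its own nearest neighbor.
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(D).kneighbors(D)

# w(o) = sum of distances to the k nearest neighbors of o.
w = dists[:, 1:].sum(axis=1)
top_outliers = np.argsort(w)[-10:]               # heaviest points = candidates
```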
### Finding Outliers in Subspaces
* Grid-based subspace outlier detection method
    - If, in a subspace, we find an area with a density much lower than average, the area may contain outliers.
    - First, discretize the data into a grid in an equal-depth way: each dimension is partitioned into $\phi$ equal-depth ranges, so each range contains a fraction $f=\frac{1}{\phi}$ of the objects.
    - Next, find sparse regions (sketched in code after this list).
        + **Sparsity coefficient** of a k-dimensional cube $C$ containing $n(C)$ objects: $S(C)=\frac{n(C)-f^kn}{\sqrt{f^k(1-f^k)n}}$
        + The smaller (more negative) the coefficient, the sparser the cube.
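
A minimal sketch of the sparsity coefficient on a 2-D subspace (so $k=2$); the equal-depth grid is built from per-dimension quantiles, and $\phi$ is illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 2))                   # a 2-D subspace, so k = 2
phi = 10
f = 1.0 / phi
n, k = X.shape

# Equal-depth discretization: each 1-D range holds a fraction f of objects.
inner = [np.quantile(X[:, d], np.linspace(0, 1, phi + 1))[1:-1] for d in range(k)]
cells = np.stack([np.digitize(X[:, d], inner[d]) for d in range(k)], axis=1)

# Count n(C) for every occupied k-dimensional cube and score it.
cubes, counts = np.unique(cells, axis=0, return_counts=True)
S = (counts - f**k * n) / np.sqrt(f**k * (1 - f**k) * n)
print(cubes[np.argmin(S)], S.min())              # sparsest cube (most negative S)
```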
### Modeling High-Dimensional Outliers
* Angle-based outlier detection (ABOD):
    - For each point $o$, examine the angle $\angle xoy$ for every pair of other points $x$ and $y$. For an outlier (e.g., a point $c$ far outside the data cloud), the variance of these angles is substantially smaller.
- Angle-based outlier factor (ABOF):
- $$ABOF(o)=VAR_{x,y\in D,x\neq o,y\neq o}\frac{\langle \overrightarrow{ox},\overrightarrow{oy}\rangle}{dist(o,x)^2 dist(o,y)^2}$$
    - Complexity: $O(n^3)$ (a naive sketch follows)
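
A naive sketch of ABOF for a single point, directly following the formula above (quadratic in $n$ per point, hence cubic overall; the data are synthetic):

```python
import numpy as np

def abof(D, i):
    """Variance over pairs (x, y) of the weighted angle term at point D[i]."""
    o = D[i]
    others = np.delete(D, i, axis=0)
    vals = []
    for a in range(len(others)):
        for b in range(a + 1, len(others)):
            ox, oy = others[a] - o, others[b] - o
            # <ox, oy> / (dist(o,x)^2 * dist(o,y)^2)
            vals.append((ox @ oy) / ((ox @ ox) * (oy @ oy)))
    return np.var(vals)

rng = np.random.default_rng(9)
D = np.vstack([rng.normal(size=(60, 2)), [[10.0, 10.0]]])
print(abof(D, len(D) - 1) < abof(D, 0))          # the outlier's ABOF is smaller
```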