# Chp 10 Cluster Analysis: Basic Concepts and Methods
###### tags: `Data Mining 心得`
## Cluster Analysis (clustering) Introduction
### What Is Cluster Analysis?
* The process of **partitioning** a set of data objects (or observations) into subsets (clusters).
* Objects within the same cluster are similar to one another,
* while objects in different clusters are supposed to be dissimilar.
>[color=#00a000]
>
>>Clustering is known as **unsupervised learning** because the class label information is not present. For this reason, clustering is a form of **learning by observation**, rather than learning by examples.
>>[name=Book]
>
>The "label information" mentioned here refers to particular (class) attributes of the data, which are not necessarily the attributes used as the basis for clustering; clustering itself usually groups the data by attributes that are already known.
>
>One difference between clustering and classification is that clustering can only provide "the degree of similarity between data objects", whereas classification "labels new data by their similarity to existing data whose classes are already known".
>[name=Omnom]
>
### Requirements for Cluster Analysis
1. **Scalability**
Clustering on only a sample of a given large data set may lead to biased results, so highly scalable algorithms are needed.
2. **Ability to deal with different types of attributes**
e.g. graphs, sequences, images, and documents.
3. **Discovery of clusters with arbitrary shape**
a cluster could be of any shape, not only the spherical ones that distance-based methods tend to find.
4. **Requirements for domain knowledge to determine input parameters**
such parameters (e.g., the desired number of clusters) are often hard to determine, and demanding them burdens users and makes the clustering quality hard to control.
5. **Ability to deal with noisy data** (**outliers**)
Clustering methods need to be robust to noise.
6. **Incremental clustering and insensitivity to input order**
incremental update avoids recomputing the whole clustering from scratch; insensitivity to input order means that changing the order of the input objects does not change the output.
7. **Capability of clustering high-dimensionality data**
Finding clusters of data objects in a high-dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.
8. **Constraint-based clustering**
real applications may need to perform clustering under various kinds of constraints.
9. **Interpretability and usability**
It is important to study how an application goal may influence the selection of clustering features and clustering methods.
## Clustering Methods
### Aspects for Comparing Clustering Methods
1. **The partitioning criteria**
e.g. hierarchical or not
2. **Separation of clusters**
e.g. clusters are mutually exclusive or not
3. **Similarity measure**
e.g.
* distance-based
    - often take advantage of optimization techniques
    - e.g. distances in a Euclidean space, a road network, or a vector space
* connectivity-based, using density or contiguity
    - can often find clusters of arbitrary shape
4. **Clustering space**
search for clusters within the entire given data space or only within a subspace
>Subspace clustering discovers clusters and subspaces (often of low dimensionality) that manifest object similarity.
>[name=Book]
## Methods Overview

## Partitioning Methods
### k-Means: A Centroid-Based Technique
* goal: minimize $E=\sum_{i=1}^k \sum_{p \in C_i} \| p-c_i\|^2$
* iterative method (see the sketch below)
    1. choose $k$ objects as the initial cluster **means** $c_i$
    2. assign each object to the cluster $C_i$ whose mean $c_i$ is nearest, forming new clusters
    3. recompute the **mean** $c_i$ of each cluster $C_i$
    4. repeat 2–3 until the assignments no longer change

* drawback: easily distorted by extreme values (i.e. sensitivity to outliers)
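A minimal NumPy sketch of this iteration (the helper name `kmeans`, the random initialization, and the iteration cap are my own choices; library routines such as `sklearn.cluster.KMeans` add k-means++ initialization and multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial means: k random objects
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # update step: recompute each cluster mean (keep the old center if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                # means stopped moving -> converged
            break
        centers = new_centers
    return labels, centers
```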
### k-Medoids
* similar to k-means
* uses an actual representative object in each cluster, the **medoid** $o_i$, instead of the mean $c_i$
* minimizes the **absolute-error criterion** $E=\sum_{i=1}^k \sum_{p \in C_i} dist(p,o_i)$ instead of the squared-error criterion
#### Partitioning Around Medoids (PAM) Algorithm

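PAM repeatedly considers swapping a current medoid with a non-medoid object and keeps the swap whenever it lowers the absolute error. A brute-force sketch under that criterion (the helper name `pam` and the Euclidean distances are my assumptions; checking every swap is exactly what makes PAM too slow for large data sets, motivating CLARA and CLARANS below):

```python
import numpy as np

def pam(X, k, max_rounds=100, seed=0):
    """k-medoids via PAM-style swaps: greedily replace a medoid while the absolute error improves."""
    rng = np.random.default_rng(seed)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # pairwise distances
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def cost(meds):
        # absolute-error criterion: each object contributes its distance to the closest medoid
        return dist[:, meds].min(axis=1).sum()

    for _ in range(max_rounds):
        best_cost, best_meds = cost(medoids), None
        for i in range(k):                        # try every (medoid, non-medoid) swap
            for y in range(len(X)):
                if y in medoids:
                    continue
                candidate = medoids[:i] + [y] + medoids[i + 1:]
                c = cost(candidate)
                if c < best_cost:
                    best_cost, best_meds = c, candidate
        if best_meds is None:                     # no improving swap: local optimum reached
            break
        medoids = best_meds
    labels = dist[:, medoids].argmin(axis=1)      # assign each object to its nearest medoid
    return labels, medoids
```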
#### CLARA (Clustering LARge Applications)
* Uses a **random sample** of the data set.
* The PAM algorithm is then applied to compute the best medoids from the sample.
#### CLARANS (Clustering Large Applications based upon RANdomized Search)
A randomized refinement of CLARA:
1. Initial: randomly select $k$ objects in the data set as the current medoids
2. randomly select a current medoid $x$ and an object $y$ that is not a current medoid; replace $x$ by $y$ if the swap improves the absolute-error criterion
3. run the randomized search (step 2) $l$ times to reach a local optimum
4. repeat the randomized process (steps 2–3) $m$ times and return the best local optimum found
## Hierarchical Methods (to revise)
* grouping data objects into a hierarchy or “tree” of clusters.
* can encounter difficulties regarding the selection of merge or split points.
* A tree structure called a **dendrogram** is commonly used to represent the process of hierarchical clustering
e.g.

### Category
* **Algorithmic**
Consider data objects as **deterministic** and compute clusters according to the deterministic distances between objects
1. Agglomerative
2. Divisive
3. Multiphase
* **Probabilistic**
Use probabilistic models to capture clusters and measure the quality of clusters by the **fitness** of models
* **Bayesian**
- Compute a distribution of possible clusterings (not discussed in the book)
- Return a group of clustering structures and their probabilities, conditional on the given data
### Agglomerative vs. Divisive
| Agglomerative | Divisive |
| :--------: | :------: |
| bottom-up (merge)| top-down (split) |

### Distance Measures in Algorithmic Methods
1. Minimum : $dist_{min}(C_i,C_j)=\min\limits_{p \in C_i, p' \in C_j}{|p-p'|}$
* nearest-neighbor clustering algorithm
* single-linkage algorithm
* minimal spanning tree algorithm
2. Maximum : $dist_{max}(C_i,C_j)=\max\limits_{p \in C_i, p' \in C_j}{|p-p'|}$
* farthest-neighbor clustering algorithm
* complete-linkage algorithm
e.g.

>Measures 1 and 2 tend to be overly sensitive to outliers or noisy data.
>[name=Book]
3. Mean (Distance of Mean): $dist_{mean}(C_i,C_j)=|m_i−m_j|$
    * simplest to compute
    * the mean vector can be difficult or impossible to define for categoric data
4. Average : $dist_{avg}(C_i,C_j)= \frac{1}{|C_i||C_j|}\sum\limits_{p \in C_i, p' \in C_j}{|p-p'|}$
* advantageous in that it can handle categoric as well as numeric data
>Measures 3 and 4 are a compromise between 1 and 2 and overcome the outlier-sensitivity problem.
>[name=Book]
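These four measures map onto the standard linkage options in SciPy's hierarchical clustering; a minimal sketch on toy numeric data (Euclidean distances assumed; the `centroid` linkage corresponds to the distance-of-means measure):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).rand(30, 2)                 # toy data

Z_single   = linkage(X, method="single")                 # 1. minimum distance (single linkage)
Z_complete = linkage(X, method="complete")               # 2. maximum distance (complete linkage)
Z_centroid = linkage(X, method="centroid")               # 3. distance between cluster means
Z_average  = linkage(X, method="average")                # 4. average pairwise distance

labels = fcluster(Z_single, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
```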
### BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
* using **Clustering Feature Trees** (**CF-tree**)
* $CF=\langle n, LS, SS \rangle$, where $LS$ is the **linear sum** and $SS$ the **square sum** of the $n$ points in the subcluster
* CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering

#### CF-tree
* parameters
1. branching factor, B
maximum number of children per nonleaf node
2. threshold, T
maximum diameter of subclusters stored at the leaf nodes
* process
Phase 1
* Scans the database to build an initial in-memory CF-tree
* multilevel compression of the data
* preserve the data’s inherent clustering structure.
Phase 2
* applies a (selected) clustering algorithm to cluster the leaf nodes
* removes sparse clusters as outliers and groups dense clusters into larger ones.
* insertion and node splitting in the CF-tree are similar to those in the construction of B+-trees.
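A small sketch of the clustering feature itself, showing the additivity that makes the CF-tree work: the CF of a merged subcluster is just the sum of the children's CFs, and the centroid and radius can be derived from $\langle n, LS, SS \rangle$ without revisiting the raw points (here SS is taken as a scalar sum of squared norms, one common convention):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CF:
    """Clustering feature <n, LS, SS>; CFs of disjoint subclusters can simply be added."""
    n: int
    LS: np.ndarray    # linear sum of the points
    SS: float         # square sum of the points (sum of squared norms)

    def __add__(self, other):                    # CF additivity
        return CF(self.n + other.n, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.n

    def radius(self):
        # root of the average squared distance from members to the centroid, derived from <n, LS, SS>
        c = self.centroid()
        return float(np.sqrt(max(self.SS / self.n - c @ c, 0.0)))

def cf_of(points):
    points = np.asarray(points, dtype=float)
    return CF(len(points), points.sum(axis=0), float((points ** 2).sum()))
```

scikit-learn's `Birch` estimator exposes the same two knobs as the parameters `branching_factor` (B) and `threshold` (T).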
### Chameleon: Multiphase Hierarchical Clustering using Dynamic Modeling
* uses **dynamic modeling** to determine the similarity between pairs of clusters
* based on (1) how well connected objects are within a cluster and (2) the proximity of clusters.
* doesn't depend on a static, user-supplied model

>partition the $k$-nearest-neighbor graph into a large number of relatively small subclusters such that it minimizes the edge cut. That is, a cluster $C$ is partitioned into subclusters $C_i$ and $C_j$ so as to minimize the weight of the edges that would be cut should $C$ be bisected into $C_i$ and $C_j$ .
>[name=Book]
* **Relative Interconnectivity**
the absolute interconnectivity between $C_i$ and $C_j$, normalized by the internal interconnectivity of the two clusters
$RI(C_i,C_j)=\frac{|EC_{ \{ C_i,C_j \} }|}{\frac{1}{2}(|EC_{C_i}|+|EC_{C_j}|)}$
$EC_{ \{ C_i,C_j \}}$: edge cut as previously defined for a cluster containing both $C_i$ and $C_j$
$EC_{C_i}$: minimum sum of the cut edges that partition $C_i$ into two roughly equal parts.
* **Relative Closeness**
the absolute closeness between $C_i$ and $C_j$, normalized by the internal closeness of the two clusters
$RC(C_i,C_j)=\displaystyle \frac{\bar S_{EC_{ \{ C_i,C_j \} }}}{\frac{|C_i|}{|C_i|+|C_j|}\bar S_{EC_{ C_i } }+\frac{|C_j|}{|C_i|+|C_j|}\bar S_{EC_{ C_j } }}$
$\bar S_{EC_{ \{ C_i,C_j \}}}$: average weight of the edges that connect vertices in $C_i$ to vertices in $C_j$
$\bar S_{EC_{ C_i }}$: average weight of the edges that belong to the min-cut bisector of cluster $C_i$
### Probabilistic Hierarchical Clustering
#### generative model
* Given probability distribution $D_p(\Omega)$ with parameter set $\Omega$.
The task of learning the generative model is to find the parameter values $\Omega$ that maximize the likelihood that $X$ is generated by the model: $L(D_p(\Omega):X)=P(X|\Omega)=\prod\limits_{i=1}^n P(x_i|\Omega)$
* the "**quality of a clustering**" $\{C_1,\dots,C_m\}$ can be measured by the maximum likelihood of its clusters
    - $Q(\{C_1,...,C_m\})=\prod\limits_{i=1}^m P(C_i)$
* change of $Q$ after merging $C_{j_1}$ and $C_{j_2}$
    - $\prod\limits_{i=1}^m P(C_i)\left(\frac{P(C_{j_1} \cup C_{j_2})}{P(C_{j_1})P(C_{j_2})}-1\right)$
* **distance**:
$dist(C_i,C_j)=-\log \frac{P(C_i \cup C_j)}{P(C_i)P(C_j)}$
e.g.
(a) should merge
(b), (c) should not merge

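A minimal sketch of this merge distance under an assumed one-dimensional Gaussian generative model (the book leaves the distribution family open; the Gaussian choice, the helper names, and the small variance floor are my assumptions):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(C):
    """log P(C | Omega) with Omega set to the MLE Gaussian fitted to C (1-D numeric data)."""
    C = np.asarray(C, dtype=float)
    mu, sigma = C.mean(), C.std() + 1e-9          # MLE parameters, with a floor to avoid sigma = 0
    return norm.logpdf(C, loc=mu, scale=sigma).sum()

def merge_distance(Ci, Cj):
    """dist(Ci, Cj) = -log( P(Ci u Cj) / (P(Ci) P(Cj)) ), computed in log space."""
    merged = np.concatenate([Ci, Cj])
    return -(log_likelihood(merged) - log_likelihood(Ci) - log_likelihood(Cj))

# two tight, well-separated groups give a large positive distance, i.e. they should not be merged
print(merge_distance([1.0, 1.1, 0.9], [10.0, 10.2, 9.8]))
```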
#### Algorithm

## Density-Based Methods
### DBSCAN: Density-Based Spatial Clustering of Applications with Noise
* Parameters:
- $\epsilon$: Max radius of neighborhood
- $MinPts$: Min number of points in the $\epsilon$-neighborhood of a point for it to be a **core object**
* $N_\epsilon(p):=\{q \in D ~ : ~ \|p-q \| \leq \epsilon \}$
* Directly density-reachable:
$q$ is directly density-reachable from $p$ $w.r.t.$ $~\epsilon, MinPts$ if
- $q \in N_\epsilon(p)$
- $|N_\epsilon(p)| \geq MinPts$

* Drawbacks of DBSCAN: the user needs to choose $\epsilon$ and $MinPts$, which largely affect the result.
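With scikit-learn the two parameters map directly onto `eps` and `min_samples`; a minimal sketch on random toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)          # toy data

db = DBSCAN(eps=0.1, min_samples=5).fit(X)         # eps = epsilon, min_samples = MinPts
labels = db.labels_                                # cluster id per point; -1 marks noise/outliers
core_idx = db.core_sample_indices_                 # core objects, i.e. |N_eps(p)| >= MinPts
```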
### OPTICS: Ordering Points to Identify the Clustering Structure
OPTICS does not explicitly produce a data set clustering. Instead, it outputs a **cluster ordering**.
#### parameters
* **core-distance** $\epsilon'$:
$\epsilon'$-neighborhood of $p$ has at least $MinPts$ objects.
$\epsilon'$ is the minimum distance threshold that makes $p$ a core object.
* **reachability-distance** of $p$ from $q$:
$\max\{core\_distance(q),dist(p,q)\}$
the minimum radius value that makes $p$ density-reachable from $q$.
e.g.

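scikit-learn's `OPTICS` exposes the cluster ordering plus the two distances defined above; a minimal sketch (valleys in the reachability plot correspond to clusters):

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(0).rand(200, 2)          # toy data

opt = OPTICS(min_samples=5).fit(X)
order        = opt.ordering_                       # the cluster ordering OPTICS outputs
reachability = opt.reachability_[order]            # reachability-distance along that ordering
core_dist    = opt.core_distances_                 # core-distance of each object
```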
### Problem of DBSCAN & OPTICS
* Both struggle when the density changes significantly as the radius increases by only a small amount (i.e., clusters of widely varying density).

>[name=Omnom][color=#00a000][ARCADE: Accurate Recognition of Clusters Across Densities](http://www.chadsteel.com/pubs/ARCADE.pdf)
### DENCLUE : DENsity-based CLUstEring
1. Clustering Based on Density Distribution Functions
2. can be regarded as a generalization of several well-known clustering methods
3. robust against noise
* density estimation:
the estimation of an unobservable underlying probability density function based on a set of observed data
* **kernel density estimation**
nonparametric density estimation approach from statistics
[ref1](https://chaima.pixnet.net/blog/post/224125704-《數據分析》【專有名詞】核密度估計%28kernel-d)、[ref2](https://www.itread01.com/content/1549073003.html)
estimated function: $\hat f_h(\boldsymbol x)=\frac{1}{nh}\sum\limits_{i=1}^nK( \frac{ \boldsymbol x- \boldsymbol{x_i}}{h} )$
$K()$ is a kernel (distribution function)
$h$ is the bandwidth serving as a smoothing parameter.
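A minimal one-dimensional sketch of the estimator $\hat f_h$ with a Gaussian kernel (the bandwidth value and the toy data are arbitrary choices of mine):

```python
import numpy as np

def gaussian_kde_1d(x, data, h):
    """Kernel density estimate f_h(x) = (1 / n h) * sum_i K((x - x_i) / h), with Gaussian K."""
    u = (x - data[:, None]) / h                         # (x - x_i) / h for every sample x_i
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)      # Gaussian kernel
    return K.sum(axis=0) / (len(data) * h)

data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 0.5, 100)])
grid = np.linspace(-3, 8, 500)
density = gaussian_kde_1d(grid, data, h=0.3)            # h is the smoothing bandwidth
```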
#### Algorithm
* Goal: find a **density attractor** $\boldsymbol x^*$, a local maximum of $\hat f_h(\boldsymbol x)$, $s.t.$ $\hat f_h(\boldsymbol x^*) \geq \xi$ (noise threshold)
* Procedure:
$\boldsymbol x^0 = \boldsymbol x$
$\displaystyle \boldsymbol x^{j+1}=\boldsymbol x^j+\delta \frac{\nabla \hat f( \boldsymbol x^j)}{|\nabla \hat f( \boldsymbol x^j)|}$
$\displaystyle \nabla \hat f_h( \boldsymbol x)=\frac{1}{h^{d+2}n}\sum_{i=1}^nK\!\left(\frac{\boldsymbol x- \boldsymbol x_i}{h}\right)(\boldsymbol x_i- \boldsymbol x)$ (for a Gaussian kernel $K$)
* Termination:
$\hat f_h(\boldsymbol x^{k+1} )\leq \hat f_h(\boldsymbol x^k)$
then $\boldsymbol x^*=\boldsymbol x^k$
* Justification
$\boldsymbol x^*$ and the objects that climb to it are treated as outliers or noise if $\hat f_h(\boldsymbol x^*) < \xi$
By using multiple density attractors connected by paths, DENCLUE can find clusters of arbitrary shape.
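A sketch of the hill-climbing step under a Gaussian kernel (the kernel's normalizing constant is dropped, which only rescales $\hat f_h$ and hence $\xi$; the step size `delta`, the threshold `xi`, and the helper names are my assumptions):

```python
import numpy as np

def f_hat(x, data, h):
    """Unnormalized Gaussian-kernel density estimate at x (x: shape (d,), data: shape (n, d))."""
    u = (x - data) / h
    return np.exp(-0.5 * (u ** 2).sum(axis=1)).sum() / (len(data) * h ** data.shape[1])

def grad_f_hat(x, data, h):
    """Gradient of f_hat: (1 / (h^(d+2) n)) * sum_i K((x - x_i)/h) (x_i - x)."""
    u = (x - data) / h
    w = np.exp(-0.5 * (u ** 2).sum(axis=1))
    d = data.shape[1]
    return (w[:, None] * (data - x)).sum(axis=0) / (len(data) * h ** (d + 2))

def climb(x, data, h, delta=0.05, xi=1e-3, max_steps=200):
    """Hill-climb from x toward its density attractor x*; return None if x* falls below xi."""
    for _ in range(max_steps):
        g = grad_f_hat(x, data, h)
        step = delta * g / (np.linalg.norm(g) + 1e-12)   # move delta along the gradient direction
        if f_hat(x + step, data, h) <= f_hat(x, data, h):
            break                                        # termination: density stopped increasing
        x = x + step
    return x if f_hat(x, data, h) >= xi else None        # below the noise threshold -> outlier
```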
## Grid-Based Methods
* **space-driven**
* partitioning the embedding space into cells independent of the distribution of the input objects.
* pro: fast processing time, which is typically independent of the number of data objects
### STING: STatistical INformation Grid
Statistical information regarding the attributes in each grid cell is precomputed and stored as statistical parameters.
Each cell at a high level is partitioned to form a number of cells at the next lower level.

The type of distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process.
* Algorithm
**Iteration**
1. For each cell in the current layer, we compute the confidence interval (or estimated probability range) reflecting the cell’s relevancy to the given query.
2. The irrelevant cells are removed from further consideration.
3. Processing of the next lower level examines the remaining relevant cells.
**Termination**
until the bottom layer is reached.
quality of STING clustering depends on the granularity of the lowest level of the grid
* Pros
1. computation is **query-independent**
2. grid structure facilitates **parallel** processing and incremental updating
3. method’s efficiency: $O(n)$ to scan the data and compute the cell statistics, plus $O(g)$ per query
$g$ : total number of grid cells at the lowest level (usually $g \ll n$)
* Cons
1. if the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis.
2. does not consider the spatial relationship between the children and their neighboring cells for construction of a parent cell
### CLIQUE: CLustering In QUEst
* An **Apriori-like** Subspace Clustering Method
* Finds density-based clusters in subspaces.
* Uses a **density threshold** to identify dense cells and sparse ones.
* The main strategy behind CLIQUE for identifying a candidate search space uses the monotonicity of dense cells with respect to dimensionality: a $k$-dimensional cell can be dense only if every one of its $(k-1)$-dimensional projections is dense, so candidate dense cells are generated Apriori-style from lower-dimensional dense cells.

## Evaluation of Clustering
### Assessing Clustering Tendency
task: determine whether a given data set has a non-random structure, to avoid a misleading cluster analysis
* **Hopkins Statistic**
tests spatial randomness of a variable as distributed in a space.

>The book's description seems to differ from the one on [Wikipedia](https://en.wikipedia.org/wiki/Hopkins_statistic).
>[name=Omnom][color=#00a000]
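A sketch following the book's convention, $H=\frac{\sum y_i}{\sum x_i + \sum y_i}$: values near 0.5 suggest uniformly distributed (hard to cluster) data, and values near 0 suggest highly skewed, clusterable data. The Wikipedia article swaps the roles of the two sums, which is the discrepancy noted above. The sample size `n`, the helper name, and the use of a k-d tree are my choices:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=0):
    """Hopkins statistic H = sum(y) / (sum(x) + sum(y)) over n sampled points."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    tree = cKDTree(D)
    lo, hi = D.min(axis=0), D.max(axis=0)

    # x_i: nearest-neighbor distance in D from a point sampled uniformly over the data space
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x, _ = tree.query(P, k=1)

    # y_i: distance from a sampled data object to its nearest *other* object in D
    idx = rng.choice(len(D), size=n, replace=False)
    y2, _ = tree.query(D[idx], k=2)
    y = y2[:, 1]                                   # k=1 would return the object itself (distance 0)

    return y.sum() / (x.sum() + y.sum())
```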
### Determining the Number of Clusters
task: estimate the number of clusters, ideally even before a clustering algorithm is used to derive detailed clusters.
>Determining the number of clusters is far from easy, often because the “right” number is ambiguous.
>[name=Book]
#### Methods
1. a simple rule of thumb: number of clusters $\approx \sqrt \frac{n}{2}$ (so each cluster has roughly $\sqrt{2n}$ points)
2. **elbow method** (see the sketch after this list)
increasing the number of clusters reduces the sum of within-cluster variances;
use the **turning point** in the curve of the sum of within-cluster variances plotted against the number of clusters.
3. **cross-validation**, a technique often used in classification
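A minimal sketch of the elbow heuristic using scikit-learn's `inertia_` (the within-cluster sum of squares); the toy data and the range of $k$ are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(300, 2)          # toy data

# sum of within-cluster squared distances for increasing k; plot it and look for the turning point
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]
```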
### Measuring Clustering Quality
task: assess how good the resulting clusters are.
* **Extrinsic** Methods
compare the clustering against the ground truth and measure how well the clustering matches it (also called supervised methods)
#### essential criteria
- **Cluster homogeneity**
the purer the clusters in a clustering are, the better the clustering
$e.g.$ ideally, each cluster contains objects of exactly one category (label)
- **Cluster completeness**
objects of the same category should be assigned to the same cluster.
- **Rag bag**
+ category containing objects that cannot be merged with other objects.
+ $i.e.$ "miscellaneous", "other"
+ putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag.
- **Small cluster preservation**
splitting a small category into pieces is more harmful than splitting a large category into pieces
#### $e.g.$ BCubed precision and recall metrics
$L(\boldsymbol o)$: ground-truth category (label) of $\boldsymbol o$; $C(\boldsymbol o)$: cluster to which $\boldsymbol o$ is assigned
Correctness($\boldsymbol{o_i,o_j}$)=bool$(L(\boldsymbol o_i)=L(\boldsymbol o_j) \Leftrightarrow C(\boldsymbol o_i)=C(\boldsymbol o_j))$
Precision BCubed=$\displaystyle\frac{\displaystyle\sum\limits_{i=1}^n\frac{\sum\limits_{\boldsymbol o_j|i \neq j, C(\boldsymbol o_i)=C(\boldsymbol o_j)} Correctness(\boldsymbol{o_i,o_j}) }{\| \{\boldsymbol o_j|i \neq j, C(\boldsymbol o_i)=C(\boldsymbol o_j) \} \|}}{n}$
BCubed recall =$\displaystyle\frac{\displaystyle\sum\limits_{i=1}^n\frac{\sum\limits_{\boldsymbol o_j|i \neq j, L(\boldsymbol o_i)=L(\boldsymbol o_j)} Correctness(\boldsymbol{o_i,o_j}) }{\| \{\boldsymbol o_j|i \neq j, L(\boldsymbol o_i)=L(\boldsymbol o_j) \} \|}}{n}$
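A direct $O(n^2)$ sketch of the two metrics above, where `labels_true` plays the role of $L$ and `labels_pred` the role of $C$ (singleton clusters/categories simply contribute zero here, one common convention):

```python
def bcubed(labels_true, labels_pred):
    """BCubed precision and recall; L = ground-truth category, C = assigned cluster."""
    n = len(labels_true)
    precision = recall = 0.0
    for i in range(n):
        same_cluster  = [j for j in range(n) if j != i and labels_pred[j] == labels_pred[i]]
        same_category = [j for j in range(n) if j != i and labels_true[j] == labels_true[i]]
        if same_cluster:     # average Correctness over objects sharing o_i's cluster
            precision += sum(labels_true[j] == labels_true[i] for j in same_cluster) / len(same_cluster)
        if same_category:    # average Correctness over objects sharing o_i's category
            recall += sum(labels_pred[j] == labels_pred[i] for j in same_category) / len(same_category)
    return precision / n, recall / n

print(bcubed(["a", "a", "b", "b"], [0, 0, 0, 1]))   # both precision and recall are below 1 here
```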
* **Intrinsic** Methods:
evaluate a clustering by examining how well the clusters are separated and how compact the clusters are
#### $e.g.$ silhouette coefficient $s$
for an object $\boldsymbol o \in C_i$:
$\displaystyle a(\boldsymbol o)=\frac{\sum_{\boldsymbol o' \in C_i, \boldsymbol o \neq \boldsymbol o'} \| \boldsymbol o- \boldsymbol o'\|}{|C_i|-1}$
$\displaystyle b(\boldsymbol o)= \min\limits_{C_j:1 \leq j \leq k, j \neq i} \{\frac{\sum_{\boldsymbol o' \in C_j} \| \boldsymbol o- \boldsymbol o'\|}{|C_j|}\}$
$s(\boldsymbol o)=\frac{b(\boldsymbol o)-a(\boldsymbol o)}{\max \{ a(\boldsymbol o),b(\boldsymbol o)\}}$
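scikit-learn implements both the per-object value $s(\boldsymbol o)$ and its average over the data set; a minimal sketch (the toy data and $k=3$ are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.RandomState(0).rand(300, 2)                      # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_per_object = silhouette_samples(X, labels)   # s(o) for every object, in [-1, 1]
s_overall    = silhouette_score(X, labels)     # average silhouette over all objects
# s(o) near 1: o is compact and far from other clusters; s(o) <= 0: o is closer to another cluster
```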