# Chp 11 Advanced Cluster Analysis
###### tags: `Data Mining 心得`
## Probabilistic Model-Based Clustering
### Fuzzy Clustering
* $a.k.a.$ **soft clustering**
Allows an object to belong to more than one cluster.
>[name=Omnom][color=#00a000]A fuzzy set is defined through a likelihood-like notion of the "possibility" that an element belongs to the set.
>
>That possibility is determined by the membership function associated with each fuzzy set.
>
>For example:
>$A=\{a: a =0\}$
>$membership ~ function : p(x \in A) = \frac{1}{|x|+1}$
>$\rightarrow p(3 \in A)=\frac{1}{4}$
* **Terminologies**
- objects: $o_1,...,o_n$
- $k$ fuzzy clusters: $C_1,...,C_k$
- partition matrix $M_{n \times k}=[w_{ij}]$
- $w_{ij}$: membership degree of $o_i$ in fuzzy cluster $C_j$
$0\leq w_{ij} \leq 1, \forall i,j$
$\sum\limits_{j=1}^k w_{ij}=1, \forall i$
#### Measuring how well a fuzzy clustering fits a data set
>[name=Omnom][color=#00a000]The textbook does not say explicitly how the center $c_j$ of a fuzzy cluster is computed.
>Following [Wikipedia](https://en.wikipedia.org/wiki/Fuzzy_clustering), the centroid here is presumably a weighted average of the objects with $w_{ij}$ as weights.
* **Sum of the squared error (SSE)** of clustering $\mathcal{C}$
$SSE(\mathcal{C})=\sum\limits_{i=1}^n\sum\limits_{j=1}^k w_{ij}^p \|o_i-c_j\|^2$
Controllable parameter $p \in \mathbb{N}$ governs how strongly the membership degrees influence the error
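A minimal sketch (my own, using NumPy and the Wikipedia-style weighted-average centroid mentioned in the comment above) of how $c_j$ and $SSE(\mathcal{C})$ could be computed from a partition matrix:

```python
import numpy as np

def fuzzy_sse(X, W, p=2):
    """X: (n, d) objects, W: (n, k) partition matrix, p: fuzzifier.

    Centroids are the w_ij^p-weighted averages of the objects
    (the Wikipedia-style choice; the textbook leaves c_j unspecified).
    Returns (centroids, SSE) using squared Euclidean distances.
    """
    Wp = W ** p                                          # (n, k)
    centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]     # (k, d)
    # squared distance of every object to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return centroids, float((Wp * d2).sum())

# toy example: 4 points, 2 fuzzy clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
W = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(fuzzy_sse(X, W))
```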
### Probabilistic Model-Based Clusters
* $k$ probabilistic clusters, $C_1,...,C_k$
with probability density functions $f_1,...,f_k$
with probabilities, $\omega_1,...,\omega_k$
* The probability that $o_i$ is generated by the set of clusters $\boldsymbol C$
$P(o_i|\boldsymbol C)=\sum\limits_{j=1}^k \omega_jf_j(o_i)$
* For independently generated objects
$P(D|\boldsymbol C)=\prod\limits_{i=1}^n P(o_i|\boldsymbol C)$
>[name=Book]To make probabilistic model-based clusters computationally feasible, we often compromise by assuming that the probability density functions are parameterized distributions.
Let $\Theta_1,...,\Theta_k$ be the parameters of the $k$ distributions
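A small sketch of the mixture likelihood above, assuming 1-D Gaussian components as the parameterized distributions (the component choice, function names, and toy numbers are mine):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian, used here as the parameterized f_j."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def log_likelihood(data, weights, mus, sigmas):
    """log P(D|C) = sum_i log( sum_j omega_j * f_j(o_i) )."""
    data = np.asarray(data, dtype=float)
    dens = np.stack([gauss_pdf(data, m, s) for m, s in zip(mus, sigmas)], axis=1)  # (n, k)
    return float(np.log(dens @ np.asarray(weights)).sum())

print(log_likelihood([1.0, 1.2, 7.9, 8.1],
                     weights=[0.5, 0.5], mus=[1.0, 8.0], sigmas=[0.5, 0.5]))
```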
### Expectation-Maximization Algorithm
>[name=Omnom][color=#00a000][Wikipedia: fuzzy c-means](https://en.wikipedia.org/wiki/Fuzzy_clustering)
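As a rough sketch of the EM idea for fuzzy clustering: the fuzzy c-means iteration described on the linked Wikipedia page alternates an E-step (recompute the partition matrix from the current centroids) and an M-step (recompute centroids from the memberships). All names, defaults, and toy data below are my own:

```python
import numpy as np

def fuzzy_cmeans(X, k, p=2, n_iter=50, seed=0):
    """EM-style iteration for fuzzy clustering (fuzzy c-means flavour)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # E-step: membership ~ inverse squared distance, normalized per object
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = 1.0 / d2 ** (1.0 / (p - 1))
        W = inv / inv.sum(axis=1, keepdims=True)
        # M-step: centroids are the w_ij^p-weighted means of the objects
        Wp = W ** p
        centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
    return W, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
W, C = fuzzy_cmeans(X, k=2)
print(np.round(W, 2), np.round(C, 2))
```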
## Clustering High-Dimensional Data
### Problems, Challenges, and Major Methodologies
#### Problems
[curse of dimensionality](https://zh.wikipedia.org/wiki/维数灾难)
>[name=Book]clusters hidden in high-dimensional data are often significantly smaller.
>$\rightarrow$ an exponential number of possible subspaces or dimensionality reduction options
>[name=Omnom][color=#00a000]In high dimensions, a change along any single dimension may have only a small effect on the overall evaluation ($e.g.$ a distance calculation),
>and some measures may even lose their meaning in high dimensions.
>
>Using the textbook's example:
>although the pairwise "distances" between the three people are all $\sqrt 2$, in terms of "whether they bought any item in common", Ada and Bob should in principle be the least similar.
#### Major Methodologies
1. Subspace clustering
2. Dimensionality reduction
### Subspace Clustering Methods
#### Strategies
1. Bottom-up:
* Start from low-dimensional subspaces and search higher-dimensional subspaces
* $e.g.$ CLIQUE
2. Top-down:
* Effective only if the locality assumption holds
$i.e.$ Subspace of a cluster can be determined by the local neighborhood.
* $e.g.$ PROCLUS: k-medoid-like method
#### Correlation-Based Clustering
Discover clusters that are defined by advanced correlation models.
#### Biclustering
Cluster objects and attributes simultaneously: given the set of objects $A$ and the set of attributes $B$,
a bicluster is a submatrix $I \times J$ with $I \subseteq A, J \subseteq B$
* **Requirements**:
1. A cluster involves only a small number of objects and attributes
2. An object or attribute may participate in several biclusters, or in none (the object/attribute-to-cluster mapping is neither onto nor one-to-one)
* **Application**
First proposed to address the need to analyze gene expression data
* **Types**
- **constant** values
1. on rows
2. on columns
3. (1 & 2)
- **coherent** values
1. evolution on rows

2. evolution on columns
3. pattern-based cluster (rows/columns shift by the same differences)

* **Method**
- Optimization-based using **$\delta$-Cluster Algorithm**
+ Terminology
+ mean of $i^{th}$ row:
$e_{iJ}=\frac{1}{|J|}\sum\limits_{j \in J} e_{ij}$
+ mean of $j^{th}$ column:
$e_{Ij}=\frac{1}{|I|}\sum\limits_{i \in I} e_{ij}$
+ mean in matrix:
$e_{IJ}=\frac{1}{|I||J|}\sum\limits_{i \in I,j \in J} e_{ij}$
+ **Residue** on each element:
$residue(e_{ij})=e_{ij}-e_{iJ}-e_{Ij}+e_{IJ}$
+ quality of the submatrix as a bicluster:
$H(I \times J)= \frac{1}{|I||J|}\sum\limits_{i \in I,j \in J}residue(e_{ij})^2$
+ $\delta$-bicluster:
$H(I \times J) \leq \delta$, for a given threshold $\delta \geq 0$
+ maximal $\delta$-bicluster:
$\nexists ~I' \supseteq I, J' \supseteq J$ with $I' \times J' \neq I \times J ~~s.t.~~H(I' \times J') \leq \delta$
+ **Heuristic greedy search** method to obtain a local optimal cluster
1. **Deletion phase**
Remove row or column of largest mean-squared residue until $\delta$-cluster is obtained
2. **Addition phase**
After deletion, expand $\delta$-cluster by adding row or column of the smallest mean-squared residue.
- Enumeration using **MaPle**
+ Terminology
- **p-score** of a $2 \times 2$ submatrix $\begin{pmatrix} e_{i_1j_1} & e_{i_1j_2} \\ e_{i_2j_1} & e_{i_2j_2}\end{pmatrix}$ (see the sketch after this list):
$|(e_{i_1j_1}+e_{i_2j_2})-(e_{i_1j_2}+e_{i_2j_1})|$
- **$\delta$-pCluster**:
$\forall ~2 \times 2$ submatrices, $p\_score \leq \delta$
+ **MaPle**
- **Monotonicity** of $\delta$-pCluster:
Submatrices of $\delta$-pCluster are also $\delta$-pClusters
- Systematically enumerates every combination of conditions using a set **enumeration tree** and a **DFS**.
- Enumerate by **pattern-growth methods** for frequent pattern mining
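A small sketch computing the two bicluster quality measures defined above: the mean-squared residue $H(I \times J)$ and a brute-force $\delta$-pCluster check via p-scores (the helper names and toy matrix are mine):

```python
import numpy as np

def mean_squared_residue(E):
    """H(I x J) for a submatrix E (rows I, columns J of the expression matrix)."""
    row_mean = E.mean(axis=1, keepdims=True)   # e_iJ
    col_mean = E.mean(axis=0, keepdims=True)   # e_Ij
    all_mean = E.mean()                        # e_IJ
    residue = E - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

def is_delta_pcluster(E, delta):
    """Check the p-score of every 2x2 submatrix of E against delta."""
    n, m = E.shape
    for i1 in range(n):
        for i2 in range(i1 + 1, n):
            for j1 in range(m):
                for j2 in range(j1 + 1, m):
                    p = abs((E[i1, j1] + E[i2, j2]) - (E[i1, j2] + E[i2, j1]))
                    if p > delta:
                        return False
    return True

# a perfect shifting pattern: rows differ by constants -> H = 0, all p-scores = 0
E = np.array([[1.0, 3.0, 6.0],
              [4.0, 6.0, 9.0],
              [0.0, 2.0, 5.0]])
print(mean_squared_residue(E), is_delta_pcluster(E, delta=0.0))
```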
### Dimensionality Reduction Methods and Spectral Clustering
#### Ng-Jordan-Weiss algorithm
1. Calculate **Affinity matrix, $W$**
$w_{ij}= e^{- \frac{\| \boldsymbol{o_i}-\boldsymbol{o_j}\|^2}{\sigma ^2}}$
* $\sigma$: scaling parameter
* In this algorithm, $w_{ii}=0$
2. Calculate **matrix $A$**
Diagonal matrix $D$ where $d_{ii}=\sum\limits_{j=1}^n w_{ij}$
$A=D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$
3. **Project** the original data into the new space using the $k$ leading eigenvectors $\{v_1,...,v_k\}$ of $A$ (those with the largest eigenvalues).
Renormalize each row of $V=[v_1,...,v_k] \in \mathbb R^{n \times k}$ to unit length, giving a new matrix $V'$
4. Use a **clustering** method ($e.g.$ $k$-means) to cluster the rows of $V'$
$\boldsymbol o_i$ belongs to the $j^{th}$ cluster iff the $i^{th}$ row of $V'$ is assigned to the $j^{th}$ cluster
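A compact sketch of the four steps above, assuming NumPy/scikit-learn, a Gaussian affinity on squared distances, and $k$-means in the final step (the `sigma` default and the `KMeans` settings are my choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral(X, k, sigma=1.0):
    """Sketch of the Ng-Jordan-Weiss steps on data X of shape (n, d)."""
    # 1. affinity matrix with zero diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    # 2. A = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    A = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # 3. k leading eigenvectors, rows renormalized to unit length
    _, eigvecs = np.linalg.eigh(A)             # eigenvalues in ascending order
    V = eigvecs[:, -k:]                        # k largest
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    # 4. cluster the rows of V'; row i's label is object o_i's cluster
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
print(njw_spectral(X, k=2))
```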
## Clustering Graph and Network Data
### Similarity Measures
#### Geodesic Distance
* Def: the number of edges on a shortest path between the two vertices
* **eccentricity** of vertex $v$: $eccen(v)= \max\limits_{u \in V} d(u,v)$, where $d$ is the geodesic distance
* **radius** of graph $G$: $r= \min\limits_{v \in V} eccen(v)$
* **diameter** of graph $G$: $d= \max\limits_{v \in V} eccen(v)$
* **peripheral vertex**: $\arg\max\limits_{v \in V} eccen(v)$
* $e.g.$

>[name=Book]Consider graph $G$ above. The eccentricity of a is 2, that is, $eccen(a) = 2$, $eccen(b) = 2$, and $eccen(c) = eccen(d) = eccen(e) = 3$. Thus, the radius of $G$ is 2, and the diameter is 3. Note that it is not necessary that $d = 2 × r$. Vertices $c$, $d$, and $e$ are peripheral vertices.
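A small BFS sketch of these definitions. The edge set below is assumed (the book's figure is not reproduced here) but chosen so the eccentricities match the quoted example:

```python
from collections import deque

def geodesic_distances(adj, source):
    """BFS distances (number of edges) from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccen(adj, v):
    return max(geodesic_distances(adj, v).values())

# assumed undirected toy graph reproducing the quoted eccentricities
adj = {
    'a': ['b', 'c'],
    'b': ['a', 'd', 'e'],
    'c': ['a'],
    'd': ['b'],
    'e': ['b'],
}
ecc = {v: eccen(adj, v) for v in adj}
radius, diameter = min(ecc.values()), max(ecc.values())
peripheral = [v for v, e in ecc.items() if e == diameter]
print(ecc, radius, diameter, peripheral)   # eccen(a)=eccen(b)=2, r=2, d=3, peripheral c,d,e
```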
#### SimRank: Similarity Based on Random Walk and Structural Context
>[name=Book]The intuition behind similarity based on structural context is that **two vertices in a graph are similar if they are connected to similar vertices**.
1. Based on **Structural Context**
**individual in-neighborhood** of $v$: $I(v)=\{u|(u,v) \in E\}$
$s(u,v)=\frac{C}{|I(u)||I(v)|}\sum\limits_{x \in I(u)}\sum\limits_{y \in I(v)}s(x,y)$
$s(v,v)=1$
$s(u,v)=0$ if $|I(u)||I(v)|=0$
$C \in (0,1)$: decay rate
* iteratively evaluate
$s_0(u,v)=\begin{cases}0 & \text{if } u \neq v\\1 & \text{if } u = v\end{cases}$
$s_{i+1}(u,v)=\frac{C}{|I(u)||I(v)|}\sum\limits_{x \in I(u)}\sum\limits_{y \in I(v)}s_i(x,y)$
2. Based on **Random Walk**
* **individual out-neighborhood** of $v$: $O(v)=\{w|(v,w) \in E\}$
* **expected distance** from $u$ to $v$: $d(u,v)=\sum\limits_{t} P[t]\,l(t)$, summing over tours $t$ from $u$ to $v$
$l(t)$: length of tour $t$
$P[t]=\begin{cases}\prod\limits_{x \in t}\frac{1}{|O(x)|} & \text{if } l(t)>0\\0 & \text{otherwise}\end{cases}$
* **expected meeting distance**: $m(u,v)=\sum\limits_{t:\ \text{walks from } u,v \text{ meet at some } x}P[t]\,l(t)$
* **expected meeting probability**: $p(u,v)=\sum\limits_{t:\ \text{walks from } u,v \text{ meet at some } x}P[t]\,C^{l(t)}$
[Supplementary notes](https://blog.csdn.net/u013527419/article/details/51943785)
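A sketch of the iterative structural-context computation $s_i \rightarrow s_{i+1}$ above on a tiny directed graph (the graph, decay rate, and iteration count are mine):

```python
import itertools

def simrank(edges, C=0.8, n_iter=10):
    """Iterative SimRank on a directed graph given as a list of (u, v) edges."""
    nodes = sorted({x for e in edges for x in e})
    in_nb = {v: [u for (u, w) in edges if w == v] for v in nodes}
    # s_0(u, v) = 1 if u == v else 0
    s = {(u, v): float(u == v) for u, v in itertools.product(nodes, nodes)}
    for _ in range(n_iter):
        s_next = {}
        for u, v in itertools.product(nodes, nodes):
            if u == v:
                s_next[(u, v)] = 1.0
            elif not in_nb[u] or not in_nb[v]:
                s_next[(u, v)] = 0.0
            else:
                total = sum(s[(x, y)] for x in in_nb[u] for y in in_nb[v])
                s_next[(u, v)] = C * total / (len(in_nb[u]) * len(in_nb[v]))
        s = s_next
    return s

# toy citation-style graph: a and b are cited by the same vertices, c is not
edges = [('p1', 'a'), ('p1', 'b'), ('p2', 'a'), ('p2', 'b'), ('p3', 'c')]
sim = simrank(edges)
print(round(sim[('a', 'b')], 3), round(sim[('a', 'c')], 3))
```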
### Graph Clustering Methods
#### Challenges
* High computational cost
* Sophisticated graphs
* High dimensionality
* Sparsity
#### Terminology
* **sparsity of a cut** $C = (S,T)$: $\Phi=\frac{\text{cut size}}{\min\{|S|,|T|\}}$, where the cut size is the number of edges crossing between $S$ and $T$
* **modularity** of a $k$-partition-clustering: $Q=\sum\limits_{i=1}^k (\frac{|E_i|}{|E_{total}|}-\left(\frac{\sum\limits_{v\in C_i} deg(v)}{2|E_{total}|}\right)^2)$
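A small sketch of the modularity formula above for an undirected graph given as an edge list (the helper name and toy graph are mine):

```python
def modularity(edges, clusters):
    """Q = sum_i ( |E_i|/|E| - (sum_{v in C_i} deg(v) / (2|E|))^2 )."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for cluster in clusters:
        inside = sum(1 for u, v in edges if u in cluster and v in cluster)
        deg_sum = sum(deg.get(v, 0) for v in cluster)
        q += inside / m - (deg_sum / (2 * m)) ** 2
    return q

# two triangles joined by a single bridge edge
edges = [('a', 'b'), ('b', 'c'), ('a', 'c'),
         ('d', 'e'), ('e', 'f'), ('d', 'f'),
         ('c', 'd')]
print(modularity(edges, [{'a', 'b', 'c'}, {'d', 'e', 'f'}]))   # good split -> Q > 0
print(modularity(edges, [{'a', 'b', 'c', 'd', 'e', 'f'}]))      # one big cluster -> Q = 0
```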
#### SCAN (Structural Clustering Algorithm for Networks)
* Neighborhood of $u$ : $\Gamma(u)=\{v|(u,v)\in E\} \cup \{u\}$
* **normalized common neighborhood size**: $\sigma(u,v)=\frac{|\Gamma(u) \cap \Gamma(v)|}{\sqrt{|\Gamma(u)||\Gamma(v)|}}$
* **$\epsilon$-neighborhood**: $N_\epsilon(u)=\{ v \in \Gamma(u)|\sigma(u,v) \geq \epsilon\}$
* **core vertex**: $|N_\epsilon(u)| \geq \text{popularity threshold }\mu$
>[name=Book]time complexity is linear with respect to the number of edges.
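A minimal sketch of the SCAN quantities above, i.e. $\sigma(u,v)$, $N_\epsilon(u)$, and the core-vertex test (the $\epsilon$ and $\mu$ defaults and the toy graph are mine):

```python
from math import sqrt

def structural_similarity(adj, u, v):
    """sigma(u, v) using the closed neighborhoods Gamma(u), Gamma(v)."""
    gu, gv = set(adj[u]) | {u}, set(adj[v]) | {v}
    return len(gu & gv) / sqrt(len(gu) * len(gv))

def eps_neighborhood(adj, u, eps):
    """N_eps(u) = vertices of Gamma(u) whose similarity to u is at least eps."""
    gamma_u = set(adj[u]) | {u}
    return {v for v in gamma_u if structural_similarity(adj, u, v) >= eps}

def is_core(adj, u, eps=0.7, mu=3):
    """u is a core vertex if its eps-neighborhood has at least mu members."""
    return len(eps_neighborhood(adj, u, eps)) >= mu

# toy undirected graph: a dense clique {a, b, c, d} plus a pendant vertex e
adj = {
    'a': ['b', 'c', 'd'],
    'b': ['a', 'c', 'd'],
    'c': ['a', 'b', 'd'],
    'd': ['a', 'b', 'c', 'e'],
    'e': ['d'],
}
print(structural_similarity(adj, 'a', 'b'))   # high: a and b share most neighbours
print(is_core(adj, 'a'), is_core(adj, 'e'))   # True, False
```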

## Clustering with Constraints
### Categorization of Constraints
1. on Instances
* **Must-link constraints** (ml)
$ml(x,y) \&ml(y,z) \rightarrow ml(x,z)$
* **Cannot-link constraints** (cnt)
$cnt(x,y) \&ml(x,x') \&ml(y,y') \rightarrow cnt(x',y')$
2. on Clusters
3. on Similarity Measurement
### Methods for Clustering with Constraints
#### Handling Hard Constraints
$e.g.$ **COP-k-means algorithm**

#### Handling Soft Constraints
$e.g.$ **CVQE (Constrained Vector Quantization Error) algorithm**
k-means clustering while enforcing constraint-violation penalties (extra distance terms added to the objective for each violated constraint)
#### Speeding up Constrained Clustering
$e.g.$ **clustering with obstacles**
**Triangulate the region $R$ into triangles**, then group nearby points within each triangle into microclusters, using a method similar to BIRCH or DBSCAN
