# Chp 11 Advanced Cluster Analysis
###### tags: `Data Mining 心得`
## Probabilistic Model-Based Clustering
### Fuzzy Clustering
* $a.k.a.$ **soft clustering**
Allows an object to belong to more than one cluster.
>[name=Omnom][color=#00a000]A fuzzy set is defined through a likelihood-like notion of the "possibility" that an element belongs to the set.
>
>That possibility is determined by the membership function associated with each fuzzy set.
>
>For example:
>$A=\{a: a =0\}$
>$membership ~ function : p(x \in A) = \frac{1}{|x|+1}$
>$\rightarrow p(3 \in A)=\frac{1}{4}$
* **Terminologies**
- objects: $o_1,...,o_n$
- $k$ fuzzy clusters: $C_1,...,C_k$
- partition matrix $M_{n \times k}=[w_{ij}]$
- $w_{ij}$: membership degree of $o_i$ in fuzzy cluster $C_j$
$0\leq w_{ij} \leq 1, \forall i,j$
$\sum\limits_{j=1}^k w_{ij}=1, \forall i$
#### Measuring how well a fuzzy clustering fits a data set
>[name=Omnom][color=#00a000]The textbook does not say explicitly how the center $c_j$ of a fuzzy cluster is computed.
>Following [Wikipedia](https://en.wikipedia.org/wiki/Fuzzy_clustering), the centroid here is presumably a weighted average of the objects with $w_{ij}$ as weights.
* **Sum of the squared error (SSE)** of clustering $\mathcal{C}$
$SSE(\mathcal{C})=\sum\limits_{i=1}^n\sum\limits_{j=1}^k w_{ij}^p \|o_i-c_j\|^2$
Controllable parameter $p \in \mathbb{N}$ governs how strongly the membership degrees influence the error
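A minimal sketch (my own, using NumPy and the Wikipedia-style weighted-average centroid mentioned in the comment above) of how $c_j$ and $SSE(\mathcal{C})$ could be computed from a partition matrix:

```python
import numpy as np

def fuzzy_sse(X, W, p=2):
    """X: (n, d) objects, W: (n, k) partition matrix, p: fuzzifier.

    Centroids are the w_ij^p-weighted averages of the objects
    (the Wikipedia-style choice; the textbook leaves c_j unspecified).
    Returns (centroids, SSE) using squared Euclidean distances.
    """
    Wp = W ** p                                          # (n, k)
    centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]     # (k, d)
    # squared distance of every object to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return centroids, float((Wp * d2).sum())

# toy example: 4 points, 2 fuzzy clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
W = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(fuzzy_sse(X, W))
```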
### Probabilistic Model-Based Clusters
* $k$ probabilistic clusters, $C_1,...,C_k$
with probability density functions $f_1,...,f_k$
with probabilities, $\omega_1,...,\omega_k$
* The probability that $o_i$ is generated by the set of clusters $\boldsymbol C$
$P(o_i|\boldsymbol C)=\sum\limits_{j=1}^k \omega_jf_j(o_i)$
* For independently generated objects
$P(D|\boldsymbol C)=\prod\limits_{i=1}^n P(o_i|\boldsymbol C)$
>[name=Book]To make probabilistic model-based clusters computationally feasible, we often compromise by assuming that the probability density functions are parameterized distributions.
Let $\Theta_1,...,\Theta_k$ be the parameters of the $k$ distributions
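A small sketch of the mixture likelihood above, assuming 1-D Gaussian components as the parameterized distributions (the component choice, function names, and toy numbers are mine):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian, used here as the parameterized f_j."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def log_likelihood(data, weights, mus, sigmas):
    """log P(D|C) = sum_i log( sum_j omega_j * f_j(o_i) )."""
    data = np.asarray(data, dtype=float)
    dens = np.stack([gauss_pdf(data, m, s) for m, s in zip(mus, sigmas)], axis=1)  # (n, k)
    return float(np.log(dens @ np.asarray(weights)).sum())

print(log_likelihood([1.0, 1.2, 7.9, 8.1],
                     weights=[0.5, 0.5], mus=[1.0, 8.0], sigmas=[0.5, 0.5]))
```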
### Expectation-Maximization Algorithm
>[name=Omnom][color=#00a000][Wikipedia: fuzzy c-means](https://en.wikipedia.org/wiki/Fuzzy_clustering)
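As a rough sketch of the EM idea for fuzzy clustering: the fuzzy c-means iteration described on the linked Wikipedia page alternates an E-step (recompute the partition matrix from the current centroids) and an M-step (recompute centroids from the memberships). All names, defaults, and toy data below are my own:

```python
import numpy as np

def fuzzy_cmeans(X, k, p=2, n_iter=50, seed=0):
    """EM-style iteration for fuzzy clustering (fuzzy c-means flavour)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # E-step: membership ~ inverse squared distance, normalized per object
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = 1.0 / d2 ** (1.0 / (p - 1))
        W = inv / inv.sum(axis=1, keepdims=True)
        # M-step: centroids are the w_ij^p-weighted means of the objects
        Wp = W ** p
        centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
    return W, centroids

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
W, C = fuzzy_cmeans(X, k=2)
print(np.round(W, 2), np.round(C, 2))
```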
## Clustering High-Dimensional Data
### Problems, Challenges, and Major Methodologies
#### Problems
[curse of dimensionality](https://zh.wikipedia.org/wiki/维数灾难)
>[name=Book]clusters hidden in high-dimensional data are often significantly smaller.
>$\rightarrow$ an exponential number of possible subspaces or dimensionality reduction options
>[name=Omnom][color=#00a000]In high dimensions, a change along any single dimension may have only a small effect on the overall evaluation ($e.g.$ a distance calculation),
>and some measures may even lose their meaning in high dimensions.
>
>Using the textbook's example:
>although the pairwise "distances" between the three people are all $\sqrt 2$, in terms of "whether they bought any item in common", Ada and Bob should in principle be the least similar.
#### Major Methodologies
1. Subspace clustering
2. Dimensionality reduction
### Subspace Clustering Methods
#### Strategies
1. Bottom-up:
* Start from low-dimensional subspaces and search higher-dimensional subspaces
* $e.g.$ CLIQUE
2. Top-down:
* Effective only if the locality assumption holds
$i.e.$ Subspace of a cluster can be determined by the local neighborhood.
* $e.g.$ PROCLUS: k-medoid-like method
#### Correlation-Based Clustering
Discover clusters that are defined by advanced correlation models.
#### Biclustering
Cluster objects and attributes simultaneously: given the set of objects $A$ and the set of attributes $B$,
a bicluster is a submatrix $I \times J$ with $I \subseteq A, J \subseteq B$
* **Requirements**:
1. A cluster involves only a small number of objects and attributes
2. An object or attribute may participate in several biclusters, or in none (the object/attribute-to-cluster mapping is neither onto nor one-to-one)
* **Application**
First proposed to address the need to analyze gene expression data
* **Types**
- **constant** values
1. on rows
2. on columns
3. (1 & 2)
- **coherent** values
1. evolution on rows

2. evolution on columns
3. pattern-based cluster (rows/columns shift by the same differences)

* **Method**
- Optimization-based using **$\delta$-Cluster Algorithm**
+ Terminology
+ mean of $i^{th}$ row:
$e_{iJ}=\frac{1}{|J|}\sum\limits_{j \in J} e_{ij}$
+ mean of $j^{th}$ column:
$e_{Ij}=\frac{1}{|I|}\sum\limits_{i \in I} e_{ij}$
+ mean in matrix:
$e_{IJ}=\frac{1}{|I||J|}\sum\limits_{i \in I,j \in J} e_{ij}$
+ **Residue** on each element:
$residue(e_{ij})=e_{ij}-e_{iJ}-e_{Ij}+e_{IJ}$
+ quality of the submatrix as a bicluster:
$H(I \times J)= \frac{1}{|I||J|}\sum\limits_{i \in I,j \in J}residue(e_{ij})^2$
+ $\delta$-bicluster:
$H(I \times J) \leq \delta$, for a given threshold $\delta \geq 0$
+ maximal $\delta$-bicluster:
$\nexists ~I' \supseteq I, J' \supseteq J$ with $I' \times J' \neq I \times J ~~s.t.~~H(I' \times J') \leq \delta$
+ **Heuristic greedy search** method to obtain a local optimal cluster
1. **Deletion phase**
Remove row or column of largest mean-squared residue until $\delta$-cluster is obtained
2. **Addition phase**
After deletion, expand $\delta$-cluster by adding row or column of the smallest mean-squared residue.
- Enumeration using **MaPle**
+ Terminology
- **p-score** of a $2 \times 2$ submatrix $\begin{pmatrix} e_{i_1j_1} & e_{i_1j_2} \\ e_{i_2j_1} & e_{i_2j_2}\end{pmatrix}$ (see the sketch after this list):
$|(e_{i_1j_1}+e_{i_2j_2})-(e_{i_1j_2}+e_{i_2j_1})|$
- **$\delta$-pCluster**:
$\forall ~2 \times 2$ submatrices, $p\_score \leq \delta$
+ **MaPle**
- **Monotonicity** of $\delta$-pCluster:
Submatrices of $\delta$-pCluster are also $\delta$-pClusters
- Systematically enumerates every combination of conditions using a set **enumeration tree** and a **DFS**.
- Enumerate by **pattern-growth methods** for frequent pattern mining
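A small sketch computing the two bicluster quality measures defined above: the mean-squared residue $H(I \times J)$ and a brute-force $\delta$-pCluster check via p-scores (the helper names and toy matrix are mine):

```python
import numpy as np

def mean_squared_residue(E):
    """H(I x J) for a submatrix E (rows I, columns J of the expression matrix)."""
    row_mean = E.mean(axis=1, keepdims=True)   # e_iJ
    col_mean = E.mean(axis=0, keepdims=True)   # e_Ij
    all_mean = E.mean()                        # e_IJ
    residue = E - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

def is_delta_pcluster(E, delta):
    """Check the p-score of every 2x2 submatrix of E against delta."""
    n, m = E.shape
    for i1 in range(n):
        for i2 in range(i1 + 1, n):
            for j1 in range(m):
                for j2 in range(j1 + 1, m):
                    p = abs((E[i1, j1] + E[i2, j2]) - (E[i1, j2] + E[i2, j1]))
                    if p > delta:
                        return False
    return True

# a perfect shifting pattern: rows differ by constants -> H = 0, all p-scores = 0
E = np.array([[1.0, 3.0, 6.0],
              [4.0, 6.0, 9.0],
              [0.0, 2.0, 5.0]])
print(mean_squared_residue(E), is_delta_pcluster(E, delta=0.0))
```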
### Dimensionality Reduction Methods and Spectral Clustering
#### Ng-Jordan-Weiss algorithm
1. Calculate **Affinity matrix, $W$**
$w_{ij}= e^{- \frac{\| \boldsymbol{o_i}-\boldsymbol{o_j}\|^2}{\sigma ^2}}$
* $\sigma$: scaling parameter
* In this algorithm, $w_{ii}=0$
2. Calculate **matrix $A$**
Diagonal matrix $D$ where $d_{ii}=\sum\limits_{j=1}^n w_{ij}$
$A=D^{-\frac{1}{2}}WD^{-\frac{1}{2}}$
3. **Project** the original data into the new space using the $k$ leading eigenvectors $\{v_1,...,v_k\}$ of $A$ (those with the largest eigenvalues).
Renormalize each row of $V=[v_1,...,v_k] \in \mathbb R^{n \times k}$ to unit length, giving a new matrix $V'$
4. Use a **clustering** method ($e.g.$ $k$-means) to cluster the rows of $V'$
$\boldsymbol o_i$ belongs to the $j^{th}$ cluster iff the $i^{th}$ row of $V'$ is assigned to the $j^{th}$ cluster
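A compact sketch of the four steps above, assuming NumPy/scikit-learn, a Gaussian affinity on squared distances, and $k$-means in the final step (the `sigma` default and the `KMeans` settings are my choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral(X, k, sigma=1.0):
    """Sketch of the Ng-Jordan-Weiss steps on data X of shape (n, d)."""
    # 1. affinity matrix with zero diagonal
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    # 2. A = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    A = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # 3. k leading eigenvectors, rows renormalized to unit length
    _, eigvecs = np.linalg.eigh(A)             # eigenvalues in ascending order
    V = eigvecs[:, -k:]                        # k largest
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    # 4. cluster the rows of V'; row i's label is object o_i's cluster
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5.0])
print(njw_spectral(X, k=2))
```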
## Clustering Graph and Network Data
### Similarity Measures
#### Geodesic Distance
* Def: the number of edges on a shortest path between the two vertices
* **eccentricity** of vertex $v$: $eccen(v)= \max\limits_{u \in V} d(u,v)$, where $d$ is the geodesic distance
* **radius** of graph $G$: $r= \min\limits_{v \in V} eccen(v)$
* **diameter** of graph $G$: $d= \max\limits_{v \in V} eccen(v)$
* **peripheral vertex**: $\arg\max\limits_{v \in V} eccen(v)$
* $e.g.$

>[name=Book]Consider graph $G$ above. The eccentricity of a is 2, that is, $eccen(a) = 2$, $eccen(b) = 2$, and $eccen(c) = eccen(d) = eccen(e) = 3$. Thus, the radius of $G$ is 2, and the diameter is 3. Note that it is not necessary that $d = 2 × r$. Vertices $c$, $d$, and $e$ are peripheral vertices.
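A small BFS sketch of these definitions. The edge set below is assumed (the book's figure is not reproduced here) but chosen so the eccentricities match the quoted example:

```python
from collections import deque

def geodesic_distances(adj, source):
    """BFS distances (number of edges) from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccen(adj, v):
    return max(geodesic_distances(adj, v).values())

# assumed undirected toy graph reproducing the quoted eccentricities
adj = {
    'a': ['b', 'c'],
    'b': ['a', 'd', 'e'],
    'c': ['a'],
    'd': ['b'],
    'e': ['b'],
}
ecc = {v: eccen(adj, v) for v in adj}
radius, diameter = min(ecc.values()), max(ecc.values())
peripheral = [v for v, e in ecc.items() if e == diameter]
print(ecc, radius, diameter, peripheral)   # eccen(a)=eccen(b)=2, r=2, d=3, peripheral c,d,e
```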
#### SimRank: Similarity Based on Random Walk and Structural Context
>[name=Book]The intuition behind similarity based on structural context is that **two vertices in a graph are similar if they are connected to similar vertices**.
1. Based on **Structural Context**
**individual in-neighborhood** of $v$: $I(v)=\{u|(u,v) \in E\}$
$s(u,v)=\frac{C}{|I(u)||I(v)|}\sum\limits_{x \in I(u)}\sum\limits_{y \in I(v)}s(x,y)$
$s(v,v)=1$
$s(u,v)=0$ if $|I(u)||I(v)|=0$
$C \in (0,1)$: decay rate
* iteratively evaluate
$s_0(u,v)=\begin{cases}0 & \text{if } u \neq v\\1 & \text{if } u = v\end{cases}$
$s_{i+1}(u,v)=\frac{C}{|I(u)||I(v)|}\sum\limits_{x \in I(u)}\sum\limits_{y \in I(v)}s_i(x,y)$
2. Based on **Random Walk**
* **individual out-neighborhood** of $v$: $O(v)=\{w|(v,w) \in E\}$
* **expected distance** from $u$ to $v$: $d(u,v)=\sum\limits_{t} P[t]\,l(t)$, summing over tours $t$ from $u$ to $v$
$l(t)$: length of tour $t$
$P[t]=\begin{cases}\prod\limits_{x \in t}\frac{1}{|O(x)|} & \text{if } l(t)>0\\0 & \text{otherwise}\end{cases}$
* **expected meeting distance**: $m(u,v)=\sum\limits_{t:\ \text{walks from } u,v \text{ meet at some } x}P[t]\,l(t)$
* **expected meeting probability**: $p(u,v)=\sum\limits_{t:\ \text{walks from } u,v \text{ meet at some } x}P[t]\,C^{l(t)}$
[Supplementary notes](https://blog.csdn.net/u013527419/article/details/51943785)
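A sketch of the iterative structural-context computation $s_i \rightarrow s_{i+1}$ above on a tiny directed graph (the graph, decay rate, and iteration count are mine):

```python
import itertools

def simrank(edges, C=0.8, n_iter=10):
    """Iterative SimRank on a directed graph given as a list of (u, v) edges."""
    nodes = sorted({x for e in edges for x in e})
    in_nb = {v: [u for (u, w) in edges if w == v] for v in nodes}
    # s_0(u, v) = 1 if u == v else 0
    s = {(u, v): float(u == v) for u, v in itertools.product(nodes, nodes)}
    for _ in range(n_iter):
        s_next = {}
        for u, v in itertools.product(nodes, nodes):
            if u == v:
                s_next[(u, v)] = 1.0
            elif not in_nb[u] or not in_nb[v]:
                s_next[(u, v)] = 0.0
            else:
                total = sum(s[(x, y)] for x in in_nb[u] for y in in_nb[v])
                s_next[(u, v)] = C * total / (len(in_nb[u]) * len(in_nb[v]))
        s = s_next
    return s

# toy citation-style graph: a and b are cited by the same vertices, c is not
edges = [('p1', 'a'), ('p1', 'b'), ('p2', 'a'), ('p2', 'b'), ('p3', 'c')]
sim = simrank(edges)
print(round(sim[('a', 'b')], 3), round(sim[('a', 'c')], 3))
```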
### Graph Clustering Methods
#### Challenges
* High computational cost
* Sophisticated graphs
* High dimensionality
* Sparsity
#### Terminology
* **sparsity of a cut** $C = (S,T)$: $\Phi=\frac{\text{cut size}}{\min\{|S|,|T|\}}$, where the cut size is the number of edges crossing between $S$ and $T$
* **modularity** of a $k$-partition-clustering: $Q=\sum\limits_{i=1}^k (\frac{|E_i|}{|E_{total}|}-\left(\frac{\sum\limits_{v\in C_i} deg(v)}{2|E_{total}|}\right)^2)$
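A small sketch of the modularity formula above for an undirected graph given as an edge list (the helper name and toy graph are mine):

```python
def modularity(edges, clusters):
    """Q = sum_i ( |E_i|/|E| - (sum_{v in C_i} deg(v) / (2|E|))^2 )."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for cluster in clusters:
        inside = sum(1 for u, v in edges if u in cluster and v in cluster)
        deg_sum = sum(deg.get(v, 0) for v in cluster)
        q += inside / m - (deg_sum / (2 * m)) ** 2
    return q

# two triangles joined by a single bridge edge
edges = [('a', 'b'), ('b', 'c'), ('a', 'c'),
         ('d', 'e'), ('e', 'f'), ('d', 'f'),
         ('c', 'd')]
print(modularity(edges, [{'a', 'b', 'c'}, {'d', 'e', 'f'}]))   # good split -> Q > 0
print(modularity(edges, [{'a', 'b', 'c', 'd', 'e', 'f'}]))      # one big cluster -> Q = 0
```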
#### SCAN (Structural Clustering Algorithm for Networks)
* Neighborhood of $u$ : $\Gamma(u)=\{v|(u,v)\in E\} \cup \{u\}$
* **normalized common neighborhood size**: $\sigma(u,v)=\frac{|\Gamma(u) \cap \Gamma(v)|}{\sqrt{|\Gamma(u)||\Gamma(v)|}}$
* **$\epsilon$-neighborhood**: $N_\epsilon(u)=\{ v \in \Gamma(u)|\sigma(u,v) \geq \epsilon\}$
* **core vertex**: $|N_\epsilon(u)| \geq \text{popularity threshold }\mu$
>[name=Book]time complexity is linear with respect to the number of edges.
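A minimal sketch of the SCAN quantities above, i.e. $\sigma(u,v)$, $N_\epsilon(u)$, and the core-vertex test (the $\epsilon$ and $\mu$ defaults and the toy graph are mine):

```python
from math import sqrt

def structural_similarity(adj, u, v):
    """sigma(u, v) using the closed neighborhoods Gamma(u), Gamma(v)."""
    gu, gv = set(adj[u]) | {u}, set(adj[v]) | {v}
    return len(gu & gv) / sqrt(len(gu) * len(gv))

def eps_neighborhood(adj, u, eps):
    """N_eps(u) = vertices of Gamma(u) whose similarity to u is at least eps."""
    gamma_u = set(adj[u]) | {u}
    return {v for v in gamma_u if structural_similarity(adj, u, v) >= eps}

def is_core(adj, u, eps=0.7, mu=3):
    """u is a core vertex if its eps-neighborhood has at least mu members."""
    return len(eps_neighborhood(adj, u, eps)) >= mu

# toy undirected graph: a dense clique {a, b, c, d} plus a pendant vertex e
adj = {
    'a': ['b', 'c', 'd'],
    'b': ['a', 'c', 'd'],
    'c': ['a', 'b', 'd'],
    'd': ['a', 'b', 'c', 'e'],
    'e': ['d'],
}
print(structural_similarity(adj, 'a', 'b'))   # high: a and b share most neighbours
print(is_core(adj, 'a'), is_core(adj, 'e'))   # True, False
```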

## Clustering with Constraints
### Categorization of Constraints
1. on Instances
* **Must-link constraints** (ml)
$ml(x,y) \&ml(y,z) \rightarrow ml(x,z)$
* **Cannot-link constraints** (cnt)
$cnt(x,y) \&ml(x,x') \&ml(y,y') \rightarrow cnt(x',y')$
2. on Clusters
3. on Similarity Measurement
### Methods for Clustering with Constraints
#### Handling Hard Constraints
$e.g.$ **COP-k-means algorithm**

#### Handling Soft Constraints
$e.g.$ **CVQE (Constrained Vector Quantization Error) algorithm**
k-means clustering while enforcing constraint-violation penalties (extra distance terms added to the objective for each violated constraint)
#### Speeding up Constrained Clustering
$e.g.$ **clustering with obstacles**
**Triangulate the region $R$ into triangles**, then group nearby points within each triangle into microclusters, using a method similar to BIRCH or DBSCAN
