# Section 9
[136. Chapter agenda](https://hackmd.io/SO22bwdjR--aTo6IN5G9LA?both#136-Chapter-agenda16)
137. Hierarchical & agglomerative clustering introduction
138. Dendrogram linkages & constructing dendrograms
139. Cophenetic distance & cophenetic correlation
140. Constructing dendrograms with SciPy
141. Demo: Constructing dendrograms with SciPy
142. Approaches for extracting clusters from dendrograms
143. Agglomerative clustering with SciPy and sklearn
144. Demo: Agglomerative clustering with SciPy & dendrogram manipulation
145. Demo: Agglomerative clustering with sklearn
146. Agglomerative clustering general guidelines
147. Demo: Clustering cars (numerical data)
148. Demo: Clustering animals (categorical data)
149. Demo: Clustering cars (mixed data)
150. Chapter summary
---
# Agglomerative hierarchical clustering Part 1 (136-143)
---
## 136. Chapter agenda
Agglomerative clustering (hierarchical)
Dendrograms
Advantages and disadvantages
Intro:
What is hierarchical clustering
What is agglomerative clustering
Dendrograms:
Building dendrograms & linkage methods.
Cophenetic distance & correlation
Building dendrograms with SciPy
Multiple methods for getting clusters from dendrograms:
Height cutting
Number of clusters based cutting
Inconsistency method
Agglomerative clustering with:
Scipy
Sklearn
General guidelines for agglomerative clustering:
Preprocessing.
Pros and cons.
When to use agglomerative clustering.
Cluster and dendrogram quality metrics
Cluster Data:
Artificial datasets.
Iris flowers. (less successfully than before)
Digits. (not successfully)
Cars.
Animals in a zoo.
More cars!
---
## 137. Hierarchical & agglomerative clustering introduction
Concept overview
Hierarchical clustering.
Dendrogram.
Agglomerative clustering.
Comparison to k-means.
### Hierarchical clustering.
Used to find groups in data, based on features.
Distance-based clustering algorithm.
Based on a dendrogram.
### Dendrogram.
Tree-like diagram used to illustrate the arrangement and relationships between different data points or groups.
Each branch represents a grouping.
Height of branching: the degree of dissimilarity.
### Agglomerative clustering.
Clustering performed by declaring dendrogram groups as clusters.
Building:
Top-down approach:
Starts with all data points in a single group
Recursively divides it into smaller groups
Bottom-up approach (Agglomerative):
Begins with individual data points
Successively merges them into larger groups
Parts (a kind of anatomy):
Leaves: individual data points.
Nodes: points where groups merge.
Branch height: indicates the distance between the merged groups.
@YukNote: "height" is not really height here; in a horizontally drawn dendrogram it runs along the horizontal axis.
Branch: everything below a certain node, including all the connected links and lower nodes; a subset of the data that has been grouped together hierarchically.
Links: lines connecting nodes and/or leaves.
@YukNote: the vertical direction is what reads as "height" to me.
Root node: top-most node in the dendrogram.
Steps:
1. Build a dendrogram using the bottom-up approach.
2. "Cut" the dendrogram and obtain clusters (see the sketch below).
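A minimal sketch of these two steps with SciPy (the functions are covered in detail in lectures 140-143); the data array `X` here is just a placeholder:
```python
# Minimal sketch of the agglomerative workflow: build the dendrogram, then cut it.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)                         # placeholder data: 20 points, 2 features
Z = linkage(X, method="ward")                     # step 1: build the linkage matrix (dendrogram)
dendrogram(Z)                                     # optional plot (requires matplotlib)
labels = fcluster(Z, t=3, criterion="maxclust")   # step 2: "cut" into 3 clusters
```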
### Comparison to k-means.
Similarities:
Distance based clustering algorithm.
Differences:
**Not** using concept of centroids.
Providing "K" is **not mandatory**.
All "K"-s **can** be obtained from same dendrogram.
No iterating **until** convergence.
Algorithm returns **same** results **each** time.
---
## 138. Dendrogram linkages & constructing dendrograms
Bottom-up approach dendrogram building.
Linkage matrix.
Linkage methods.
### Bottom-up approach dendrogram building.
Steps:
1. Choose a distance metric.
2. Start with each individual data point as its own group.
3. Progress by continually merging pairs of smaller groups into larger ones.
4. Conclude when all data points and groups are merged into a single, unified group.
#### Merging rules
Each **iteration**: two groups are merged (the most similar groups are chosen).
**Singleton** groups: the **distance measure** is used to **determine similarity**.
**Multiple "linkage rules"** are available for determining similarity when one of the groups is not a singleton.
The same rule is maintained for all merges.
### Linkage matrix.
Dendrogram is **described** by **linkage matrix**.
Each **row** in the linkage matrix represents a **merging event**.
Distance represents "**height**" of the **merge** on dendrogram.
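A small sketch of what a linkage matrix looks like (toy 1-D data, values are illustrative):
```python
# Each linkage-matrix row is one merging event:
# [index of group 1, index of group 2, merge distance ("height"), size of the merged group]
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0], [0.1], [5.0], [5.2]])   # four 1-D points
Z = linkage(X, method="single")
print(Z)
# The first rows merge the close pairs (0, 1) and (2, 3);
# the last row merges the two resulting groups at a much larger height.
```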
#### Linkage types
Popular choices:
* Single (minimum) linkage
* the distance **between** two groups is defined as the **shortest** distance **between** two **points** in **each group**
* YukNote: computed using the closest pair of points.
* Computational complexity: Low, suitable for large datasets.
* Sensitivity to noise: High - Noise can create a path for the chaining effect.
* Sensitivity to outliers: Low - Minimum distance is less affected by outliers, leading to robustness.
* Cluster shape: Elongated/Irregular - Minimum linkage allows chaining.
* Chaining effect tendency: High - Minimum distances can create paths linking distant clusters.
* Chaining effect is a phenomenon where clusters are combined due to a single point being close to another cluster. Similar to noise sensitivity.
* Dendrogram quality: Less Informative - Chaining can obscure true hierarchical structure.
* General applicability: Specialized - Best for specific cases requiring **non-compact clusters**.
* Complete (maximum) linkage
* the distance **between** two groups is defined as the **maximum** distance **between** two **points** in **each group**
* YukNote: computed using the farthest pair of points.
* Computational complexity: Low
* Sensitivity to noise: Moderate - Maximum distances can mitigate but not eliminate noise.
* Sensitivity to outliers: Moderate - Maximum distance can be influenced by outliers.
* Cluster shape: Spherical - Maximum linkage encourages compactness.
* Chaining effect tendency: Low - Maximum distances promote compactness.
* Dendrogram quality: Informative - Produces relatively clear and discernible dendrograms.
* General applicability: General - Good for cases where compactness is important.
* Average linkage
* The distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster.
* YukNote: a singleton point is used as-is; if it is a cluster, think of it as averaging, e.g. (point A + point B)/2, which then combines with point C to form the new group.
* Computational complexity: Moderate
* Sensitivity to noise: Moderate - Averaging distances helps to mitigate noise.
* Sensitivity to outliers: Moderate - Averaging distances mitigates, but doesn't eliminate, the impact of outliers.
* Cluster shape: Variable - Averages distances, more flexible with cluster shapes.
* Chaining effect tendency: Low - Moderate - Averaging distances partially mitigates chaining.
* Dendrogram quality: Informative - Balanced approach gives good hierarchical representation.
* General applicability: General - Versatile for various types of data.
* Ward linkage
* Aims to minimize the total within-cluster variance when merging clusters.
* For groups A and B:
* ward_dist(A, B) = SSE(A ∪ B) - (SSE(A) + SSE(B))
* SSE is the sum of squared errors: SSE(G) = Σ_{x∈G} ‖x - μ_G‖², where μ_G is the centroid of group G.
* @YukNote (group centroid): the group centroid is calculated the same way as in the k-means algorithm; for a single group, the SSE is identical to its inertia.
* So the Ward distance = inertia of the merged group minus the inertias of the two smaller groups.
* The logic is the same as in k-means: when merging nearby groups, the centroids barely move, so the increase in inertia stays small.
* (A small numeric sketch follows after this list.)
* Computational complexity: High, less suitable for very large datasets
* Sensitivity to noise: Low to Moderate - Minimizes sum of squared differences, averaging effect reduces impact of noise.
* Sensitivity to outliers: Low - SSE effectively penalizes outliers, making the method relatively insensitive to them.
* Cluster shape: Variable but Compact - Minimizes within-cluster variance, tends to create more balanced clusters.
* Chaining effect tendency: Low - Focus on within-cluster variance minimizes chaining
* Dendrogram quality: Most Informative - Minimization of within-cluster variance offers detailed hierarchy.
* General applicability: General but computationally intensive - Wide applicability but at computational cost.
* Can only be used with Euclidean distance.
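A small numeric sketch of the ward_dist formula above (toy 1-D groups; note that SciPy's ward linkage stores a related but differently scaled value in the linkage matrix):
```python
# Toy check of ward_dist(A, B) = SSE(A ∪ B) - (SSE(A) + SSE(B)).
import numpy as np

def sse(points):
    """Sum of squared distances of the points to their centroid (the group's inertia)."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

A = np.array([[0.0], [1.0]])      # SSE(A) = 0.5
B = np.array([[4.0], [5.0]])      # SSE(B) = 0.5
ward_dist = sse(np.vstack([A, B])) - (sse(A) + sse(B))
print(ward_dist)                   # 17.0 - 1.0 = 16.0
```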
| -- |Single (minimum) linkage | Complete (maximum) linkage | Average linkage| Ward linkage |
| -------- | -------- | -------- | -------- | -------- |
|Computational complexity | Low | Low | Moderate | High |
|Sensitivity to noise| High | Moderate | Moderate | Low to Moderate|
| Sensitivity to outliers| Low | Moderate | Moderate | Low |
|Cluster shape| Elongated/Irregular | Spherical | Variable | Variable but Compact |
|Chaining effect tendency | High | Low | Low - Moderate | Low |
|Dendrogram quality | Less Informative | Informative | Informative | Most Informative|
|General applicability | Specialized | General | General |General but computationally intensive|
## 139. Cophenetic distance & cophenetic correlation
### Cophenetic distance
* Distance measure.
* Measure of dissimilarity between two data points in the dendrogram
* It is equal to the height at which these two points are first combined into a single cluster.
@YukNote: don't overthink it; just trace up the dendrogram ("climb the grid") until the two points first join, and that height is the cophenetic distance.
### Cophenetic correlation
* Correlation between cophenetic **distances** and **original data point** distances.
* Shows the quality of the dendrogram in representing dataset structure.
* High values (e.g. >0.7) mean that the dendrogram does a good job of keeping similar points close together and different points far apart
@YukThink: this feels just like an R or R² value.
---
## 140. Constructing dendrograms with SciPy
### Scipy linkage matrix
`scipy.cluster.hierarchy.linkage (y, method='single', metric='euclidean',optimal_ordering=False)`
> y=Data, pandas dataframe or 2d numpy array.
> method=Linkage method
> metric=Distance measure. Name or custom function can be provided.
> optimal_ordering: If True, similar "leafs" will be placed closer together in the dendrogram.
> > Computationally expensive
### Scipy dendrogram
Plots the dendrogram. Returns dict containing dendrogram info
`scipy.cluster.hierarchy.dendrogram (Z, p=30, truncate_mode=None, color_threshold=None, get_leaves=True, orientation='top', labels=None, count_sort=False, distance_sort=False, show_leaf_counts=True, no_plot=False, no_labels=False, leaf_font_size=None, leaf_rotation=None, leaf_label_func=None, show_contracted=False, link_color_func=None, ax=None, above_threshold_color='C0')`
> Z=Linkage matrix
> p=Truncation parameter (e.g. how many leaves to keep); used together with truncate_mode.
> truncate_mode=How to truncate the dendrogram.
> color_threshold=All branches below the threshold will be colored with different colors.
> get_leaves=Whether to return leaf-related info in the output dict.
> orientation=Controls dendrogram rotation.
> labels=Leaf labels.
> count_sort=For each node, order its two descendant links by the number of points they contain.
> distance_sort=For each node, order its two descendant links by their merge distance.
> show_leaf_counts=Display the number of original data points contained in each (truncated) leaf.
> no_plot=If True, the dendrogram is not plotted (only the info dict is returned).
> no_labels=If True, leaf labels are not displayed.
> leaf_font_size=Font size for leaf labels.
> leaf_rotation=Rotation angle (in degrees) for leaf labels.
> leaf_label_func=Function mapping a leaf index to its label string.
> show_contracted=When truncating, show the heights of the contracted branches as marks along the links.
> link_color_func=Function that can be used to color each link
> ax=Matplotlib ax object.
> above_threshold_color= Color of all the links above threshold.
### Cophenetic distance
Returns the cophenetic distances (condensed distance matrix).
```
scipy.cluster.hierarchy.cophenet (Z, Y=None)
```
> If Y is provided, also returns cophenetic correlation.
> Z=Linkage matrix
> Y=Pairwise distance matrix in condensed format (upper triangle only).
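A minimal end-to-end sketch tying the three functions together (placeholder data; matplotlib assumed for the plot):
```python
# Build a linkage matrix, plot the dendrogram, and check the cophenetic correlation.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(30, 2)                        # placeholder: 30 points, 2 features
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)                                    # plot the full dendrogram
plt.show()

coph_corr, coph_dists = cophenet(Z, pdist(X))    # correlation + cophenetic distances
print("Cophenetic correlation:", coph_corr)
```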
---
## 141. Demo: Constructing dendrograms with SciPy
### Dendrograms - iris subset
Keep only petal length and petal width (only 2 features).
Subset the dataset to two columns for easier visualization.

This is what we saw in the previous lecture.

`cophenet(linkage_matrix, pdist(iris_df_sub[numer_cols].to_numpy()))[0]`
Just run it!



A few points are fine... but for the whole dataset the dendrogram becomes too crowded to read.

Cophenetic correlation : 0.8638786773076584




### Dendrogram truncation
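A hedged sketch of truncation (parameter values are illustrative; `linkage_matrix` is the matrix built earlier in this demo):
```python
# Show only the last 12 merges instead of every single leaf.
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

dendrogram(linkage_matrix, truncate_mode="lastp", p=12,
           show_leaf_counts=True, show_contracted=True)
plt.show()
```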

---
## 142. Approaches for extracting clusters from dendrograms
Finding clusters
Extracting clusters from dendrogram.
Height based dendrogram cutting.
Number of clusters based cutting.
Inconsistency method.
Cluster quality.
### Dendrogram and clusters
We want items in the same cluster to be alike, but different from items in other clusters.
Dendrogram height contains information on dissimilarity.
We want to "cut" links where large jumps in linkage distances occur.
### Height cutting
Clusters are detected by cutting the dendrogram at a **certain height**.
All branches below the cut point are **declared clusters**.
The cut height is determined by **manually analyzing** the dendrogram.
### Number of clusters approach
**Select number** of clusters
The "cut" in the dendrogram will occur at a height that falls between the heights of the **"k-1"-th** and **"k"-th** merges.
* Observed from the **top** of the **dendrogram**.
OR
* Observed from the **end** of the **linkage matrix**.

### Inconsistency method
Find nodes that have a **significantly greater height** compared to their **descendant nodes**,
by comparing the **node height** with its descendant node heights.
INCONSISTENCY SCORE CALCULATION
* Select a node.
* Record the node height.
* Select a number of levels d (the depth).
* Find the heights of all the nodes that are d levels below the node.
► The node itself is considered a level.
* Compute the mean and standard deviation (SD) of those heights.
* Inconsistency score = (node height - mean) / SD.
### Cutting procedure
Determine the depth.
Calculate inconsistency scores for all nodes.
Plot the inconsistency scores.
Select an inconsistency threshold.
If a dendrogram node and all its descendants have an inconsistency value less than or equal to the threshold, then all its leaf descendants belong to the same flat cluster (a code sketch follows below).
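A minimal sketch of this procedure with SciPy (the functions are covered in lecture 143; depth and threshold values are illustrative, `linkage_matrix` is assumed from earlier):
```python
# Inconsistency-based cutting: compute scores, inspect them, then cut with a threshold.
from scipy.cluster.hierarchy import inconsistent, fcluster

depth = 3
R = inconsistent(linkage_matrix, d=depth)   # per-node rows: [mean, std, count, inconsistency]
print(R[-10:, 3])                           # scores of the last (highest) merges

labels = fcluster(linkage_matrix, t=1.0, criterion="inconsistent", depth=depth, R=R)
```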

### Parameter selection
**Depth** - no clear guidelines.
* 3~5 is a decent choice.
* Can be advantageous to calculate inconsistencies for multiple depths and compare results.
* Inconsistency threshold:
* Based on visual inspection, select threshold based on values that "stand out" from the rest.
### Pros and cons
Pros:
* Can select clusters with different "cut" height.
* To some extent, **automates** dendrogram cutting.
Cons:
* **Inconsistency** can be **depth sensitive**.
* Choosing **inconsistency threshold** can be hard sometimes.
CLUSTER QUALITY check
* Silhouette score.
* UMAP.
* Cluster characterization.
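A tiny sketch of the silhouette check (assuming `X` is the feature matrix and `labels` the cluster assignments obtained above):
```python
# Higher silhouette score (closer to 1) = better separated, more compact clusters.
from sklearn.metrics import silhouette_score

print(silhouette_score(X, labels))
```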
---
## 143. Agglomerative clustering with SciPy and sklearn
FINDING CLUSTERS – PYTHON LIBS
### SciPy
#### Clustering
First two steps (already covered; see lectures 140 and 141):
* `scipy.cluster.hierarchy.linkage` - builds the linkage matrix.
* `scipy.cluster.hierarchy.dendrogram` - plots the dendrogram.
Other functions from scipy for agglomerative clustering:
* `scipy.cluster.hierarchy.inconsistent`
* `scipy.cluster.hierarchy.fcluster`
inconsistent
* Calculates inconsistency coefficient for all **nodes**
`scipy.cluster.hierarchy.inconsistent(Z, d=2)`
> Z=linkage matrix
> d=depth
> Rows correspond to the linkage matrix rows.
fcluster
* Function that returns an array containing the cluster id for each data point.
```
scipy.cluster.hierarchy.fcluster (Z, t, criterion='inconsistent', depth=2, R=None,
monocrit=None)
```
> Z=linkage matrix
> t= threshold for the chosen criterion
> criterion=
> ▸ inconsistent - use inconsistency
> ▸ distance - height cut
> ▸ maxclust - height cut based on required number of clusters
> depth=If "inconsistent" criteria is chosen but inconsistency matrix is not provided it will be calculated based on this depth.
> R=inconsistency matrix; if None, it is calculated automatically based on depth.
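A short sketch of the three criteria (`Z` is a linkage matrix; the threshold values are illustrative):
```python
from scipy.cluster.hierarchy import fcluster

labels_height = fcluster(Z, t=2.5, criterion="distance")       # height cut at 2.5
labels_k      = fcluster(Z, t=4,   criterion="maxclust")       # ask for (at most) 4 clusters
labels_incons = fcluster(Z, t=1.0, criterion="inconsistent", depth=3)
```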

### sklearn
#### AGGLOMERATIVE CLUSTERING
`class sklearn.cluster.AgglomerativeClustering (n_clusters=2, *, affinity='deprecated', metric=None, memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False) [source]`
> AgglomerativeClustering==a class from sklearn.cluster.
> CLASS ARGUMENTS:
> n_clusters=Number of clusters to retrieve
>affinity='deprecated'
>metric=None <=Distance measure name or custom function.
>memory=None
>connectivity=None:
>>Sparse connectivity matrix (contains zeros and ones).
>>Value 1 at row i and column j means that data points i and j are connected.
>>Two groups can be merged only if they contain connected elements.
>compute_full_tree='auto' <=If False, linkage formation will stop at n_clusters.
>linkage='ward' <=Linkage method.
> distance_threshold=Cut height (if set, n_clusters must be None).
> compute_distances=False <=If False, distances between groups will not be returned


IMPORTANT ATTRIBUTES
> labels_ - data point cluster identities.
> children_ - contains the first two columns of the linkage matrix.
> distances_ - distance information from the linkage matrix (only available when compute_distances=True or distance_threshold is set).
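A minimal sklearn sketch (placeholder data and parameter values):
```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(100, 3)                    # placeholder: 100 points, 3 features

model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(X)

print(labels[:10])                            # cluster id for each data point
print(model.children_[:5])                    # first two columns of the linkage matrix
```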
---