# Section 9 Part B

[144. Demo: Agglomerative clustering with SciPy & dendrogram manipulation](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#144-Demo-Agglomerative-clustering-with-SciPy-amp-dendrogram-manipulation)
[145. Demo: Agglomerative clustering with sklearn](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#145-Demo-Agglomerative-clustering-with-sklearn)
[146. Agglomerative clustering general guidelines](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#146-Agglomerative-clustering-general-guidelines)
[147. Demo: Clustering cars (numerical data)](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#147-Demo-Clustering-cars-numerical-data)
[148. Demo: Clustering animals (categorical data)](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#148-Demo-Clustering-animals-categorical-data)
[149. Demo: Clustering cars (mixed data)](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#149-Demo-Clustering-cars-mixed-data)
[150. Chapter summary](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#150-Chapter-summary)

---

# 144. Demo: Agglomerative clustering with SciPy & dendrogram manipulation

## Artificial dataset

![image](https://hackmd.io/_uploads/H1PMYRlekg.png)

### "Height cut" based clustering

`linkage_matrix = linkage(data_df, method='average')`

![image](https://hackmd.io/_uploads/HJe4tAgeJe.png)

```
cut_height = 4
clusters = fcluster(linkage_matrix, criterion='distance', t=cut_height)
```

![image](https://hackmd.io/_uploads/By94iRlgkg.png)
![image](https://hackmd.io/_uploads/SywwjAexye.png)

Not much different from before, but the cluster labels start from 1 instead of 0.

### Number of clusters based clustering

![image](https://hackmd.io/_uploads/Hy1ghAeg1l.png)

```
# Cluster based on the desired number of clusters
clusters = fcluster(Z=linkage_matrix, t=4, criterion='maxclust')
clusters
```

![image](https://hackmd.io/_uploads/Sk8-3Clgyl.png)

### Color the dendrogram based on clustering

Colors are assigned based on the cluster identity of each point.

Step flow:

```
Get unique clusters
Get first two columns of linkage matrix (the IDs of the two nodes merged at each step)
Get last "original" point ID (highest)
Initialize dict that will hold cluster-link relations  <= an empty dict, no IDs yet
Iterate through all the clusters
    Get all points belonging to current cluster
    Iterate until we collect all nodes associated with current cluster
        Find all positions in linkage matrix where points from the current cluster
        are present. These positions also denote IDs of merged nodes
        (merged node id = position in linkage mtx + max_orig_point_id)
        Sum the matrix to find positions where both points belong to the current
        cluster  <= build a True/False membership matrix for the two groups, then
        look for row sums == 2 (T + T = 2)
        Get node locations
        Variable that will be set to True if new nodes are added to the dict
        (new nodes start as False)
        Iterate through the nodes
            Get true IDs of the nodes  <= new nodes start as False, so look up
            their new IDs and verify they belong to the current cluster
            Add new nodes
        If no new nodes are added, all nodes associated with the current cluster
        are already in the dict. In this case break the loop and proceed to the
        next cluster  <= like checking tickets: not in this loop, on to the next cluster
        Add merged nodes for next iteration
===== when all points are processed =====
Get color palette having one color per cluster  <= apply the colors
Transform colors to hex format (the dendrogram function requires it)
```

One patch corresponds to one cluster.

![image](https://hackmd.io/_uploads/HJLz7yZl1l.png)

Yuk Note: the dendrogram can then be oriented top, bottom, left, or right.

![image](https://hackmd.io/_uploads/HyQdX1Zxyg.png)

The key point: since agglomerative clustering is a bottom-up approach (as covered earlier), the top/bottom and left/right orientations come out mirrored.

![image](https://hackmd.io/_uploads/r1F3dJWxyg.png)

### Inconsistency method

![image](https://hackmd.io/_uploads/ry6gKyblJg.png)

Perform clustering by cutting the dendrogram at the obtained inconsistency threshold 2.5

![image](https://hackmd.io/_uploads/HkzFt1Zxye.png)

Perform clustering by cutting the dendrogram at the obtained inconsistency threshold thold = 2.2

![image](https://hackmd.io/_uploads/SklaKJblJe.png)
![image](https://hackmd.io/_uploads/HJoTFk-l1e.png)

### Silhouette score

![image](https://hackmd.io/_uploads/rJjG5JWlkg.png)

CLUSTER QUALITY check: silhouette score.
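A minimal sketch of the quality check above, assuming a small artificial two-blob dataset (the data itself is illustrative, not the course's dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated artificial blobs standing in for the demo data
data = np.vstack([rng.normal(0, 0.5, (20, 2)),
                  rng.normal(5, 0.5, (20, 2))])

linkage_matrix = linkage(data, method='average')
clusters = fcluster(linkage_matrix, criterion='distance', t=4)

# Silhouette score lies in [-1, 1]; closer to 1 means tighter,
# better-separated clusters
score = silhouette_score(data, clusters)
print(round(score, 3))
```

Note that `fcluster` labels start at 1, which is fine for `silhouette_score` — it only cares about label identity, not values.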
Yuk Note: let's ask ChatGPT to explain this one.

![image](https://hackmd.io/_uploads/SkPKnJbl1g.png)

---

# 145. Demo: Agglomerative clustering with sklearn

![image](https://hackmd.io/_uploads/r10b7e-eJx.png)

Adjusted Rand Index (ARI)

![image](https://hackmd.io/_uploads/HJxyWHeZgJx.png)

For each linkage: run the clustering, calculate the ARI, and print it.

| Column 1 | Column 2 |
| -------- | -------- |
| ARI : 0.5637510205230709 | ARI : 0.6422512518362898 |
| ![image](https://hackmd.io/_uploads/H1IIXeWx1g.png) | ![image](https://hackmd.io/_uploads/By0DmlbxJl.png) |
| ARI : 0.7591987071071522 | ARI : 0.7311985567707746 |
| ![image](https://hackmd.io/_uploads/BJBjXg-gkg.png) | ![image](https://hackmd.io/_uploads/HJjCXeWgyl.png) |

## Digits dataset

![image](https://hackmd.io/_uploads/ryGVNg-gJx.png)

K=3; run the clustering, calculate the ARI, and print it. Only 500 points are used.

| Column 1 | Column 2 |
| -------- | -------- |
| ARI = 7.249519851737537e-08 | ARI = 0.15777822431598285 |
| ![image](https://hackmd.io/_uploads/HJXSVgZekg.png) | ![image](https://hackmd.io/_uploads/Sk2U4g-gJl.png) |
| ARI = 0.04417339487518105 | ARI = 0.5016084210255642 |
| ![image](https://hackmd.io/_uploads/HJ1tNxZe1g.png) | ![image](https://hackmd.io/_uploads/rkXsVgbxJx.png) |

Yuk Note: just write the program once and auto-select the linkage with the highest ARI.

---

# 146. Agglomerative clustering general guidelines

---

Data preprocessing guidelines. Pros. Cons. Optimal use cases. Sklearn vs scipy.

---

## Data preprocessing guidelines.

Inspect the data first.
Clean and reformat the data, if required.
Handle missing values.
All features should be on a similar scale. ► Standard, MinMax or other scaling.

If the dataset is categorical:
* Option 1:
    * Perform ordinal encoding.
    * Use Hamming distance.
* Option 2:
    * Perform one-hot encoding.
    * Use Euclidean distance.

If the features are mixed (numerical and categorical):
* Option 1:
    * Scale the numerical features.
    * Use ordinal encoding for the categorical features.
    * Use a distance measure such as Gower.
* Option 2:
    * Scale the numerical features.
    * One-hot encode the categorical features.
    * Use Euclidean distance.
### Linkage choice:

* Usually:
    * Use "ward" when Euclidean distance is used.
    * Use "average" when using other distance measures.
    * (just a general guideline)
* For data in certain domains, other linkages may work better due to the nature of the data!
    * Check academic papers & studies!

## Pros.

* DENDROGRAM
    * A way to examine clusterings of different granularity in a single view.
    * Useful even when clustering is not performed!
* Number of clusters is not a required input.
* Works with most distance measures.
* Can uncover various cluster shapes.
* Resistant to noise and outliers to some extent (linkage dependent).
* Different linkages can handle different scenarios.

## Cons.

* HIGH NUMBER OF SAMPLES
    * Computational complexity - can't be used with large datasets.
    * Dendrogram is not easy to analyze when the number of data points is large (e.g. >500).
* CAPTURING DATA STRUCTURE
    * Limited power when it comes to capturing structure in the data.
    * No linkage rule is perfect!
    * Dendrogram "group merging" structure may cause loss of pairwise distance information.
    * Not completely resistant to outliers and noise.
    * Chaining effects.
* SELECTING THRESHOLDS
    * Sometimes it can be hard to find the perfect cut height.
    * Sometimes it can be hard to spot the right threshold on the inconsistency plot.
* DIAGNOSTIC TOOLS
    * Use **UMAP** and **silhouette scores** to additionally characterize the quality of the clusters!
    * Use **cophenetic correlation** to analyze dendrogram quality.
    * Satisfying dendrogram quality **DOES NOT** guarantee high quality clusters!

## Optimal use cases.

* Number of data samples is **small**.
* We care about all the data **instances**.
    * e.g. genes: the relationships between specific genes and groups of genes.
* We care about **relationships** between groups of various granularities.
* Domain-**specific** data where agglomerative clustering is preferred.
    * Based on **published** papers and studies.

## Sklearn vs scipy.
Pros:

| Sklearn | Scipy |
| -------- | -------- |
| **Compatible** with other sklearn tools. | **Separate functions** for **each** step of the workflow. |
| Allows use of a **connectivity** matrix. | **Easier** dendrogram plotting. |
| Slightly **less code** compared to scipy. | Allows use of the **inconsistency** method. |
| | **Allows** linkage matrix **reuse**. |

Cons:

| Sklearn | Scipy |
| -------- | -------- |
| Requires **additional** work to plot a dendrogram. | No real disadvantages. |
| **Doesn't** allow linkage matrix **reuse**. | |
| **Inconsistency** method **NOT** available. | |

Sklearn: a better choice for the majority of agglomerative clustering use cases.
Scipy: no real disadvantages when it comes to agglomerative clustering and scientific Python.

---

# 147. Demo: Clustering cars (numerical data)

## Load and preprocess the data

```
The car MPG (Miles Per Gallon) dataset, often known as the Auto MPG dataset, is a popular collection of data sourced from the 1970s and 1980s. It provides a detailed insight into various attributes of cars that were on the market during that period.

Dataset contains the following columns:

1. `mpg`: Stands for Miles Per Gallon. This measures the distance in miles that a car can travel per gallon of fuel.
2. `cylinders`: Indicates the number of cylinders in the car's engine. This can be related to the power output of the engine.
3. `displacement`: A measure of the total volume of all the cylinders in an engine, typically measured in cubic inches or cubic centimeters.
4. `horsepower`: The power output of the car's engine, typically measured in horsepower.
5. `weight`: The total weight of the car, typically measured in pounds.
6. `acceleration`: A measure of how quickly the car can increase its speed, typically represented in seconds to go from 0 to 60 miles per hour.
7. `model year`: The year when the car model was released, typically represented as a two-digit number from 70 to 82 (for 1970 to 1982).
8. `origin`: A categorical variable representing the region where the car was manufactured. This is usually represented as a number: 1 for America, 2 for Europe, and 3 for Asia.
9. `car name`: The full name of the car model, typically in the format of "Manufacturer Model" (e.g., "ford torino").

Citation: Quinlan, R. (1993). Auto MPG. UCI Machine Learning Repository. https://doi.org/10.24432/C5859H.
```

Yuk Note: the most annoying part of loading a dataset is always `\` vs `/`.

Step:

```
Load dataset
Preview
Check for missing values
Preview column datatypes
Remove missing values
Get number of unique names
Give unique name to each car by adding a prefix (nth_occurence_car-name)
Cast other columns to float
Plot variable value distribution
```

![image](https://hackmd.io/_uploads/SJsEPW-xkl.png)

```
Remove origin column and save it as a separate variable
Make a copy of the dataframe for later use
Scale the data
```

## Agglomerative clustering

Cophenetic correlation : 0.7682471079646638

![image](https://hackmd.io/_uploads/HySPWM-e1l.png)
![image](https://hackmd.io/_uploads/SJQKZGWeyg.png)

## Clustering using the height method

Perform UMAP dimensionality reduction

![image](https://hackmd.io/_uploads/HkPjWzZxyg.png)

umap.plot.connectivity

![image](https://hackmd.io/_uploads/HyYhbG-gJl.png)

Colored by RGB coords of PCA embedding

![image](https://hackmd.io/_uploads/ByqAbMWxJg.png)

Perform clustering based on height

![image](https://hackmd.io/_uploads/H1gxfzbeJx.png)

Sanity check:

![image](https://hackmd.io/_uploads/B1JZMf-eJe.png)

Yuk Note: hmm, sort of............
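The cophenetic-correlation check and height-based clustering above can be sketched like this, with synthetic data standing in for the scaled car features (the blobs and the cut height are illustrative, not the course's values):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
# Synthetic stand-in for scaled numerical features: two 5-D blobs
data = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(6, 1, (30, 5))])

linkage_matrix = linkage(data, method='average')

# Cophenetic correlation: how faithfully the dendrogram's merge heights
# preserve the original pairwise distances (closer to 1 is better)
coph_corr, _ = cophenet(linkage_matrix, pdist(data))
print(f'Cophenetic correlation : {coph_corr:.3f}')

# Height method: cut the dendrogram at a chosen height
clusters = fcluster(linkage_matrix, criterion='distance', t=6)
```

As the guidelines note, a high cophenetic correlation does not by itself guarantee good clusters — it only says the dendrogram represents the distances well.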
Then let's look at the distributions anyway — much more interesting:

| Column 1 | Column 2 | Column 3 |
| -------- | -------- | -------- |
| ![image](https://hackmd.io/_uploads/ryOvMGWlyl.png) | ![image](https://hackmd.io/_uploads/rknvzfbekg.png) | ![image](https://hackmd.io/_uploads/SkbOMGZx1g.png) |
| ![image](https://hackmd.io/_uploads/rJD_ffWlJe.png) | ![image](https://hackmd.io/_uploads/rkiOGfZgke.png) | ![image](https://hackmd.io/_uploads/SkgFMMZgJg.png) |
| ![image](https://hackmd.io/_uploads/HJisGzblJg.png) | ![image](https://hackmd.io/_uploads/r1fhMzWgke.png) | |

## Clustering using inconsistency method

Calculate the inconsistency matrix with depth=4

![image](https://hackmd.io/_uploads/B1wyXzbgJx.png)

Perform clustering based on inconsistency and print the clusters

![image](https://hackmd.io/_uploads/rkhs7fWekl.png)
![image](https://hackmd.io/_uploads/BJp2mMbgyg.png)

Sanity check:

![image](https://hackmd.io/_uploads/r1TTQMZgyg.png)

## Cluster characterization

| Column 1 | Column 2 | Column 3 |
| -------- | -------- | -------- |
| ![image](https://hackmd.io/_uploads/ryqU4z-xkx.png) | ![image](https://hackmd.io/_uploads/BJGLNf-lJe.png) | ![image](https://hackmd.io/_uploads/HytBEM-xyx.png) |
| ![image](https://hackmd.io/_uploads/Sklr4fbgkx.png) | ![image](https://hackmd.io/_uploads/rkUN4fbekl.png) | ![image](https://hackmd.io/_uploads/rk6XNfblyg.png) |
| ![image](https://hackmd.io/_uploads/BJifEzbeyl.png) | ![image](https://hackmd.io/_uploads/HJwGVM-eJl.png) | |

Then let's look at the degrees of freedom:

![image](https://hackmd.io/_uploads/S17KNzZeJg.png)

Yuk Note: I'm not going to hunt for car photos XD

---

# 148. Demo: Clustering animals (categorical data)

```
The Zoo Dataset is a comprehensive collection of data about various animals found in zoos worldwide. The dataset is composed of several attributes related to these animals, such as:

- `animal_name`: The name of the animal.
- `hair`: Indicates if the animal has hair (1 for yes, 0 for no).
- `feathers`: Indicates if the animal has feathers (1 for yes, 0 for no).
- `eggs`: Indicates if the animal lays eggs (1 for yes, 0 for no).
- `milk`: Indicates if the animal produces milk (1 for yes, 0 for no).
- `airborne`: Indicates if the animal can fly (1 for yes, 0 for no).
- `aquatic`: Indicates if the animal lives in water (1 for yes, 0 for no).
- `predator`: Indicates if the animal is a predator (1 for yes, 0 for no).
- `toothed`: Indicates if the animal has teeth (1 for yes, 0 for no).
- `backbone`: Indicates if the animal has a backbone (1 for yes, 0 for no).
- `breathes`: Indicates if the animal breathes air (1 for yes, 0 for no).
- `venomous`: Indicates if the animal is venomous (1 for yes, 0 for no).
- `fins`: Indicates if the animal has fins (1 for yes, 0 for no).
- `legs`: Number of legs the animal has (integer value).
- `tail`: Indicates if the animal has a tail (1 for yes, 0 for no).
- `domestic`: Indicates if the animal is domesticated (1 for yes, 0 for no).
- `catsize`: Indicates if the animal is cat-sized or larger (1 for yes, 0 for no).
- `class_type`: Numerical code indicating the animal's taxonomic class.

Citation: Forsyth, Richard. (1990). Zoo. UCI Machine Learning Repository. https://doi.org/10.24432/C5R59V.
```

Cluster the animals by their features~

Step:

```
Load the data
Load class mapping
Check number of unique animals
Drop duplicates since there should be one species per row
Map class id to class name
Extract class as separate object and drop class from zoo_df
Set animal name as index
```

![image](https://hackmd.io/_uploads/By2KV7blJg.png)

## Agglomerative clustering with hamming distance

Cophenetic correlation : 0.8584617992386253

| labels=zoo_df.index | labels=animal_class.to_numpy() |
| -------- | -------- |
| ![image](https://hackmd.io/_uploads/Sy-WSXbeJe.png) | ![image](https://hackmd.io/_uploads/H1rYSX-gyx.png) |

Perform clustering

![image](https://hackmd.io/_uploads/Bk46O7bl1l.png)
![image](https://hackmd.io/_uploads/SJ2yFm-gkl.png)

```
_ = plot_cluster_dendrogram(
    linkage_matrix=linkage_matrix,
    dataset_df=zoo_df,
    clusters=clusters,
    leaf_font_size=7,
    labels=animal_class.to_numpy()  # when omitted, defaults to the animal name (species)
)
```

---

# 149. Demo: Clustering cars (mixed data)

## Load and preprocess

```
The 1985 Automobile Dataset is a comprehensive collection of data that captures various specifications and details about automobiles from that year. It typically includes various characteristics of the cars.

Dataset contains the following columns:

- `symboling`: Insurance risk rating, ranges from -3 to 3.
- `normalized-losses`: Average loss payment per insured vehicle, continuous from 65 to 256.
- `make`: Car manufacturer, e.g., BMW, Audi.
- `fuel-type`: Type of fuel used, diesel or gas.
- `aspiration`: Type of aspiration, standard (std) or turbo.
- `num-of-doors`: Number of doors, either four or two.
- `body-style`: Car body style, e.g., sedan, hatchback.
- `drive-wheels`: Type of drive wheels, 4WD, FWD, RWD.
- `engine-location`: Location of the engine, front or rear.
- `wheel-base`: Distance between front and rear wheels, continuous from 86.6 to 120.9.
- `length`: Length of the car, continuous from 141.1 to 208.1.
- `width`: Width of the car, continuous from 60.3 to 72.3.
- `height`: Height of the car, continuous from 47.8 to 59.8.
- `curb-weight`: Weight of the car without occupants, continuous from 1488 to 4066.
- `engine-type`: Type of engine, e.g., DOHC, OHCV.
- `num-of-cylinders`: Number of cylinders, e.g., four, six.
- `engine-size`: Size of the engine, continuous from 61 to 326.
- `fuel-system`: Type of fuel system, e.g., 1bbl, mpfi.
- `bore`: Diameter of each cylinder, continuous from 2.54 to 3.94.
- `stroke`: Distance the piston travels in the cylinder, continuous from 2.07 to 4.17.
- `compression-ratio`: Compression ratio of the engine, continuous from 7 to 23.
- `horsepower`: Engine power, continuous from 48 to 288.
- `peak-rpm`: Maximum engine speed, continuous from 4150 to 6600.
- `city-mpg`: City mileage, continuous from 13 to 49.
- `highway-mpg`: Highway mileage, continuous from 16 to 54.
- `price`: Price of the car, continuous from 5118 to 45400.

Citation: Schlimmer, Jeffrey. (1987). Automobile. UCI Machine Learning Repository. https://doi.org/10.24432/C5B01C.
```

Step:

```
Load and preview the dataset
Check data for missing values
```

```
categ_cols = ['make', 'fuel-type', 'aspiration', 'body-style', 'drive-wheels',
              'engine-location', 'engine-type', 'fuel-system']
```

```
Drop missing values
Visual inspection of numerical variables
```

![image](https://hackmd.io/_uploads/SkBtR7Wgyl.png)

```
Plot categorical columns
Determine number of rows needed for the grid
Create subplots
Flatten the axes array if there's more than one row
Plot the data
```

![image](https://hackmd.io/_uploads/B1FaAXbgyx.png)

```
Auto-create price bins
Customize price bins
Bin the price so we can easily plot prices on the dendrogram
Number of cars per price bin
```

```
price_bins
P1    95
P3    43
P2    37
P4    10
P5     6
Name: count, dtype: int64
```

```
Get categorical and numerical columns
Scale numerical columns
Visual inspection of numerical variables
```

![image](https://hackmd.io/_uploads/rklzeEWx1l.png)

Encode categorical variables

## Perform clustering based on gower distance

Step:

```
Calculate distance based on both numerical and categorical variables
Perform UMAP dimensionality reduction  <= this took me 4.5 seconds =-=
```

![image](https://hackmd.io/_uploads/rycrgEZgkg.png)
![image](https://hackmd.io/_uploads/ryYIeVWe1x.png)
![image](https://hackmd.io/_uploads/BkGwlVWlke.png)

```
Create dendrogram based on precomputed distance
Check cophenetic correlation
Plot dendrogram
```

Cophenetic correlation : 0.7371196249213059

![image](https://hackmd.io/_uploads/rJIWME-lJl.png)

Change the labels:

![image](https://hackmd.io/_uploads/rycuME-x1g.png)

```
Calculate inconsistency matrix with depth=4
Plot inconsistency scores for given depth
```

![image](https://hackmd.io/_uploads/SJQqMNZeJg.png)

```
Perform clustering based on inconsistency and print clusters
```

![image](https://hackmd.io/_uploads/r1bTMN-gkg.png)

Sanity check:

![image](https://hackmd.io/_uploads/BkpRGVZeJe.png)

Plot cluster feature values

Yuk Note: this one is too long — go watch the demo.

---

# 150. Chapter summary

► Agglomerative clustering (hierarchical).
* A bottom-up approach.

► Dendrograms.
* Simply tree diagrams.
* Described by the linkage matrix.
* Linkage options:
    * Single (minimum) linkage
    * Complete (maximum) linkage
    * Average linkage
    * Ward linkage

► Multiple methods for getting clusters from dendrograms.
* Use either the height-cutting or the inconsistency method.

► General guidelines for agglomerative clustering.
* Not much beyond the SciPy and sklearn notes above.

We also revisited the silhouette score.

No need to go over the datasets again XD

---
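As a recap of the chapter, a minimal end-to-end sketch tying the pieces together (synthetic data; the thresholds are illustrative, not tuned values from the demos):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, inconsistent

rng = np.random.default_rng(1)
# Two synthetic blobs standing in for a real dataset
data = np.vstack([rng.normal(0, 0.4, (25, 2)), rng.normal(4, 0.4, (25, 2))])

# Bottom-up: build the full merge hierarchy once...
Z = linkage(data, method='ward')  # 'ward' pairs with Euclidean distance

# ...then reuse the same linkage matrix to extract flat clusters several ways:
by_height = fcluster(Z, criterion='distance', t=3)   # height cut
by_count = fcluster(Z, criterion='maxclust', t=2)    # ask for k clusters

R = inconsistent(Z, d=4)                             # inconsistency matrix, depth 4
by_incons = fcluster(Z, criterion='inconsistent', t=1.0, R=R)

print(len(set(by_height)), len(set(by_count)))
```

This linkage-matrix reuse is exactly the SciPy advantage noted in the guidelines: sklearn's `AgglomerativeClustering` would refit for each extraction, and offers no inconsistency criterion.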