# Section 9 Part B
[144. Demo: Agglomerative clustering with SciPy & dendrogram manipulation](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#144-Demo-Agglomerative-clustering-with-SciPy-amp-dendrogram-manipulation)
[145. Demo: Agglomerative clustering with sklearn](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#145-Demo-Agglomerative-clustering-with-sklearn)
[146. Agglomerative clustering general guidelines](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#146-Agglomerative-clustering-general-guidelines)
[147. Demo: Clustering cars (numerical data)](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#147-Demo-Clustering-cars-numerical-data)
[148. Demo: Clustering animals (categorical data)](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#148-Demo-Clustering-animals-categorical-data)
[149. Demo: Clustering cars (mixed data)](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#149-Demo-Clustering-cars-mixed-data)
[150. Chapter summary](https://hackmd.io/dZwXb0ydRiiYutIg4Kfn7w#150-Chapter-summary)
---
# 144. Demo: Agglomerative clustering with SciPy & dendrogram manipulation
## Artificial dataset

### "Height cut" based clustering
```
from scipy.cluster.hierarchy import linkage, fcluster

# Build the linkage matrix using average linkage
linkage_matrix = linkage(data_df, method='average')

# Cut the dendrogram at a fixed height to obtain flat clusters
cut_height = 4
clusters = fcluster(linkage_matrix, criterion='distance', t=cut_height)
```


The result is not much different from before, but cluster labels start from 1 instead of 0.
### Clustering based on a target number of clusters

```
# Cluster based on the desired number of clusters
clusters = fcluster(Z=linkage_matrix, t=4, criterion='maxclust')
clusters
```

### Color the dendrogram based on clustering
Colors are assigned based on the cluster identity of each data point.
Step flow:
```
Get unique clusters
Get first two columns of linkage matrix (the IDs of the two nodes merged at each step)
Get last "original" point ID (highest)
Initialize dict that will hold cluster-link relations (cluster_links = {})
Iterate through all the clusters:
    Get all points belonging to the current cluster
    Iterate until we collect all nodes associated with the current cluster:
        Find all positions in the linkage matrix where points from the current cluster
        are present. These positions also denote IDs of merged nodes
        (merged node id = position in linkage mtx + max_orig_point_id)
        Sum the boolean membership matrix over the two columns to find rows where
        BOTH merged nodes belong to the current cluster (True + True == 2)
        Get node locations
        Keep a flag that is set to True when new nodes are added (starts as False)
        Iterate through the nodes:
            Get the true IDs of the new nodes and verify they belong to the current cluster
            Add new nodes
        If no new nodes are added, all nodes associated with the current cluster are
        already collected; break this loop and proceed to the next cluster
        (like checking tickets: once everyone is checked, move on)
        Add merged nodes for the next iteration
===== After all clusters are processed =====
Get color palette having one color per cluster (apply the colors)
Transform colors to hex format (the dendrogram function requires it)
```
One patch corresponds to one cluster.
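A minimal Python sketch of the step flow above, assuming `linkage_matrix` and `clusters` from the earlier cells (the palette choice and variable names are mine, not the course's):

```
import numpy as np
import seaborn as sns
from matplotlib.colors import to_hex
from scipy.cluster.hierarchy import dendrogram

unique_clusters = np.unique(clusters)
merge_cols = linkage_matrix[:, :2]       # the two node IDs merged at each step
max_orig_id = linkage_matrix.shape[0]    # == n - 1, the highest original point ID
cluster_links = {}                       # maps linkage-matrix row -> cluster label

for cluster in unique_clusters:
    # Start from the original points that belong to the current cluster
    nodes = set(np.where(clusters == cluster)[0])
    while True:
        # Rows where BOTH merged nodes belong to the current cluster (True + True == 2)
        both_in = np.isin(merge_cols, list(nodes)).sum(axis=1) == 2
        link_rows = np.where(both_in)[0]
        # Merged node ID = row position + max_orig_id + 1
        new_nodes = {row + max_orig_id + 1 for row in link_rows} - nodes
        if not new_nodes:
            break  # all nodes of this cluster collected; proceed to the next cluster
        for row in link_rows:
            cluster_links[row] = cluster
        nodes |= new_nodes

# One color per cluster, converted to hex (the dendrogram function requires it)
palette = [to_hex(c) for c in sns.color_palette('tab10', len(unique_clusters))]
cluster_color = dict(zip(unique_clusters, palette))

_ = dendrogram(
    linkage_matrix,
    link_color_func=lambda node_id: cluster_color.get(
        cluster_links.get(node_id - max_orig_id - 1), 'grey'),
)
```
Links that merge two different clusters are not in `cluster_links`, so they fall back to grey.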

Yuk Note:
You can then orient the dendrogram top, bottom, left, or right.

The key point: since agglomerative clustering (as mentioned earlier) is a bottom-up approach, the orientation directions end up reversed.

### Inconsistency method

Perform clustering by cutting the dendrogram at the obtained inconsistency threshold of 2.5.

Perform clustering by cutting the dendrogram at an inconsistency threshold of thold = 2.2.
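A minimal sketch of the inconsistency-based cut, assuming `linkage_matrix` from above (the depth value here is an assumption):

```
from scipy.cluster.hierarchy import inconsistent, fcluster

# Inconsistency statistics per link (d controls how many levels below each link are considered)
incons_matrix = inconsistent(linkage_matrix, d=2)

# Cut the dendrogram wherever the inconsistency coefficient exceeds the threshold
thold = 2.2
clusters = fcluster(linkage_matrix, t=thold, criterion='inconsistent', R=incons_matrix)
```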


### Silhouette score

A cluster-quality check: the silhouette score. (Let's have ChatGPT explain this one.)
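A one-liner with sklearn, assuming `data_df` and the `clusters` labels from above:

```
from sklearn.metrics import silhouette_score

# ~1: well separated; ~0: overlapping clusters; < 0: samples likely in the wrong cluster
score = silhouette_score(data_df, clusters)
print(f'Silhouette score: {score:.3f}')
```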

---
# 145. Demo: Agglomerative clustering with sklearn

Adjusted Rand Index (ARI): measures the agreement between two labelings, corrected for chance (1 = identical partitions, ≈0 = random agreement).

Steps: perform the clustering, calculate the ARI, and print it.
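A minimal sketch of the sklearn workflow, assuming `data_df` holds the features and `true_labels` the ground-truth labels (both names, and `n_clusters=4`, are my assumptions):

```
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Fit the clustering and compare predicted labels against the ground truth
model = AgglomerativeClustering(n_clusters=4, linkage='ward')
predicted = model.fit_predict(data_df)
print('ARI :', adjusted_rand_score(true_labels, predicted))
```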
---
| Column 1 | Column 2 |
| -------- | -------- |
| ARI : 0.5637510205230709 | ARI : 0.6422512518362898 |
| ARI : 0.7591987071071522 | ARI : 0.7311985567707746 |
## Digits dataset

K=3
Steps: perform the clustering, calculate the ARI, and print it.
Only 500 points are used.
---
| Column 1 | Column 2 |
| -------- | -------- |
| ARI = 7.249519851737537e-08 | ARI = 0.15777822431598285 |
| ARI = 0.04417339487518105 | ARI = 0.5016084210255642 |
Yuk Note:
Write the program once so it automatically picks the configuration with the highest ARI.
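A self-contained sketch of that idea on the digits data (500 points, as in the demo; using 10 clusters for the 10 digit classes is my assumption):

```
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

# Only 500 points, as in the demo
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Try each linkage once and keep the one with the highest ARI
results = {}
for linkage_name in ['ward', 'complete', 'average', 'single']:
    predicted = AgglomerativeClustering(n_clusters=10, linkage=linkage_name).fit_predict(X)
    results[linkage_name] = adjusted_rand_score(y, predicted)

best = max(results, key=results.get)
print(f'Best linkage: {best} (ARI = {results[best]:.4f})')
```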
---
# 146. Agglomerative clustering general guidelines
---
* Data preprocessing guidelines.
* Pros.
* Cons.
* Optimal use cases.
* Sklearn vs scipy.
---
## Data preprocessing guidelines.
* Inspect the data before loading.
* Clean and reformat the data, if required.
* Handle missing values.
* All features should be on a similar scale.
    * Standard, MinMax or other scaling.
* If the dataset is categorical:
    * Option 1: perform ordinal encoding and use Hamming distance.
    * Option 2: perform one-hot encoding and use Euclidean distance.
* If the features are mixed (numerical and categorical), see the sketch after this list:
    * Option 1: scale the numerical features, use ordinal encoding for the categorical features, and use a distance measure such as Gower.
    * Option 2: scale the numerical features, one-hot encode the categorical features, and use Euclidean distance.
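A small sketch of the two options on a toy mixed dataframe (all names and values here are hypothetical):

```
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({'weight': [1200.0, 950.0, 1500.0],
                   'color': ['red', 'blue', 'red']})

# Both options scale the numerical features first
num_scaled = StandardScaler().fit_transform(df[['weight']])

# Option 1: ordinal-encode categoricals, then use Hamming distance on them
cat_ordinal = OrdinalEncoder().fit_transform(df[['color']])
ham_dist = pdist(cat_ordinal, metric='hamming')

# Option 2: one-hot encode categoricals, then use Euclidean distance
cat_onehot = pd.get_dummies(df[['color']]).to_numpy(dtype=float)
euc_dist = pdist(cat_onehot, metric='euclidean')
```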
### Linkage choice:
* Usually:
    * Use "ward" when Euclidean distance is used.
    * Use "average" when using other distance measures.
    * (Just a general guideline.)
* For data in certain domains, other linkages may work better due to the nature of the data!
    * Check academic papers & studies!
## Pros.
* DENDROGRAM
    * A way to examine clusterings of different granularity in a single view.
    * Useful even when clustering is not performed!
* Number of clusters is not a required input.
* Works with most distance measures.
* Can uncover various cluster shapes.
* Resistant to noise and outliers to some extent (linkage dependent).
    * Different linkages can handle different scenarios.
## Cons.
* HIGH NUMBER OF SAMPLES
    * Computational complexity: can't be used with large datasets.
    * The dendrogram is not easy to analyze when the number of data points is large (e.g. >500).
* CAPTURING DATA STRUCTURE
    * Limited power when it comes to capturing structure in the data.
    * No linkage rule is perfect!
    * The dendrogram's "group merging" structure may cause loss of pairwise distance information.
    * Not completely resistant to outliers and noise.
    * Chaining effects.
* SELECTING THRESHOLDS
    * Sometimes it can be hard to find the perfect cut height.
    * Sometimes it can be hard to spot the right threshold on the inconsistency plot.
* DIAGNOSTIC TOOLS
    * Use **UMAP** and **silhouette scores** to additionally characterize the quality of the clusters!
    * Use **cophenetic correlation** to analyze dendrogram quality.
    * Satisfying dendrogram quality **DOES NOT** guarantee high-quality clusters!
## Optimal use cases.
* Number of data samples is **small**.
* We care about all the data **instances**.
    * e.g. genes: the relationships between specific genes and groups of genes.
* We care about **relationships** between groups of various granularities.
* Domain-**specific** data where agglomerative clustering is preferred.
    * Based on **published** papers and studies.
## Sklearn vs scipy.
Pros:
| Sklearn | Scipy |
| -------- | -------- |
| **Compatible** with other sklearn tools. | **Separate functions** for **each** step of the workflow. |
| Allows use of a **connectivity** matrix. | **Easier** dendrogram plotting. |
| Slightly **less code** compared to scipy. | Allows use of the **inconsistency** method. |
| | **Allows** linkage matrix **reuse**. |

Cons:
| Sklearn | Scipy |
| -------- | -------- |
| Requires **additional** work to plot a dendrogram. | No real disadvantages. |
| **Doesn't** allow linkage matrix **reuse**. | |
| **Inconsistency** method **NOT** available. | |

Sklearn: the better choice for the majority of agglomerative clustering use cases.
Scipy: no real disadvantages for agglomerative clustering; a natural fit in scientific Python workflows.
---
# 147. Demo: Clustering cars (numerical data)
## Load and preprocess the data
```
The car MPG (Miles Per Gallon) dataset, often known as the Auto MPG dataset, is a popular collection of data that was sourced from the 1970s and 1980s. It provides a detailed insight into various attributes of cars that were in the market during that period.
Dataset contains following columns:
1. `mpg`: Stands for Miles Per Gallon. This measures the distance in miles that a car can travel per gallon of fuel.
2. `cylinders`: Indicates the number of cylinders in the car's engine. This can be related to the power output of the engine.
3. `displacement`: A measure of the total volume of all the cylinders in an engine, typically measured in cubic inches or cubic centimeters.
4. `horsepower`: The power output of the car's engine, typically measured in horsepower.
5. `weight`: The total weight of the car, typically measured in pounds.
6. `acceleration`: A measure of how quickly the car can increase its speed, typically represented in seconds to go from 0 to 60 miles per hour.
7. `model year`: The year when the car model was released, typically represented as a two-digit number from 70 to 82 (for 1970 to 1982).
8. `origin`: A categorical variable representing the region where the car was manufactured. This is usually represented as a number: 1 for America, 2 for Europe, and 3 for Asia.
9. `car name`: The full name of the car model, typically in the format of "Manufacturer Model" (e.g., "ford torino").
Citation : Quinlan,R.. (1993). Auto MPG. UCI Machine Learning Repository. https://doi.org/10.24432/C5859H.
```
Yuk Note: the most annoying part of loading a dataset is `\` vs `/` (path separators).
Step:
```
Load dataset
Preview
Check for missing values
Preview column datatypes
Remove missing values
Get number of unique names
Give unique name to each car by adding prefix (nth_occurence_car-name)
Cast other columns to float
Plot variable value distribution
```

```
Remove origin column and save it as a separate variable
Make a copy of the dataframe for later use
Scale the data
```
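A sketch of those three steps, assuming the dataframe is called `cars_df` (the variable names are mine):

```
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Remove the origin column and keep it aside for later comparison
origin = cars_df.pop('origin')

# Unscaled copy for later cluster characterization
cars_df_orig = cars_df.copy()

# Scale the numerical features
scaled_df = pd.DataFrame(StandardScaler().fit_transform(cars_df),
                         index=cars_df.index, columns=cars_df.columns)
```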
## Agglomerative clustering
Cophenetic correlation : 0.7682471079646638
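The cophenetic correlation measures how faithfully the dendrogram preserves the original pairwise distances. A sketch, assuming `scaled_df` from above (the linkage method here is an assumption):

```
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

linkage_matrix = linkage(scaled_df, method='ward')
coph_corr, _ = cophenet(linkage_matrix, pdist(scaled_df))
print('Cophenetic correlation :', coph_corr)
```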


## Clustering using the height method
Perform UMAP dimensionality reduction

umap.plot.connectivity

Colored by RGB coords of the PCA embedding
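A sketch of the UMAP step, assuming `scaled_df` from above (default UMAP parameters; the demo's settings may differ, and `umap.plot` needs the umap-learn plotting extras installed):

```
import umap
import umap.plot

# Fit a 2-D UMAP embedding on the scaled features
mapper = umap.UMAP(random_state=42).fit(scaled_df)

# Plot the UMAP graph edges on top of the embedded points
umap.plot.connectivity(mapper, show_points=True)
```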

Perform clustering based on height

Sanity-check the result.

Yuk Note: hmm............
Let's look at the distributions instead; that's far more interesting.
(Per-cluster variable distribution plots.)
## Clustering using the inconsistency method
Calculate the inconsistency matrix
depth=4

Perform clustering based on inconsistency and print the clusters

Sanity-check the result.

## Cluster characterization
(Per-cluster feature characterization plots.)
Then take a look at the degrees of freedom.

I'm not going to go hunting for photos of the cars, haha.
---
# 148. Demo: Clustering animals (categorical data)
```
The Zoo Dataset is a comprehensive collection of data about various animals found in zoos worldwide. The dataset is composed of several attributes related to these animals, such as:
- `animal_name`: The name of the animal.
- `hair`: Indicates if the animal has hair (1 for yes, 0 for no).
- `feathers`: Indicates if the animal has feathers (1 for yes, 0 for no).
- `eggs`: Indicates if the animal lays eggs (1 for yes, 0 for no).
- `milk`: Indicates if the animal produces milk (1 for yes, 0 for no).
- `airborne`: Indicates if the animal can fly (1 for yes, 0 for no).
- `aquatic`: Indicates if the animal lives in water (1 for yes, 0 for no).
- `predator`: Indicates if the animal is a predator (1 for yes, 0 for no).
- `toothed`: Indicates if the animal has teeth (1 for yes, 0 for no).
- `backbone`: Indicates if the animal has a backbone (1 for yes, 0 for no).
- `breathes`: Indicates if the animal breathes air (1 for yes, 0 for no).
- `venomous`: Indicates if the animal is venomous (1 for yes, 0 for no).
- `fins`: Indicates if the animal has fins (1 for yes, 0 for no).
- `legs`: Number of legs the animal has (integer value).
- `tail`: Indicates if the animal has a tail (1 for yes, 0 for no).
- `domestic`: Indicates if the animal is domesticated (1 for yes, 0 for no).
- `catsize`: Indicates if the animal is cat-sized or larger (1 for yes, 0 for no).
- `class_type`: Numerical code indicating the animal's taxonomic class.
Citation : Forsyth,Richard. (1990). Zoo. UCI Machine Learning Repository. https://doi.org/10.24432/C5R59V.
```
Cluster the animals by their features~
Step
```
Load the data
Load class mapping
Check number of unique animals
Drop duplicates since there should be one species per row
Map class id to class name
Extract class as separate object and drop class from zoo_df
Set animal name as index
```

## Agglomerative clustering with hamming distance
Cophenetic correlation : 0.8584617992386253
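A sketch of the distance/linkage combination here, following the guideline from lecture 146 ("average" with non-Euclidean distances); `zoo_df` holds the binary features:

```
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hamming distance over the binary features, then average linkage
ham_dist = pdist(zoo_df, metric='hamming')
linkage_matrix = linkage(ham_dist, method='average')
```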

| `labels=zoo_df.index` | `labels=animal_class.to_numpy()` |
| -------- | -------- |
| (dendrogram labeled by animal name) | (dendrogram labeled by animal class) |
Perform clustering


```
_ = plot_cluster_dendrogram(
    linkage_matrix=linkage_matrix,
    dataset_df=zoo_df,
    clusters=clusters,
    leaf_font_size=7,
    labels=animal_class.to_numpy(),  # if omitted, labels default to the index (animal names)
)
```
---
# 149. Demo: Clustering cars (mixed data)
## Load and preprocess the data
```
The 1985 Automobile Dataset is a comprehensive collection of data that captures various specifications and details about automobiles from that year. It typically includes various characteristics of the cars. Dataset contains following columns:
- `symboling`: Insurance risk rating, ranges from -3 to 3.
- `normalized-losses`: Average loss payment per insured vehicle, continuous from 65 to 256.
- `make`: Car manufacturer, e.g., BMW, Audi.
- `fuel-type`: Type of fuel used, diesel or gas.
- `aspiration`: Type of aspiration, standard (std) or turbo.
- `num-of-doors`: Number of doors, either four or two.
- `body-style`: Car body style, e.g., sedan, hatchback.
- `drive-wheels`: Type of drive wheels, 4WD, FWD, RWD.
- `engine-location`: Location of the engine, front or rear.
- `wheel-base`: Distance between front and rear wheels, continuous from 86.6 to 120.9.
- `length`: Length of the car, continuous from 141.1 to 208.1.
- `width`: Width of the car, continuous from 60.3 to 72.3.
- `height`: Height of the car, continuous from 47.8 to 59.8.
- `curb-weight`: Weight of the car without occupants, continuous from 1488 to 4066.
- `engine-type`: Type of engine, e.g., DOHC, OHCV.
- `num-of-cylinders`: Number of cylinders, e.g., four, six.
- `engine-size`: Size of the engine, continuous from 61 to 326.
- `fuel-system`: Type of fuel system, e.g., 1bbl, mpfi.
- `bore`: Diameter of each cylinder, continuous from 2.54 to 3.94.
- `stroke`: Distance piston travels in cylinder, continuous from 2.07 to 4.17.
- `compression-ratio`: Compression ratio of the engine, continuous from 7 to 23.
- `horsepower`: Engine power, continuous from 48 to 288.
- `peak-rpm`: Maximum engine speed, continuous from 4150 to 6600.
- `city-mpg`: City mileage, continuous from 13 to 49.
- `highway-mpg`: Highway mileage, continuous from 16 to 54.
- `price`: Price of the car, continuous from 5118 to 45400.
Citation : Schlimmer,Jeffrey. (1987). Automobile. UCI Machine Learning Repository. https://doi.org/10.24432/C5B01C.
```
Step
```
Load and preview the dataset
Check data for missing values
```
```
categ_cols = ['make', 'fuel-type', 'aspiration', 'body-style', 'drive-wheels',
              'engine-location', 'engine-type', 'fuel-system']
```
```
Drop missing values
Visual inspection of numerical variables
```

```
Plot categorical columns
Determine number of rows needed for the grid
Create subplots
Flatten the axes array if there's more than one row
Plot the data
```

```
Auto-create price bins
Customize price bins
Bin the price so we can easily plot prices on the dendrogram
Number of cars per price bin
```
```
price_bins
P1    95
P3    43
P2    37
P4    10
P5     6
Name: count, dtype: int64
```
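A sketch of the binning step with `pd.cut`, assuming the dataframe is `cars_df` (equal-width bins here; the demo customizes the bin edges):

```
import pandas as pd

# Bin the price into 5 labeled bins so prices can be shown on the dendrogram
price_bins = pd.cut(cars_df['price'], bins=5, labels=['P1', 'P2', 'P3', 'P4', 'P5'])
price_bins.value_counts()
```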
```
Get categorical and numerical columns
Scale numerical columns
Visual inspection of numerical variables
```

Encode categorical variables
## Perform clustering based on Gower distance
Step
```
Calculate distance based on both numerical and categorical variables
Perform UMAP dimensionality reduction (took me 4.5 seconds, ugh)
```
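A sketch of the Gower-distance step, assuming the `gower` package and the mixed dataframe `cars_df` (the linkage method is an assumption):

```
import gower
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Gower distance handles numerical and categorical columns in one measure
dist_matrix = gower.gower_matrix(cars_df)

# linkage() expects a condensed distance vector for precomputed distances
linkage_matrix = linkage(squareform(dist_matrix, checks=False), method='average')
```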



```
Create dendrogram based on precomputed distance
Check Cophenetic correlation
Plot dendrogram
```
Cophenetic correlation : 0.7371196249213059

Change the labels.

```
Calculate the inconsistency matrix
depth=4
Plot inconsistency scores for given depth
```

```
Perform clustering based on inconsistency and print clusters
```

Sanity-check the result.

Plot cluster feature values
This part is too long to show here; see the demo.
---
# 150. Chapter summary
► Agglomerative clustering (hierarchical).
* A bottom-up approach.
► Dendrograms.
* Tree diagrams, essentially.
* Described by a linkage matrix.
* Linkages: single (minimum), complete (maximum), average, and Ward.
► Multiple methods for getting clusters from dendrograms.
* Use the height-cutting or the inconsistency method.
► General guidelines for agglomerative clustering.
* Mostly a SciPy vs sklearn comparison, plus a refresher on the silhouette score.
I won't go over the datasets again, haha.
---