---
title: Clustering
tags: teach:MF
---
# Data mining and clustering
## Agenda
---
#### Target
1. Definitions of distance, similarity, and linkage.
2. [Ch 10 of the textbook](https://github.com/khanhnamle1994/statistical-learning/blob/master/Lecture-Slides/C10-Unsupervised-Learning.pdf)
3. Lab: K-means clustering using [iris data](https://archive.ics.uci.edu/ml/datasets/iris)
4. Lab: Hierarchical clustering
5. Hw 4
#### Reference materials
1. [Clustering: Lec 12 of Prof. David Sontag at MIT](https://people.csail.mit.edu/dsontag/courses/ml16/slides/lecture12.pdf)
2. [Hierarchical clustering, Lec 13 of David Sontag](https://people.csail.mit.edu/dsontag/courses/ml16/slides/lecture13.pdf)
---
## Distance, Similarity, and Linkage
#### Distance measures for numeric data points
Suppose there are two points $x=(x_1,\ldots,x_p)'$, $y=(y_1,\ldots,y_p)'$. (By convention, we write points as column vectors.)
The Euclidean distance between $x$ and $y$ is
$$D(x,y) =\sqrt{\sum_{i=1}^p (x_i-y_i)^2} = \sqrt{(x-y)'(x-y)}.$$
Manhattan distance:
$$D(x,y) =\sum_{i=1}^p |x_i-y_i|.$$
The Minkowski measure is a generic distance metric; Manhattan ($r=1$) and Euclidean ($r=2$) distances are special cases of it:
$$D(x,y) =\sqrt[r]{\sum_{i=1}^p |x_i-y_i|^r}.$$
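As a quick check of the definitions above, a minimal NumPy sketch (the vectors `x` and `y` are made up for illustration):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance of order r; r=1 is Manhattan, r=2 is Euclidean."""
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = minkowski(x, y, 1)  # |1-4| + |2-0| + |3-3| = 5
euclidean = minkowski(x, y, 2)  # sqrt(9 + 4 + 0) = sqrt(13)
```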
#### Similarity
Cosine similarity:
$$\mbox{similarity} = \cos(\theta)=\frac{x'y}{\|x\|\|y\|}.$$
Pearson correlation coefficient:
$$r = \frac{\sum_{i=1}^p (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^p (x_i-\bar{x})^2}\sqrt{\sum_{i=1}^p(y_i-\bar{y})^2}}.$$
Jaccard similarity:
$$\mbox{J} = \frac{|\mbox{Intersection}(A,B)|}{|\mbox{Union}(A,B)|} = \frac{|A\cap B|}{|A\cup B|}.$$
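The three similarity measures can be sketched directly from the formulas; the example vectors and sets below are made up for illustration:

```python
import numpy as np

def cosine_similarity(x, y):
    # cos(theta) = x'y / (||x|| ||y||)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # cosine similarity of the centered vectors
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def jaccard(A, B):
    # |intersection| / |union| for two sets
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # a positive multiple of x
# cosine_similarity(x, y) and pearson(x, y) are both 1 here
# jaccard({1,2,3}, {2,3,4}) = |{2,3}| / |{1,2,3,4}| = 0.5
```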
#### Linkage methods
Let $C_i$ and $C_j$ denote two clusters with size $n_i$ and $n_j$, respectively.
Single linkage for clusters $C_i$ and $C_j$ is
$$D(C_i,C_j) = \min\{D(x,y)|x\in C_i, y\in C_j\}.$$
Complete linkage for clusters $C_i$ and $C_j$ is
$$D(C_i,C_j) = \max\{D(x,y)|x\in C_i, y \in C_j\}.$$
Average linkage for clusters $C_i$ and $C_j$ is
$$D(C_i,C_j) = \texttt{mean}\{D(x,y)|x\in C_i, y \in C_j\}.$$
Ward's method defines the distance between clusters $C_i$ and $C_j$ as
$$D(C_i,C_j) = \sqrt{\frac{2n_in_j}{n_i+n_j}}\,\|c_i-c_j \|,$$
where $c_i$ and $c_j$ are centroids for clusters $C_i$ and $C_j$, respectively.
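The four linkage rules can be computed directly from their definitions. A NumPy sketch on two made-up one-dimensional clusters (the Ward line uses the $\sqrt{2n_in_j/(n_i+n_j)}\,\|c_i-c_j\|$ convention, as in SciPy):

```python
import numpy as np

def pairwise_distances(Ci, Cj):
    """All Euclidean distances D(x, y) with x in Ci and y in Cj."""
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

def single_linkage(Ci, Cj):
    return pairwise_distances(Ci, Cj).min()

def complete_linkage(Ci, Cj):
    return pairwise_distances(Ci, Cj).max()

def average_linkage(Ci, Cj):
    return pairwise_distances(Ci, Cj).mean()

def ward_linkage(Ci, Cj):
    ni, nj = len(Ci), len(Cj)
    ci, cj = Ci.mean(axis=0), Cj.mean(axis=0)
    return np.sqrt(2 * ni * nj / (ni + nj)) * np.linalg.norm(ci - cj)

Ci = np.array([[0.0], [1.0]])  # cluster {0, 1}
Cj = np.array([[4.0], [6.0]])  # cluster {4, 6}
# pairwise distances: 4, 6, 3, 5
# single = 3, complete = 6, average = (4+6+3+5)/4 = 4.5
```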
---
### Iris data
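A minimal K-means sketch for this lab, assuming scikit-learn is installed (it ships the iris data, so no separate download is needed):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data  # 150 samples, 4 numeric features

# Fit K-means with K = 3, matching the three iris species
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = km.labels_            # cluster assignment for each sample
centers = km.cluster_centers_  # one 4-dimensional centroid per cluster
```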

---
### HW 4
1. Project update. Update your HackMD file directly; there is no need to submit this question to E3.
   1. Update the title of your project.
   2. Find the data.
   3. Provide the link where you downloaded your data.
   4. Write up variable descriptions and do EDA for your data.
2. JWHT, page 414, Exercise 3. (Use Python for the coding part.)
3. Computing problem. Download the “seeds_dataset” file from NewE3 and use it to solve the following problem. The variable descriptions can be found at http://archive.ics.uci.edu/ml/datasets/seeds. Analyze the data by K-means clustering; explain clearly how you choose the number of clusters and justify your answer.
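One standard way to choose the number of clusters for the computing problem is the elbow method: plot the within-cluster sum of squares (inertia) against $K$ and look for the bend. A sketch assuming scikit-learn, using made-up blobs in place of the seeds measurements:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs standing in for three wheat varieties
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Within-cluster sum of squares for K = 1..6
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
# inertia drops sharply up to K = 3, then flattens: the elbow is at K = 3
```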