# Machine Learning Ch.11

## Let's get started !!!

### Grouping objects by similarity using k-means

#### K-means clustering using scikit-learn

The k-means algorithm belongs to the category of prototype-based clustering. K-means is very good at identifying clusters with a spherical shape.

```python=
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150,
                  n_features=2,
                  centers=3,
                  cluster_std=0.5,
                  shuffle=True,
                  random_state=0)
```

We generate a toy dataset with scikit-learn's make_blobs function and plot it.

```python=
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1],
            c='white', marker='o', edgecolor='black', s=50)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_01.png', dpi=300)
plt.show()
```

![](https://i.imgur.com/Kxd0xge.png)

```python=
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3,
            init='random',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)
```

Using KMeans, we fit the model and predict the cluster labels in one step. We set n_init=10 to run the k-means algorithm 10 times independently with different random centroids and keep the final model with the lowest SSE. Via the max_iter parameter, we specify the maximum number of iterations for each single run (here, 300).

We set the number of desired clusters to 3 and visualize the resulting clusters together with their centroids:

```python=
plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            marker='s', edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            marker='o', edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km == 2, 0],
            X[y_km == 2, 1],
            s=50, c='lightblue',
            marker='v', edgecolor='black',
            label='Cluster 3')
plt.scatter(km.cluster_centers_[:, 0],
            km.cluster_centers_[:, 1],
            s=250, marker='*',
            c='red', edgecolor='black',
            label='Centroids')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_02.png', dpi=300)
plt.show()
```

![](https://i.imgur.com/ERMdpxE.png)

```python=
print('Distortion: %.2f' % km.inertia_)
# Distortion = 72.48
```

### sklearn.cluster.KMeans.inertia_: the sum of squared distances of samples to their closest cluster center.

We can use a graphical tool, the so-called elbow method, to estimate the optimal number of clusters k for a given task. The idea behind the elbow method is to identify the value of k where the distortion begins to increase most rapidly, which becomes clearer if we plot the distortion for different values of k.

```python=
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
#plt.savefig('images/11_03.png', dpi=300)
plt.show()
```

### From this plot, we can see that the distortion becomes lower and lower as the number of clusters increases.

![](https://i.imgur.com/m0P3UUQ.png)

As we can see, the elbow is located at k=3, which is evidence that k=3 is indeed a good choice for this dataset.

---

# Quantifying the quality of clustering via silhouette plots

```python=
km = KMeans(n_clusters=3,
            init='random',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)

km = KMeans(n_clusters=2,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
```

We want to compare these two different KMeans configurations and see which one does a better job. With the help of the silhouette coefficient, we can now compare them quantitatively.
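As a quick check, scikit-learn's silhouette_score (the average of the per-sample silhouette coefficients) can score both configurations directly. The following is a minimal sketch, assuming the X array from the make_blobs step above is still in scope:

```python=
# Sketch: compare the two configurations by their average silhouette score.
# Values closer to 1 indicate dense, well-separated clusters.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (2, 3):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10,
                max_iter=300, tol=1e-04, random_state=0)
    labels = km.fit_predict(X)
    print('k = %d, average silhouette score: %.3f'
          % (k, silhouette_score(X, labels, metric='euclidean')))
```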
The silhouette coefficient measures how tightly a sample is grouped with the other samples in its own cluster compared with the samples in the nearest neighboring cluster; it ranges from -1 to 1, and higher values indicate better clustering. Beyond the average value, a per-sample silhouette plot tells us more about the individual clusters:

```python=
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples

km = KMeans(n_clusters=3,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

cluster_labels = np.unique(y_km)
print(cluster_labels)
n_clusters = cluster_labels.shape[0]
print(n_clusters)
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('images/11_04.png', dpi=300)
plt.show()
```

Through a visual inspection of the silhouette plot, we can quickly scrutinize the sizes of the different clusters and identify clusters that contain outliers.

![](https://i.imgur.com/X9ofwVP.png)

Now we can see that this data is well suited to being divided into three clusters. (P.S. np.unique extracts the distinct cluster labels from the prediction array.)

Let's see what happens if we divide the data into only two clusters.

---

To see what a silhouette plot looks like for a relatively bad clustering, let's seed the k-means algorithm with only two centroids.

```python=
km = KMeans(n_clusters=2,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            edgecolor='black',
            marker='s',
            label='Cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            edgecolor='black',
            marker='o',
            label='Cluster 2')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=250, marker='*', c='red', label='Centroids')
plt.legend()
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_05.png', dpi=300)
plt.show()
```

![](https://i.imgur.com/wdwtLn9.png)

```python=
print('Distortion: %.2f' % km.inertia_)
# The distortion is 283.46.
```

As you can see, the result is much worse: the distortion rises to 283.46. Next, we create the silhouette plot to evaluate the result.

```python=
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('images/11_06.png', dpi=300)
plt.show()
```

![](https://i.imgur.com/PBgGmcp.png)

As we can see in the resulting plot, the silhouettes now have visibly different lengths and widths, which is evidence for a relatively bad or at least suboptimal clustering.

### Quiz 1

#### 1.1 + 1.2

```python=
km = KMeans(n_clusters=4,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            marker='s', edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            marker='o', edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km == 2, 0],
            X[y_km == 2, 1],
            s=50, c='lightblue',
            marker='v', edgecolor='black',
            label='Cluster 3')
plt.scatter(X[y_km == 3, 0],
            X[y_km == 3, 1],
            s=50, c='pink',
            marker='v', edgecolor='red',
            label='Cluster 4')
plt.scatter(km.cluster_centers_[:, 0],
            km.cluster_centers_[:, 1],
            s=250, marker='*',
            c='red', edgecolor='black',
            label='Centroids')
plt.legend()
plt.grid()
plt.tight_layout()
#plt.savefig('images/11_05.png', dpi=300)
plt.show()

# print(km.inertia_)
print('Distortion: %.2f' % km.inertia_)
# Distortion: 62.84
```

![](https://i.imgur.com/Ue2xSOv.png)

#### 1.3

```python=
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('images/11_06.png', dpi=300)
plt.show()
```

![](https://i.imgur.com/3GVZopU.png)

---

## Organizing clusters as a hierarchical tree

### Grouping clusters in a bottom-up fashion

Using single linkage, we compute the distances between the most similar members for each pair of clusters and merge the two clusters for which this distance is the smallest. Complete linkage works the other way around: we compare the most dissimilar members of each pair of clusters and merge the two clusters for which that distance is the smallest.

![](https://i.imgur.com/HGDCIUD.png)
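In symbols (using the standard definitions), for two clusters $A$ and $B$ and the pairwise sample distance $d$ (Euclidean distance in the examples below), the two linkage criteria are:

$$
d_{\text{single}}(A, B) = \min_{a \in A,\, b \in B} d(a, b), \qquad
d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b)
$$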
We will discuss how to compute the distance matrix (step 1). But first, let's generate some random sample data to work with: the rows represent different observations (IDs 0-4), and the columns are the different features (X, Y, Z) of those samples.

```python=
import pandas as pd
import numpy as np

np.random.seed(123)

variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']

X = np.random.random_sample([5, 3]) * 10
df = pd.DataFrame(X, columns=variables, index=labels)
df
```

![](https://i.imgur.com/BJqhypE.png)

## Performing hierarchical clustering on a distance matrix

To calculate the distance matrix as input for the hierarchical clustering algorithm, we will use the pdist function from SciPy's spatial.distance submodule.

```python=
from scipy.spatial.distance import pdist, squareform

row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels,
                        index=labels)
row_dist
```

![](https://i.imgur.com/aQikyhW.png)

Next, we will apply complete linkage agglomeration to our clusters using the linkage function from SciPy's cluster.hierarchy submodule, which returns a so-called linkage matrix. However, before we call the linkage function, let us take a careful look at the function documentation.

```python=
from scipy.cluster.hierarchy import linkage

# Incorrect: a squareform distance matrix is treated as a matrix of raw
# observations, so linkage would compute distances of distances. Be careful!
row_clusters = linkage(row_dist, method='complete', metric='euclidean')

# Correct: pass the condensed distance matrix returned by pdist. Remember this!
row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
```

To take a closer look at the clustering results, we can turn them into a pandas DataFrame (best viewed in a Jupyter Notebook) as follows.

```python=
# 1. incorrect approach: squareform distance matrix
from scipy.cluster.hierarchy import linkage

row_clusters = linkage(row_dist, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=['cluster %d' % (i + 1)
                    for i in range(row_clusters.shape[0])])
```

The first and second columns denote the most dissimilar members in each cluster, the third column reports the distance between those members, and the last column gives the count of the members in each cluster.

![](https://i.imgur.com/jcVd47k.png)

```python=
# 2. correct approach: condensed distance matrix
row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=['cluster %d' % (i + 1)
                    for i in range(row_clusters.shape[0])])
```

```python=
# 3. correct approach: input sample matrix
row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=['cluster %d' % (i + 1)
                    for i in range(row_clusters.shape[0])])
```

![](https://i.imgur.com/JiBSY6W.png)
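If flat cluster labels are needed rather than the full tree, SciPy's fcluster function can cut the linkage matrix at a chosen level. The following is a minimal sketch, reusing row_clusters and labels from the cells above; cutting into two clusters via criterion='maxclust' is just an illustrative choice:

```python=
# Sketch: cut the hierarchical tree into (at most) two flat clusters.
from scipy.cluster.hierarchy import fcluster

flat_labels = fcluster(row_clusters, t=2, criterion='maxclust')
print(dict(zip(labels, flat_labels)))  # maps ID_0..ID_4 to a flat cluster label
```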
Now that we have computed the linkage matrix, we can visualize the results in the form of a dendrogram.

```python=
from scipy.cluster.hierarchy import dendrogram

# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])

row_dendr = dendrogram(row_clusters,
                       labels=labels,
                       # make dendrogram black (part 2/2)
                       # color_threshold=np.inf
                       )
plt.tight_layout()
plt.ylabel('Euclidean distance')
#plt.savefig('images/11_11.png', dpi=300,
#            bbox_inches='tight')
plt.show()
```

Such a dendrogram summarizes the different clusters that were formed during the agglomerative hierarchical clustering; for example, we can see that the samples ID_0 and ID_4, followed by ID_1 and ID_2, are the most similar ones based on the Euclidean distance metric.

![](https://i.imgur.com/w9j273r.png)

---

## Attaching dendrograms to a heat map

First, we create a new figure object and define the x axis position, y axis position, width, and height of the dendrogram via the add_axes attribute. Furthermore, we rotate the dendrogram 90 degrees counter-clockwise.

Next, we reorder the data in our initial DataFrame according to the clustering labels, which can be accessed from the dendrogram object (essentially a Python dictionary) via the leaves key.

Now, we construct the heat map from the reordered DataFrame and position it next to the dendrogram.

Finally, we modify the aesthetics of the dendrogram by removing the axis ticks and hiding the axis spines. Also, we add a color bar and assign the feature and sample names to the x and y axis tick labels, respectively.

```python=
# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])

# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')

# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]

axd.set_xticks([])
axd.set_yticks([])

# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)

# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6])  # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))

#plt.savefig('images/11_12.png', dpi=300)
plt.show()
```

![](https://i.imgur.com/QDC0zB2.png)

As we can see, the order of rows in the heat map reflects the clustering of the samples in the dendrogram. In addition to a simple dendrogram, the color-coded values of each sample and feature in the heat map provide us with a nice summary of the dataset.

---

## Locating regions of high density via DBSCAN

To get a better understanding of what the result of DBSCAN can look like before jumping to the implementation, let's summarize the three kinds of points DBSCAN distinguishes, namely core points, border points, and noise points, in the following figure.

![](https://i.imgur.com/4uZqnYR.png)

For a more illustrative example, let's create a new dataset of half-moon-shaped structures to compare k-means clustering, hierarchical clustering, and DBSCAN.
```python=
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
plt.tight_layout()
#plt.savefig('images/11_14.png', dpi=300)
plt.show()
```

As we can see in the resulting plot, there are two visible, half-moon-shaped groups consisting of 100 sample points each.

![](https://i.imgur.com/Khfkznp.png)

We will start by using the k-means algorithm and complete linkage clustering to see if one of those previously discussed clustering algorithms can successfully identify the half-moon shapes as separate clusters.

K-means and hierarchical clustering:

```python=
from sklearn.cluster import AgglomerativeClustering

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

km = KMeans(n_clusters=2, random_state=0)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km == 0, 0], X[y_km == 0, 1],
            edgecolor='black',
            c='lightblue', marker='o', s=40, label='cluster 1')
ax1.scatter(X[y_km == 1, 0], X[y_km == 1, 1],
            edgecolor='black',
            c='red', marker='s', s=40, label='cluster 2')
ax1.set_title('K-means clustering')

ac = AgglomerativeClustering(n_clusters=2,
                             affinity='euclidean',
                             linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac == 0, 0], X[y_ac == 0, 1],
            c='lightblue',
            edgecolor='black',
            marker='o', s=40, label='Cluster 1')
ax2.scatter(X[y_ac == 1, 0], X[y_ac == 1, 1],
            c='red',
            edgecolor='black',
            marker='s', s=40, label='Cluster 2')
ax2.set_title('Agglomerative clustering')

plt.legend()
plt.tight_layout()
#plt.savefig('images/11_15.png', dpi=300)
plt.show()
```

Based on the visualized clustering results, we can see that the k-means algorithm is unable to separate the two clusters, and the hierarchical clustering algorithm is also challenged by these complex shapes.

![](https://i.imgur.com/iPrrkgK.png)

Finally, let us try the DBSCAN algorithm on this dataset to see if it can find the two half-moon-shaped clusters using a density-based approach.

```python=
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1],
            c='lightblue', marker='o', s=40,
            edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1],
            c='red', marker='s', s=40,
            edgecolor='black',
            label='Cluster 2')
plt.legend()
plt.tight_layout()
#plt.savefig('images/11_16.png', dpi=300)
plt.show()
```

The DBSCAN algorithm successfully detects the half-moon shapes, which highlights one of the strengths of DBSCAN: clustering data of arbitrary shape.

![](https://i.imgur.com/oUFOo1X.png)
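Note that the scatter calls above only draw the points assigned to clusters 0 and 1; DBSCAN labels noise points as -1, so any noise would silently be left out of that plot. A small sketch, reusing db and y_db from the block above together with the fitted estimator's core_sample_indices_ attribute, makes the three point types from the earlier figure explicit:

```python=
# Sketch: recover core, border, and noise points from the fitted DBSCAN model.
import numpy as np

core_mask = np.zeros(len(db.labels_), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
noise_mask = (y_db == -1)                   # noise points (label -1, not plotted above)
border_mask = ~core_mask & ~noise_mask      # border points: in a cluster, but not core

print('Core: %d, Border: %d, Noise: %d'
      % (core_mask.sum(), border_mask.sum(), noise_mask.sum()))
```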