k-means - HackMD

###### tags: `clusting` # k-means - [PPT](https://docs.google.com/presentation/d/1V41e7JSFn0f_3RwTIzgKc7OYKVbMgkAS/edit?usp=sharing&ouid=105391835731695980576&rtpof=true&sd=true) - **甚麼是k-means呢?** 就是經過演算後會把資料分成k群，那這也是他的目的 - **切割** 切割分2種，一個是硬切割，另外一個是軟切割。==k-means 屬於硬切割== >硬切割: 硬切割就是每一個點一定會分到一個群，那這個點屬於這群的關係程度就是1 軟切割: 就不一定了他有可能被分到2群、3群、4群...，如果他被分到兩群，那這兩群是用關係係數去代表他被分到這兩群的關係程度。 **非監督式學習** 只有資料本身，不需要給label，就可以去做分群這個過程我們等一下會再講解，先給大家看一下過程 ### 步驟 1. 先決定要分幾群，隨機產生k個群心 2. 將每一個點分配到最近的群心 > 歐機里德距離、曼哈頓距離、餘弦距離、平均...等等都可以，可以依照題目的條件決定，全部分配完後就計算一次群心 3. 計算分配完各個的群心為多少 > 得到這個群心後，再分一次 > ![](https://i.imgur.com/s6EFfSX.gif) 5. 2~3重複運算到群心不再變動 > 直到計算前的群心和計算後的群心一樣 -> 那就代表演算已經收斂了 -> 這樣就是產生一個結果 P.S 因為初始群心會影響－＞產出不一定是最好的，所以通常會產生好幾個結果－＞再從眾多結果當中找出最好的 ### 判斷具類結果好壞 #### SSE 誤差平方和 - **點**到**所屬群的中心點**距離平方和 - K-means clustering就是希望可以最小化群內的資料和群心的誤差平方和越小越好，所以總距離越小代表各群越集中 - 若是選定k群，取最小值 ![](https://i.imgur.com/8CFhJRZ.png) #### Silhouette Coefficient 輪廓係數 - 群體之間的關係 > - a(i) = average(點到所有它屬於的群中其它點的距離) - b(i) = min(點到與它相鄰最近的一群內的所有點的平均距離) > - ai 越小，說明樣本i越應該被聚類到該簇。將ai 稱為樣本i的簇內不相似度。 - bi越大，說明樣本i越不屬於其他簇。將bi定義為樣本i的簇間不相似度 > - 用以下公式會得到 [-1,1] - si 越靠近 1 代表分群越合理 - si 越靠近 -1 代表分群越不合理 - 靠近 0 就代表 i 點在兩群邊界 ![](https://i.imgur.com/IJZ7JgE.png) ![](https://i.imgur.com/l3spAxc.png) ### 優缺點 - pros - 計算快 - 低複雜度 -> 可處理大量資料 - cons - 如果有極值(outliers) -> 影響分群精準度 > 但是要看題目是否需要極值 - 隨機群心會有不同的結果 > 如果初始群心太相近，可能導致分類結果不好 ### 參考網址 - [集群分析](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E9%9B%86%E7%BE%A4%E5%88%86%E6%9E%90-k-means-clustering-e608a7fe1b43) - [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) ### 程式碼 :::spoiler 範例一 ```python= import random import sys import matplotlib.pyplot as plt #畫圖 # 畫圖 def draw(dx,classify,distortions,best,times): color = ['orange','c', 'lightgreen','lavender','pink', 'gray', 'purple', 'k', 'm', 'y', 'g', 'b'] # 母畫布 fig = plt.figure() # 誤差平方和SSE fig.add_subplot(221) plt.title('SSE (elbow method)',fontsize=10) plt.plot(range(times),distortions) plt.plot(best,distortions[best],"go") plt.xlabel("times") # 原始數據 fig.add_subplot(223) plt.title('Original data',fontsize=10) for data in dx: plt.scatter(data[0],data[1],color='black') # 最好的分類結果 fig.add_subplot(224) plt.title('result',fontsize=10) for i in range(len(dx)): plt.scatter(dx[i][0],dx[i][1],color=color[classify[i]]) fig.tight_layout() plt.show() def SSE(data,classify,center): distances= 0 for pos in range(len(data)): cluster = classify[pos] for Dim in range((len(data[pos]))): distances += ((data[pos][Dim] - center[cluster][Dim])**2)**0.5 return distances def kMeans(k,data,center,classify,tmp,Dimension): clusterPlus = [] # 每一群的所有點加總 for c in range(k): clusterPlus.append([0 for Dim in range(Dimension)]) # 每一群有多少個點 dotNum = [0 for c in range(k)] # ============================================================= # 每個點分到最近的群心 for pos in range(len(data)): mini = sys.float_info.max # 預設mini為最大數字 for c in range(k): # 群心 dis = 0 # 計算點到群心的距離 for Dim in range(Dimension): # 維度 dis += ((data[pos][Dim] - center[c][Dim])**2)**0.5 # 距離該群心比較近 if mini > dis: mini = dis close_center = c classify.append(close_center) # 點加到對應的群 for Dim in range(Dimension): clusterPlus[close_center][Dim] += data[pos][Dim] dotNum[close_center] += 1 # 預防某一群沒有點點 if 0 in dotNum: return False # 分完後重新計算群心 for c in range(k): for Dim in range(Dimension): center[c][Dim] = clusterPlus[c][Dim]/dotNum[c] # 群心不變動->收斂 (位置可能不一樣) n = 0 for c in range(k): if center[c] in tmp: n += 1 if n == k: return True return False def result(k,data): center = [] Dimension = len(data[0]) classify = [] # 隨機群心 for c in range(k): contain = [] for Dim in range(Dimension): r = round(random.uniform(min(data)[Dim], max(data)[Dim]),2) contain.append(r) center.append(contain) # 最大收斂次數 for i in range(300): tmp = [] classify.clear() # 複製群心位置 tmp = center.copy() # 分類 if kMeans(k,data,center,classify,tmp,Dimension) == True: return classify,center return classify,center def main(): data = [[-39,32],[-22,38],[-40,23],[-33,37],[-28,37],[-5,-1],[12,1],[9,-11],[26,31],[28,25],[35,30],[37,23],[29,29],[0,0]] k = int(input("要分幾群? ")) times = 20 # 產生幾次結果，迭代次數 score = [] # SSE results = [] # 分類結果 centers = [] # 中心點 for i in range(times): # k-means classify,center = result(k,data) results.append(classify) centers.append(center) # 計算分數 score.append(SSE(data,classify,center)) # 最好的結果 best = score.index(min(score)) # 畫圖 draw(data,results[best],score,best,times) if __name__ == '__main__': main() ``` ::: :::spoiler 範例二 - 使用python 套件 ```python= import matplotlib.pyplot as plt #畫圖 from sklearn.metrics import silhouette_score from sklearn.cluster import KMeans import random import csv def draw(k,dx,distortions,scores,classify): colo = ['orange','c', 'lightgreen','lavender','pink', 'gray', 'purple', 'k', 'm', 'y', 'g', 'b'] # 母畫布 fig = plt.figure() # 誤差平方和SSE fig.add_subplot(221) plt.title('SSE (elbow method)',fontsize=10) plt.plot(range(2,11),distortions) plt.plot(k+2,distortions[k],"go") # 輪廓係數 fig.add_subplot(222) plt.title('Silhouette scores',fontsize=10) plt.plot(range(2,11),scores) plt.plot(k+2,scores[k],"go") # 原始數據 fig.add_subplot(223) plt.title('Original data',fontsize=10) for data in dx: plt.scatter(data[0],data[1],color='black') # 最好的分類結果 fig.add_subplot(224) plt.title('result',fontsize=10) for i in range(len(dx)): plt.scatter(dx[i][0],dx[i][1],color=colo[classify[i]]) fig.tight_layout() plt.show() def k_means(dx): distortions = [] # SSE 分數 scores = [] # 輪廓係數分數 results = [] # 分群結果 centers = [] # 找出最好的分群數 for k in range(2,11): # 分群 n_clusters=k kmeans = KMeans(n_clusters=k,n_init=10,max_iter=5).fit(dx) results.append(kmeans) # 中心 centers.append(kmeans.cluster_centers_) # SSE distortions.append(kmeans.inertia_) # 輪廓係數 scores.append(silhouette_score(dx, kmeans.predict(dx))) # 取結果分數最好的 k = scores.index(max(scores)) classify = results[k].predict(dx) center = centers[k] draw(k,dx,distortions,scores,classify) def main(): data = [[-39,32],[-22,38],[-40,23],[-33,37],[-28,37],[-5,-1],[12,1],[9,-11],[26,31],[28,25],[35,30],[37,23],[29,29],[0,0]] k_means(data) if __name__ == '__main__': main() ``` :::