Account analysis: hash string clustering 201223

This week's experiments

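KMeans feature adjustment
- Use the value of hash all and word_length (the string length) as features
- Reason for the change: the two features are presumed to be highly correlated
Concatenate accounts that carry custom serial numbers with the original data, then use the KMeans result to check whether accounts with similar serial numbers or similar strings end up in different clusters
Take the initial clusters and subdivide them further using minimum edit distance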

Experimental method

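Create three kinds of accounts, 50 of each, 150 in total, following these rules.
1. 'hm0000' + serial number
2. 'ssu' + serial number + 'ssu'
3. 'asd' + serial number + 'asd'
- (serial number range: 0~50)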
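The three rules above can be sketched as follows. This is a minimal illustration, not the script used in the experiment: the function name `make_serial_accounts` is hypothetical, and it assumes serial numbers 0 to 49 with no zero padding so that each rule yields exactly 50 accounts.

```python
def make_serial_accounts(n_per_rule=50):
    """Build the three synthetic account families (assumed serials 0..n_per_rule-1)."""
    accounts = []
    # Rule 1: 'hm0000' + serial number
    accounts += [f"hm0000{i}" for i in range(n_per_rule)]
    # Rule 2: 'ssu' + serial number + 'ssu'
    accounts += [f"ssu{i}ssu" for i in range(n_per_rule)]
    # Rule 3: 'asd' + serial number + 'asd'
    accounts += [f"asd{i}asd" for i in range(n_per_rule)]
    return accounts


accounts = make_serial_accounts()
```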
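Take 150 records from the original data as well and concatenate them with the serial-number accounts, giving 300 accounts as test data
Convert the filtered accounts to numeric values with a hash function, and use this hash value together with the account length as the KMeans features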
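A possible shape for this feature step is sketched below. The note does not say which hash function was used, so `zlib.crc32` here is purely an assumption, chosen because it is stable across runs (Python's built-in `hash()` is salted per process for strings). The helper names are hypothetical. Note that a well-mixing hash does not map similar strings to nearby values, so the result depends heavily on the hash actually chosen.

```python
import zlib


def hash_value(account: str) -> int:
    # Assumption: CRC32 stands in for the unspecified hash function.
    return zlib.crc32(account.encode("utf-8"))


def to_features(accounts):
    # One (hash value, string length) pair per account.
    return [(hash_value(a), len(a)) for a in accounts]


features = to_features(["hm00001", "ssu1ssu", "asd12asd"])
```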
Cluster with KMeans and export the result for inspection
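A minimal sketch of this clustering step using scikit-learn's `KMeans`. Several things here are assumptions rather than the experiment's actual setup: CRC32 as the hash, standardizing both features so the large hash values do not swamp the length, and `n_init=10` with a fixed seed. Only the synthetic accounts are used, since the original data is not part of this note.

```python
import zlib
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_accounts(accounts, n_clusters=3, seed=0):
    # Features: (hash value, string length); CRC32 is an assumed hash.
    X = np.array(
        [[zlib.crc32(a.encode("utf-8")), len(a)] for a in accounts], dtype=float
    )
    # Assumption: put both features on a comparable scale before KMeans.
    X = StandardScaler().fit_transform(X)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(X)


accounts = (
    [f"hm0000{i}" for i in range(50)]
    + [f"ssu{i}ssu" for i in range(50)]
    + [f"asd{i}asd" for i in range(50)]
)
labels = cluster_accounts(accounts)

# Collect accounts per cluster so each group can be inspected or exported.
groups = defaultdict(list)
for account, label in zip(accounts, labels):
    groups[int(label)].append(account)
```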
After the initial clustering, build a minimum edit distance matrix for each cluster
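The per-cluster distance matrix could be built along these lines. This is a plain-Python sketch of the standard Levenshtein distance (insert, delete, substitute, each at cost 1); whether the experiment used exactly these costs is an assumption, and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic programming, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_matrix(strings):
    # Symmetric matrix with a zero diagonal; only the upper triangle is computed.
    n = len(strings)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = edit_distance(strings[i], strings[j])
    return D


D = distance_matrix(["hm00001", "hm00002", "ssu1ssu"])
```

The matrix needs one distance per pair of accounts, so its cost grows quadratically with cluster size.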
Then run hierarchical clustering on the distance matrix to produce a dendrogram
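A sketch of this step with SciPy, assuming average linkage (the note does not state which linkage was used). The 4x4 matrix holds hand-filled edit distances for four illustrative accounts; `dendrogram` is called with `no_plot=True` so the tree layout is computed without needing a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

names = ["hm00001", "hm00002", "ssu1ssu", "ssu2ssu"]
# Pairwise edit distances for the four accounts above (illustrative input).
D = np.array(
    [
        [0, 1, 7, 7],
        [1, 0, 7, 7],
        [7, 7, 0, 1],
        [7, 7, 1, 0],
    ],
    dtype=float,
)

# linkage expects the condensed (upper-triangle) form of the distance matrix.
condensed = squareform(D)
Z = linkage(condensed, method="average")  # assumed linkage method

# Cut the tree into at most 2 flat groups.
flat = fcluster(Z, t=2, criterion="maxclust")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, labels=names, no_plot=True)
```

Dropping `no_plot=True` draws the tree when matplotlib is available.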
Finally, inspect the clustering results of the three groups

Experimental results

KMeans clustering (3 clusters)

Judging from the results, this feature combination works well

Hierarchical clustering

group 0 (splits into three major clusters)
group 1 (forms only one major cluster)
group 2 (splits into two major clusters)
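Judging from this experiment, the approach works very well, and the computational cost also drops, since the edit distance matrix is built within each KMeans cluster rather than across all 300 accounts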

KMeans cluster-count experiment

KMeans clustering (6 clusters) (results are acceptable)

KMeans clustering (9 clusters) (starting to overfit)

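One way to put numbers next to the 3 / 6 / 9 comparison is to record the KMeans inertia (within-cluster sum of squares) for each cluster count. The sketch below makes the same assumptions as before: CRC32 as the hash, standardized features, synthetic accounts only. The inertia criterion itself is also an addition, as the comparison in this report was visual.

```python
import zlib

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def inertia_by_k(accounts, ks=(3, 6, 9), seed=0):
    # Features: (hash value, string length); CRC32 is an assumed hash.
    X = np.array(
        [[zlib.crc32(a.encode("utf-8")), len(a)] for a in accounts], dtype=float
    )
    X = StandardScaler().fit_transform(X)
    result = {}
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        result[k] = float(model.inertia_)
    return result


accounts = (
    [f"hm0000{i}" for i in range(50)]
    + [f"ssu{i}ssu" for i in range(50)]
    + [f"asd{i}asd" for i in range(50)]
)
inertia = inertia_by_k(accounts)
```

Inertia generally keeps falling as the cluster count grows, so a lower value alone does not mean a better clustering. The point where the drop flattens is usually the more informative signal.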

Next week's plan

Try a larger dataset, observe which factors behave differently as the data volume changes, and adjust accordingly

tags: `Progress Report`