DBSCAN

Idea
Algorithm
Advantage
Disadvantage
Reference

tags: `Machine Learning`, `Clustering`

Idea

透過設定的半徑(

ϵ

)及最少資料點數(MinPts)將資料分成三類：

核心點(core points)
對於給定點P，如果以
$ϵ$ 為半徑張開的圓中包含至少MinPts個資料點(包括P自己)，則稱P為核心點。
邊界點(border points)
對於給定點P，如果以
$ϵ$ 為半徑張開的圓中包含小於MinPts個資料點(包括P自己)，但P包含於某個核心點張出的圓中，則稱P為邊界點。
雜訊(Noise)
對於給定點P，如果以
$ϵ$ 為半徑張開的圓中包含至少MinPts個資料點(包括P自己)，且P不包含於任意核心點張出的圓中，則稱P為雜訊。

Algorithm

返回 DBSCAN
返回 Clustering

DBSCAN(D, eps, MinPts) {
	C = 0
	for each point P in dataset D {
        P is visited
            continue next point
        rk P as visited
        NeighborPts = regionQuery(P, eps)
        if sizeof(NeighborPts) < MinPts
            mark P as NOISE
	else {
            C = next cluster
            expandCluster(P, NeighborPts, C, eps, MinPts)
        }
	}
}

expandCluster(P, NeighborPts, C, eps, MinPts) {
	add P to cluster C
	for each point Q in NeighborPts {
		if Q is not visited {
			mark Q as visited
			NeighborPts_Q = regionQuery(Q, eps)
			if sizeof(NeighborPts_Q) >= MinPts
                NeighborPts = NeighborPts joined with NeighborPts_Q
		}
		if Q is not yet member of any cluster
			add Q to cluster C
	}
}

regionQuery(P, eps)
    return all points within in the eps-neighborhood of P (including P)

翻譯成白話文就是：

# def DBSCAN(資料D、半徑epsilon、MinPts)
   # 觀察每一個資料點(P)
      # 如果P已經看過
         # 找下一個資料點
      # 把P標記成已看過
      # 找出P的鄰居
      # 如果鄰居數量<MinPts
         # 先把P當成NOISE (如是邊界點之後會有核心點來領走它)
      # 如果鄰居數量>=MinPts
         # 就把P當成一個新的群體
         # 並且找出所有屬於這個群體的點

# def 找出所有屬於群體的點(核心點P、P的鄰居、群組C、半徑epsilon、MinPts)
   # 把P加入群組C
   # 考慮P的每個鄰居(Q)
      # 如果Q還沒被看過
         # 把Q標示成看過
         # 找出Q的鄰居
         # 如果Q的鄰居數量>=MinPts (代表Q也是核心點)
            # 就把Q的鄰居也當成是P的鄰居
      # 如果Q還沒被加進任何群體(=>Q是還沒被領走的邊界點)
         # 就把Q加進群組C

# def 找出鄰居(點P、半徑epsilon)
   # 回傳所有鄰居

Advantage

返回 DBSCAN
返回 Clustering

不需事先指定群數
可以找出任何形狀的聚類
能分辨Noise
如果對資料有足夠的瞭解，可以選擇適當的參數來達到好的分類效果

Disadvantage

返回 DBSCAN
返回 Clustering

不同的資料順序可能會有不同的分群結果(幸運的是，這種情況不常見)
在高維度中，受到「維度災難」的影響，很難找出合適的
$ϵ$
如果資料庫裡的點有不同的密度，則無法得到好的分群結果

Reference

返回 DBSCAN
返回 Clustering

DBSCAN

tags: Machine Learning, Clustering

Idea

Algorithm

Advantage

Disadvantage

Reference

tags: `Machine Learning`, `Clustering`