# LP-Explain: Local Pictorial Explanation for Outliers
## Dataset Processing
> All datasets have been located (the paper includes dataset links).
> **So far only the Air Quality dataset needed preprocessing (date handling); by inspection the others don't.**

[Dataset link](https://drive.google.com/drive/folders/1tAzPU9wNP_MJOOFWN_VPuC-HgcRijrQz?usp=sharing)
**Update:**
These functions return three NumPy arrays:
1. data
2. outlier indices
3. inlier indices
```python=
from datetime import date
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
from typing import Tuple
from numpy.typing import NDArray
import scipy.io
import pandas as pd
import numpy as np

# Air quality dataset
def read_airquality(filename: str) -> Tuple[NDArray, NDArray, NDArray]:
    df = pd.read_csv(filename)
    date_list = df['Date'].values
    time_list = df['Time'].values
    # preprocessing: convert dates to ordinals, keep only the hour
    for index in range(len(date_list)):
        temp = date_list[index].split('/')
        temp2 = time_list[index].split(':')
        x = date(int(temp[2]), int(temp[1]), int(temp[0]))
        date_list[index] = x.toordinal()
        time_list[index] = temp2[0]
    data = df.values
    # outlier detector
    clf = IsolationForest(n_estimators=30, random_state=10101)
    pred = clf.fit_predict(data)
    inlier_index = np.where(pred == 1)[0]
    outlier_index = np.where(pred == -1)[0]
    return data, outlier_index, inlier_index

# Iris dataset
def read_iris() -> Tuple[NDArray, NDArray, NDArray]:
    data = load_iris()['data']
    # outlier detector
    clf = IsolationForest(max_samples=0.5, max_features=1.0, random_state=5)
    pred = clf.fit_predict(data)  # -1 marks an outlier
    inlier_index = np.where(pred == 1)[0]
    outlier_index = np.where(pred == -1)[0]
    return data, outlier_index, inlier_index

# Other datasets (breastw, pendigits, mammography, annthyroid)
def read_matfile(filename: str) -> Tuple[NDArray, NDArray, NDArray]:
    mat = scipy.io.loadmat(filename)
    data, label = mat['X'], mat['y']
    inlier_index = np.where(label[:, 0] == 0)[0]
    outlier_index = np.where(label[:, 0] == 1)[0]
    return data, outlier_index, inlier_index
```
## Notation
* D: the dataset
* F: the feature set; |F| = d is the number of features
* f~k~: the k-th feature of F
* A: the set of outliers; |A| = m is the number of outliers
* FP: a feature pair (a set of two features)
* X ∈ R^m×n^, where n = d(d-1)/2 is the total number of possible feature pairs and m is the number of outliers
* SQ: a set of feature-pair sequences (sq~1~, sq~2~, ..., sq~m~)
* sq~k~: the feature pairs sorted by X~k,.~ and truncated to the top-l (sq~1~^k^, sq~2~^k^, ..., sq~l~^k^), where l ≤ n is a user-defined parameter
* C: the clustering result over the outliers, {C~1~, C~2~, ..., C~c~}
* W ∈ R^n×c^: W~j,i~ measures the importance of the j-th feature pair in cluster i, where c is the number of clusters
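To make the shapes concrete, a small sketch of how d, n, and m relate (the numeric values are illustrative only, not from the paper):

```python
from itertools import combinations
import numpy as np

d = 4                                 # number of features, |F|
pairs = list(combinations(range(d), 2))
n = d * (d - 1) // 2                  # number of possible feature pairs
print(pairs)   # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(n)       # 6

m = 3                                 # number of outliers, |A|
X = np.zeros((m, n))                  # X[k, j]: outlier k's score on feature pair j
```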
## Initialization Parameters
Isolation Forest produces the X matrix:
```python=
def get_combinations(d: int):
    from itertools import combinations
    return combinations(range(d), 2)

def get_outliers_index(data: np.ndarray) -> np.ndarray:
    from sklearn.ensemble import IsolationForest
    clf = IsolationForest(max_samples=0.5, max_features=1.0, random_state=5)
    out = clf.fit_predict(data)
    return np.where(out == -1)[0]

def get_outliers_score(data: np.ndarray) -> np.ndarray:
    from sklearn.ensemble import IsolationForest
    clf = IsolationForest(n_estimators=30, random_state=10101)
    clf = clf.fit(data)
    scores = -clf.decision_function(data)  # higher means more anomalous
    return scores

def X_SQ_generation(data, l=7):
    d = data.shape[1]
    outliers_index = get_outliers_index(data)
    m, n = len(outliers_index), d * (d - 1) // 2
    X = np.empty((m, n), dtype=np.float64)
    # e.g. d = 4 -> pairs (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)
    for i, pair in enumerate(get_combinations(d)):
        # score each 2-D feature-pair projection with Isolation Forest
        X[:, i] = get_outliers_score(data[:, list(pair)])[outliers_index]
    X = (X - X.min()) / (X.max() - X.min())  # normalize to [0, 1]
    SQ = np.argsort(X, axis=1)[:, ::-1][:, :l]  # top-l feature pairs per outlier
    return X, SQ
```
Default parameter settings: {l, h, φ} = {7, 128, 0.9}
{∀θ~i~ = 1 \| i ∈ \[1, 2, ..., c\]}, where c is determined automatically by self-tuning spectral clustering.
## Evaluation Metric
incrimination(%) = osFP~j~,j / osFP,j
osFP~j~,j is the incrimination computed over cluster j's selected feature pairs FP~j~.
osFP,j is the incrimination computed over all feature pairs FP for cluster j.
The incrimination computation follows:
\[15\] N. Gupta, D. Eswaran, N. Shah, L. Akoglu, and C. Faloutsos, "Beyond outlier detection: LookOut for pictorial explanation," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 122–138, 2018.
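The exact incrimination computation is defined in [15]; as a rough sketch of the ratio reported here (`incrimination_pct` is a hypothetical helper, not from the paper):

```python
import numpy as np

def incrimination_pct(os_selected: float, os_all: float) -> float:
    # ratio of the incrimination achieved by the cluster's selected
    # feature pairs to that achieved by all of its feature pairs
    return 100.0 * os_selected / os_all

print(incrimination_pct(4.2, 5.6))  # 75.0
```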



## Formulas
* (formula image not preserved)
* First build the graph G, using the formula above to compute pairwise feature-pair similarities as G's edge weights; then pass it to Node2vec(G, h) to produce the vectors V, where h is a user-defined parameter.
* (formula image not preserved)
* Similar to Figure 1, except that v is used as the input to the computation, with v~k~ ∈ R^h^.
* (formula image not preserved)
* Edge-weight computation:
* φ ∈ (0, 1) is a parameter that tunes the preference for top-ranked pairs.
* max(i, j) selects the larger of the ranks i and j.
* Here i and j (indices into sq) are themselves the ranks.
* fp~simi~(sq~a~^i^, sq~b~^j^) is the formula in Figure 2; sq~a~^i^ is the i-th feature pair in sq~a~, and sq~b~^j^ likewise.
* So the formula computes the edge weight from the i-th feature pair of sq~a~ to the j-th feature pair of sq~b~.
* (formula image not preserved)
* This formula runs the Primal-Dual Maximum Weight Matching algorithm on the two sequences sq~a~ and sq~b~, sums the weights of the resulting maximum-weight matching edges, and multiplies the total by (1 - φ).
* The Primal-Dual Maximum Weight Matching algorithm can be computed directly with an existing Python library; below is example code implementing Eq. 4 in Python:
```python=
import networkx as nx
import networkx.algorithms.matching as matching

# sq_a, sq_b, weight_a_b (from Eq. 3), and phi are assumed to be computed already
G = nx.Graph()  # empty graph
for i in range(len(sq_a)):
    for j in range(len(sq_b)):
        # offset j so the two sides of the bipartite graph get distinct node labels
        G.add_edge(i, j + len(sq_b), weight=weight_a_b[i][j])
M = matching.max_weight_matching(G, maxcardinality=True)
weight_sum = 0
for i, j in M:
    i, j = min(i, j), max(i, j) - len(sq_b)
    weight_sum += weight_a_b[i][j]
weight_sum *= (1 - phi)
```
* The example above assumes the feature pairs of sq~a~ and sq~b~ have already been extracted and that the edge weights between the two sequences were computed with Eq. 3, so the graph can be built directly and networkx's matching function used to obtain the maximum-weight matching edges and sum their weights.
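A self-contained toy run of the same matching step, with hypothetical weights and l = 2 (the diagonal entries dominate, so the matching picks them):

```python
import networkx as nx
import networkx.algorithms.matching as matching
import numpy as np

phi = 0.9
weight_a_b = np.array([[0.9, 0.2],   # hypothetical Eq. 3 edge weights
                       [0.1, 0.7]])
l = 2
G = nx.Graph()
for i in range(l):
    for j in range(l):
        G.add_edge(i, j + l, weight=weight_a_b[i][j])
M = matching.max_weight_matching(G, maxcardinality=True)
# map matched node labels back to (row, column) indices
weight_sum = sum(weight_a_b[min(i, j)][max(i, j) - l] for i, j in M)
print(round(weight_sum * (1 - phi), 3))  # (0.9 + 0.7) * 0.1 = 0.16
```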
### IsolationForest
IsolationForest is an anomaly-detection algorithm; the X matrix is produced with it.
The Python code below uses sklearn's IsolationForest, which cites the same reference as the paper:
```python=
from sklearn.ensemble import IsolationForest
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [0, 0], [-20, 50], [3, 5]])
clf = IsolationForest(n_estimators=30)
clf.fit(X)  # fit 30 trees
print(clf.predict(X))
print(clf.decision_function(X))
# result (varies with the random seed):
# predict -> +1 means inlier, -1 means outlier
# decision_function -> anomaly score
# [ 1  1  1  1 -1 -1]
# [ 0.14499332  0.14499332  0.07289614  0.10108986 -0.22917175 -0.07415002]
```
In total, n(n-1)/2 feature pairs are extracted (n here denotes the number of features).
IsolationForest parameter overview:
* n_estimators: number of trees in the forest (default=100)
* max_samples: number of samples drawn for each tree (default="auto")
* max_features: number of features used by each tree (default=1.0)
* contamination: user-specified proportion of outliers in the data (default=0.1, valid range 0.0–0.5)
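A quick illustration (not from the source; toy data and seed are made up) of how contamination fixes the flagged fraction:

```python
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(0)
# 95 tight inliers plus 5 spread-out points
data = np.vstack([rng.normal(0, 1, size=(95, 2)),
                  rng.uniform(-8, 8, size=(5, 2))])
clf = IsolationForest(contamination=0.05, random_state=0).fit(data)
pred = clf.predict(data)
print((pred == -1).sum())  # roughly 5% of the 100 points get flagged
```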
Our current guess at the X-matrix generation code is below:
each of the n features is paired two-by-two, decision_function is computed per pair, and the results form the X matrix.

```python=
from sklearn.ensemble import IsolationForest
import numpy as np
import itertools
x = np.random.randint(-50, 50, size=(30, 6))  # assume 30 samples with 6 features
pair_features_combination_len = len(x[0]) * (len(x[0]) - 1) // 2
X_matrix = np.zeros((len(x), pair_features_combination_len))
for i, pair_feature_index in enumerate(itertools.combinations(range(len(x[0])), 2)):
    pair_feature = x[:, pair_feature_index]
    clf = IsolationForest(n_estimators=30)
    clf.fit(pair_feature)
    X_matrix[:, i] = clf.decision_function(pair_feature).copy()
print(X_matrix)
```
### Node2Vec 【30】
G is obtained from Eq. 1 and Figure 4, and V is computed with an existing [library](https://pypi.org/project/node2vec/):
```python=
import networkx as nx
from node2vec import Node2Vec

def generate_graph(X: np.ndarray):
    m, n = X.shape
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for a, b in get_combinations(n):
        fp_a = X[:, a]
        fp_b = X[:, b]
        # Eq. 1: similarity = 1 / (1 + Euclidean distance between score columns)
        e_w = 1 / (1 + np.sqrt(np.square(fp_a - fp_b).sum()))
        g.add_edge(a, b, weight=e_w)
    return g

def get_V(G: nx.Graph, h: int = 128) -> np.ndarray:
    node2vec = Node2Vec(G, dimensions=h, seed=10101)
    word2vec = node2vec.fit()
    V = [word2vec.wv[str(n)] for n in G.nodes]
    return np.stack(V, axis=0)  # shape (n, h)
```
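The Eq. 1 edge weight behaves as a similarity in (0, 1]: identical score columns give weight 1, distant ones approach 0. A quick numpy-only check (node2vec itself is a third-party dependency, so it is left out here):

```python
import numpy as np

def eq1_weight(fp_a, fp_b):
    # similarity = 1 / (1 + Euclidean distance between score columns)
    return 1 / (1 + np.sqrt(np.square(fp_a - fp_b).sum()))

a = np.array([0.2, 0.5, 0.9])
print(eq1_weight(a, a))            # identical columns -> 1.0
print(eq1_weight(a, a + 3) < 0.2)  # distant columns -> small weight, True
```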
### Primal-Dual Maximum Weight Matching Algorithm【31】
Combines Eqs. 3 and 4 with Primal-Dual Maximum Weight Matching:
```python=
def fp_simi(vec_fp_a, vec_fp_b):
    return 1 / (1 + np.sqrt(np.sum(np.power(vec_fp_a - vec_fp_b, 2))))

def maximum_weight_matching(SQ, V, l=7, phi=0.5):
    # V is the output of get_V; SQ is the (argsorted) SQ output of X_SQ_generation
    N = len(SQ)
    S: np.ndarray = np.ones((N, N), dtype="float64")
    for a, b in get_combinations(N):
        BG = nx.Graph()  # empty bipartite graph
        e_bp: np.ndarray = np.zeros((l, l), dtype="float64")
        s_ab = 0
        for i, fp_a in enumerate(SQ[a]):
            for j, fp_b in enumerate(SQ[b]):
                # Eq. 3: similarity damped by the larger rank
                e_bp[i][j] = fp_simi(V[fp_a], V[fp_b]) * np.power(phi, max(i, j))
                BG.add_edge(i, j + l, weight=e_bp[i][j])  # edge between fp_a's side and fp_b's side
        M = matching.max_weight_matching(BG, maxcardinality=True)
        for i, j in M:
            i, j = min(i, j), max(i, j) - l
            s_ab += e_bp[i][j]
        s_ab *= (1 - phi)  # Eq. 4
        S[a][b] = S[b][a] = s_ab
    return S  # input affinity matrix for Self-Tuning Spectral Clustering
```
### Self-Tuning Spectral Clustering (STSC【32】)
#### Main ideas of the paper
Clustering is often done with k-means, which produces high-quality results when the data distribution closely matches the assumed model. When the data is arranged in more complex, unknown shapes, however, these methods perform poorly. An alternative for such structured data is spectral clustering, which only requires a spectral analysis of the point-to-point similarity matrix.
Spectral clustering has three problems, however, which this paper improves upon.
#### Code
```python=
import numpy as np
from stsc_ulti import affinity_to_lap_to_eig, reformat_result, get_min_max
from stsc_np import get_rotation_matrix as get_rotation_matrix_np
from stsc_autograd import get_rotation_matrix as get_rotation_matrix_autograd
from stsc_manopt import get_rotation_matrix as get_rotation_matrix_manopt

def self_tuning_spectral_clustering(affinity, get_rotation_matrix, min_n_cluster=None, max_n_cluster=None):
    w, v = affinity_to_lap_to_eig(affinity)
    min_n_cluster, max_n_cluster = get_min_max(w, min_n_cluster, max_n_cluster)
    re = []
    for c in range(min_n_cluster, max_n_cluster + 1):
        x = v[:, -c:]
        cost, r = get_rotation_matrix(x, c)
        re.append((cost, x.dot(r)))
        print('n_cluster: %d \t cost: %f' % (c, cost))
    COST, Z = sorted(re, key=lambda x: x[0])[0]
    return reformat_result(np.argmax(Z, axis=1), Z.shape[0])

def self_tuning_spectral_clustering_np(affinity, min_n_cluster=None, max_n_cluster=None):
    return self_tuning_spectral_clustering(affinity, get_rotation_matrix_np, min_n_cluster, max_n_cluster)

def self_tuning_spectral_clustering_autograd(affinity, min_n_cluster=None, max_n_cluster=None):
    return self_tuning_spectral_clustering(affinity, get_rotation_matrix_autograd, min_n_cluster, max_n_cluster)

def self_tuning_spectral_clustering_manopt(affinity, min_n_cluster=None, max_n_cluster=None):
    return self_tuning_spectral_clustering(affinity, get_rotation_matrix_manopt, min_n_cluster, max_n_cluster)
```
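Since the `stsc_*` modules come from a third-party repository, a quick sanity check of the pipeline can substitute sklearn's SpectralClustering on the affinity matrix S (fixed number of clusters rather than self-tuning; the toy matrix below is made up):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# toy affinity matrix S: two blocks of mutually similar outliers
S = np.full((6, 6), 0.1)
S[:3, :3] = 0.9
S[3:, 3:] = 0.9
np.fill_diagonal(S, 1.0)

labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(S)
print(labels)  # the first three points share one label, the last three the other
```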
### Feature pair selection
Compute a weight vector for each cluster.

X~ak,.~ is one outlier's isolation-forest outlier scores over all feature pairs (with 10 feature pairs, its length is 10).
W~.,i~ is the weight column to be solved for, with the same length as X~ak,.~.
Steps:
1. Element-wise multiply X~ak,.~ with W~.,i~.
The resulting vector is the outlier's feature-pair scores, each multiplied by a (to-be-solved) weight.
Constraints on the weights:
    1. ||W||~2,1~ is bounded (the mixed-norm constraint in the solver below).
    2. Every element of W must satisfy 0 ≤ element ≤ 1.
2. Take max(step 1): for each outlier, keep the largest product from step 1.
3. Multiply each outlier's value by a user-defined parameter θ, one θ per cluster: (θ1, θ2, ..., θc) ∈ {0, 1}.
4. Sum the step-3 values over all outliers in the same cluster.
So each cluster ends up with a single weighted value.
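The steps above can be sketched numerically for a fixed candidate W (a hypothetical hand-picked example, not the solver itself):

```python
import numpy as np

X = np.array([[0.9, 0.1, 0.4],     # outlier scores over 3 feature pairs
              [0.2, 0.8, 0.5]])    # two outliers in one cluster
W_col = np.array([1.0, 0.5, 0.0])  # candidate weight column W[:, i]
theta = 1.0                        # one theta per cluster

# step 1: element-wise product; step 2: max per outlier; steps 3-4: scale and sum
per_outlier = (X * W_col).max(axis=1)
cluster_score = theta * per_outlier.sum()
print(cluster_score)  # 0.9 + 0.4 = 1.3
```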
### CVX Solver【33】
> Eq. 6

* **fp_num:** number of feature pairs
* **C:** clusters from Self-Tuning Spectral Clustering
* **cluster_size:** size of each cluster
* **cluster_num:** number of clusters
* **W:** weight matrix (W~j,i~ for the j-th feature pair in the i-th cluster)
* **Z:** shape (cluster_num, cluster_size[i], fp_num)
* **summation:** Eq. 6's summation
* **constraints:** list of constraints
* **lbda:** Eq. 6's lambda
```python=
import cvxpy as cp
from typing import List

def cvx_solver(C: List, X: NDArray, fp_num: int, lbda: int) -> NDArray:
    # parameter configuration
    cluster_size = [len(c) for c in C]
    cluster_num = len(C)
    W = cp.Variable((fp_num, cluster_num))
    Z = [cp.Variable((cluster_size[i], fp_num)) for i in range(cluster_num)]
    theta = np.ones((cluster_num))
    summation = 0
    constraints = [W >= 0, W <= 1, cp.mixed_norm(W, 2, 1) <= lbda]
    # sum over clusters and their outliers of theta_i * (z . X)
    for i in range(cluster_num):
        for ak in range(cluster_size[i]):
            summation += theta[i] * (Z[i][ak] @ X[C[i][ak]].T)
            # Eq. 6 constraints
            temp_summation = 0
            for j in range(fp_num):
                temp_summation += Z[i][ak][j]
                constraints.append(Z[i][ak][j] <= W[j][i])
            constraints.append(temp_summation <= 1)
    # solve the problem
    problem = cp.Problem(cp.Maximize(summation), constraints)
    problem.solve()
    return W.value
```
## 2021/12/12 Run Results
> All 6 datasets have now been run once; the results are on GitHub (6 .ipynb files).
> Using .ipynb makes testing easier 0.0
> **I'm not sure how the outlier detector's parameters should be tuned**; for now they are set arbitrarily and judged by whether the results look good.
* [Air Quality Dataset](https://github.com/Eric07110904/NTUST-Data-Mining-Final/blob/master/airquality_demo.ipynb)
* [Iris Dataset](https://github.com/Eric07110904/NTUST-Data-Mining-Final/blob/master/iris_demo.ipynb)
* [Breast Cancer Dataset](https://github.com/Eric07110904/NTUST-Data-Mining-Final/blob/master/breastw_demo.ipynb)
* [Mammography Dataset](https://github.com/Eric07110904/NTUST-Data-Mining-Final/blob/master/mammograph_demo.ipynb)
* [Annthyroid Dataset](https://github.com/Eric07110904/NTUST-Data-Mining-Final/blob/master/annthyroid_demo.ipynb)
* [Pendigits Dataset](https://github.com/Eric07110904/NTUST-Data-Mining-Final/blob/master/pendigit_demo.ipynb)
## Questions & Discussion (2021/12/11)
* I think X generation may have a problem: **the Isolation Forest parameters shouldn't be hard-coded.**
* Since we don't know how the original paper set its parameters, our **X** is certainly different, and consequently **C and W differ too (the resulting scatter plots won't match the paper's).**
* Running example.py (air quality dataset) takes really long @@
* The air quality dataset has too many outliers, so a lot of graphs have to be built; I'll see how to optimize the maximum_weight_matching part.
* With m outliers, $m * (m - 1) / 2$ graphs must be built.
* After slightly optimizing the matching step it runs a bit faster, but feeding the results into clustering has been running forever and still hasn't finished...
> For the clustering part, the paper ends up with 8 clusters; should we force it to 8 and try?
> When I ran example.py, I recall it split into 20-something clusters and kept going 0.0
* Question from 2021/12/08
It looks like read_airquality() has a problem: filename isn't passed in.

Then I'll hard-code read_airquality's filename as the constant `dataset/AirQualityUCI_req.csv`.
> **The cvx solver has also been pushed to the utils folder.**
* How should the outlier scores for each feature pair (fp~i~) be computed?
* The paper says any outlier detection method that can score 2-D data works for generating the scores, which is probably what you had in mind.
* The paper uses IsolationForest as the outlier detection method; the details are explained in the IsolationForest section above.
* What does ||W||~2,1~ mean? QWQ
* It seems to compute the length of each column of the matrix and sum them.
* Thanks! XD
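Following the per-column reading above, a quick numeric check (note that conventions for the 2,1-norm vary between rows and columns across libraries):

```python
import numpy as np

W = np.array([[3.0, 0.0],
              [4.0, 1.0]])
# L2 norm of each column, then summed
col_norms = np.sqrt((W ** 2).sum(axis=0))
print(col_norms)        # [5. 1.]
print(col_norms.sum())  # 6.0
```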
###### tags: `Data Mining`