---
tags: 'ML notes'
---

# Techniques for Dimensionality Reduction

[Techniques for Dimensionality Reduction](https://towardsdatascience.com/techniques-for-dimensionality-reduction-927a10135356)

Three broad families:

1. Feature elimination and extraction
2. Linear algebra
3. Manifold learning

![](https://i.imgur.com/9G1jgKi.png)

## 1. Feature extraction and elimination

### Missing values ratio

A variable with a high proportion of missing values rarely helps an ML model, so it is a candidate for removal.

- Implementation

```python=
# Does each variable contain any missing values?
df.isnull().any(axis=0)

# Number of missing values per variable
df.isnull().sum(axis=0)

# Ratio of missing values per variable
df.isnull().sum(axis=0) / df.shape[0]
```

### Low-variance filter

A variable whose variance is very low carries little information for an ML model, so a variance threshold is usually set to filter such variables out.

```python=
import pandas as pd
from sklearn.preprocessing import normalize

# Remember to drop (or encode) non-numeric columns before the steps below
print(data.dtypes)
data = data.drop('column_to_drop', axis=1)

# Inspect the variance of every variable to decide on a threshold
data_scaled = pd.DataFrame(normalize(data))
print(data_scaled.var())

# Store the variance of each variable
variance = data_scaled.var()
columns = data.columns

# Keep the variables whose variance is above the threshold
variable = []
for i in range(0, len(variance)):
    if variance[i] >= 0.006:   # setting the threshold
        variable.append(columns[i])

# Create a new dataframe using the selected variables
new_data = data[variable]
```

Another implementation:

```python=
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)
selector.fit_transform(X)
print(selector.variances_)
```

### High-correlation filter

Check whether some variables are strongly correlated with each other; if they are, keeping only one (or a few) of them is enough.

#### Pearson’s Product Momentum Coefficient (Pearson correlation coefficient)

Pearson assumes:

1. Both variables are normally distributed
2. The relationship between the two variables is linear
3. The data are evenly spread around the regression line (homoscedasticity)

The value of r lies between -1 and 1.

![](https://i.imgur.com/JWEvEMA.jpg)

```python=
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Remember to drop or encode non-numeric columns before the steps below

# Inspect the correlation matrix
corr = data.corr('pearson')   # 'pearson' is the default, so data.corr() also works
sns.heatmap(corr, annot=True)

# Remove one variable from every pair whose correlation exceeds the threshold
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if abs(corr.iloc[i, j]) >= 0.9:   # threshold; abs() so strong negative correlation also counts
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
new_data = data[selected_columns]
```

#### Spearman’s rank correlation coefficient

Use Spearman when the relationship is non-linear; it does not require the data to be normally distributed. Spearman works on ranked data rather than raw values and applies to both continuous and discrete variables.

The value lies between -1 and 1.

- There are two definitions, depending on whether tied ranks exist in the data

1. When there are tied ranks

![](https://i.imgur.com/gY88LLi.jpg)

- $d_i$ = difference in paired ranks
- $n$ = number of cases

2. When there are no tied ranks

![](https://i.imgur.com/AlEL2xk.jpg)

- $i$ = paired score

```python=
# In practice, only the correlation method needs to change
corr = data.corr('spearman')
```

#### Kendall’s rank correlation coefficient (Kendall’s tau coefficient)

Kendall can be used when the data have a strong ordinal/rank character. Like Spearman it works on ranked data and is well suited to discrete variables.

It tests whether pairs of observations are concordant or discordant:

- Concordant: the two variables rank the pair in the same order (same sign)
- Discordant: the two variables rank the pair in opposite orders (opposite sign)

Assumptions:

1. The variables are measured on an ordinal or continuous scale (they can be continuous or discrete)
2. The data follow a monotonic relationship (not strictly required)

![](https://i.imgur.com/7lTRTs3.png)

The value lies between -1 and 1.

- There are two definitions, depending on whether tied ranks exist in the data

1. When there are tied ranks

![](https://i.imgur.com/NsP1VjB.jpg)

2. When there are no tied ranks

![](https://i.imgur.com/bauHXFL.jpg)

```python=
# In practice, only the correlation method needs to change
corr = data.corr('kendall')
```

#### Mutual Information (MI)

Measures how strongly two variables depend on each other; 0 means the two variables are independent.

MI is more general than correlation: it measures how much information about one random variable is gained by observing the other.

![](https://i.imgur.com/2XQZGDU.jpg)

- $D_{KL}$ is the KL divergence
- $p(x,y)$ is the joint probability distribution of the random variables $X$ and $Y$
- $p(x)$ and $p(y)$ are the marginal distributions

The computation differs depending on whether the variables are discrete or continuous:

1. If $X$ and $Y$ are discrete ($f$ is a pmf)

![](https://i.imgur.com/7wdGjVO.jpg)

2. If $X$ and $Y$ are continuous ($f$ is a pdf)

![](https://i.imgur.com/Ic2VDwA.jpg)
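As a quick illustration of the discrete definition above, the MI of two discrete variables can be computed directly with `sklearn.metrics.mutual_info_score`. The arrays below are made-up toy data and this is only a minimal sketch.

```python=
import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical toy variables: y mostly copies x, z is unrelated noise
x = np.array([0, 0, 1, 1, 2, 2, 0, 1])
y = np.array([0, 0, 1, 1, 2, 2, 0, 2])
z = np.array([1, 0, 1, 0, 1, 0, 1, 0])

print(mutual_info_score(x, y))  # relatively high: y carries information about x
print(mutual_info_score(x, z))  # much lower: z is (nearly) independent of x
```

The score is measured in nats and is not bounded by 1, so it is most useful for comparing features against each other rather than as an absolute scale.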
For feature selection, `sklearn.feature_selection` provides both `mutual_info_regression` and `mutual_info_classif`:

```python=
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest

# Select the number of features you want to retain
select_k = 10

# Get only the numerical features
numerical_x_train = x_train[x_train.select_dtypes([np.number]).columns]

# Create the SelectKBest with the mutual information strategy
selection = SelectKBest(mutual_info_classif, k=select_k).fit_transform(numerical_x_train, y_train)

print(selection)
```

#### Chi-squared Score

Tests whether categorical variables are independent of the target; it is meant for categorical (e.g. binary) targets and categorical, non-negative features.

```python=
# Simply change the score_func of SelectKBest to chi2
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

selection = SelectKBest(chi2, k=select_k).fit_transform(x_train, y_train)
```

#### ANOVA Univariate Test

Likewise a univariate test of whether each feature is related to the target, based on the ANOVA F-test.

Assumptions:

1. The relationship between X and Y is linear
2. The variables are normally distributed

```python=
# Both f_regression and f_classif are available
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

selection = SelectKBest(f_classif, k=select_k).fit_transform(x_train, y_train)
```

### Random forest

Fit a tree-based model on each feature separately and score how well that single feature predicts the target; features with low scores can then be dropped. For classification use roc_auc_score; for regression switch to a metric such as RMSE.

```python=
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# List of the resulting scores
roc_values = []

# Loop over all features and calculate the score
for feature in x_train.columns:
    clf = DecisionTreeClassifier()
    clf.fit(x_train[feature].to_frame(), y_train)
    y_scored = clf.predict_proba(x_test[feature].to_frame())
    roc_values.append(roc_auc_score(y_test, y_scored[:, 1]))
    # For regression, use DecisionTreeRegressor with clf.predict and append
    # mean_squared_error(y_test, y_pred, squared=False) (i.e. RMSE) instead

# Create a Pandas Series for visualisation
roc_values = pd.Series(roc_values)
roc_values.index = x_train.columns

# Show the results
print(roc_values.sort_values(ascending=False))
```
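The loop above scores each feature with its own tree. A complementary, hedged sketch (not from the original note) is to fit a single random forest on all features and let `SelectFromModel` keep the columns whose impurity-based importance clears a threshold; `x_train` and `y_train` are assumed to be the same training frame and labels used above.

```python=
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit one forest on all features and rank them by impurity-based importance
forest = RandomForestClassifier(n_estimators=200, random_state=42)
selector = SelectFromModel(forest, threshold='median')  # keep features above the median importance
selector.fit(x_train, y_train)

selected_features = x_train.columns[selector.get_support()]
print(selected_features)

x_train_reduced = selector.transform(x_train)
```

Because impurity-based importances are computed jointly, strongly correlated features tend to share importance, so the ranking can differ from the one-feature-at-a-time loop.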
### Backwards-feature elimination

A top-down approach: start from all variables and repeatedly test whether a variable can be removed.

### Forward-feature construction

A bottom-up approach: start from no variables at all and repeatedly test whether a variable should be added.

Both forward and backward selection can be implemented with **SequentialFeatureSelector (SFS)**.

- sklearn version

```python=
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

# Model
estimator = RandomForestClassifier(random_state=101)

# SFS defaults to forward selection; set direction='backward' for backwards-feature elimination
# Here we want to end up with three features
sfs = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='forward')
sfs.fit(X, y)

sfs.get_support()
sfs.transform(X).shape
```

- mlxtend version
    - The difference from the sklearn version is that `fixed_features` can be set to guarantee that certain features stay in the solution
    - With fixed_features=(1, 3, 7), the 2nd, 4th, and 8th features are guaranteed to be present in the solution

[Reference on the differences](https://axk51013.medium.com/scikit-learn-0-24-%E6%9B%B4%E6%96%B0-sequentialfeatureselector-%E4%BB%8B%E7%B4%B9-ed9b06e04326)

```python=
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.ensemble import RandomForestClassifier

# forward=True is forward selection, forward=False is backward elimination
sfs = SequentialFeatureSelector(RandomForestClassifier(),
                                k_features=10,
                                forward=True,
                                floating=False,
                                scoring='accuracy',
                                cv=2)

# Fit the object to the training data
sfs = sfs.fit(x_train, y_train)

# Print the selected features
selected_features = x_train.columns[list(sfs.k_feature_idx_)]
print(selected_features)

# Print the final prediction score
print(sfs.k_score_)

# Transform to the newly selected features
x_train_sfs = sfs.transform(x_train)
x_test_sfs = sfs.transform(x_test)
```

### Recursive feature elimination (RFE) and recursive feature elimination with cross-validation (RFECV)

RFE repeatedly fits the estimator and discards the weakest features until the desired number remains; RFECV uses cross-validation to choose that number automatically.

```python=
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Model
estimator = RandomForestClassifier(random_state=101)

# For plain RFE use: selector = RFE(estimator, n_features_to_select=5, step=1)
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)

selector.support_
selector.ranking_
```

## 2. Linear algebra methods

### Principal component analysis (PCA)

[How Where and When we should use PCA](https://towardsdatascience.com/how-where-and-when-we-should-use-pca-ab3dddad5888)

PCA suits high-dimensional data with complex collinearity, noisy data, and unlabelled data that needs compression; it is an unsupervised method.

**PCA captures only linear structure and works best when the features are correlated with one another; if the variables are uncorrelated or only weakly related, there is little for PCA to compress.**

It projects vectors from the original space onto a lower-dimensional subspace (the reduced PCA space), finding one or more N-dimensional vectors that best represent the data X and reducing the dimensionality accordingly. The principal components (PCs) are the eigenvectors of the covariance/correlation matrix.

What makes a vector the best representative?

1. Maximum variance: the new K-dimensional features L obtained after the reduction have the largest variance
2. Minimum error: reconstructing the N-dimensional data from the new K-dimensional features L gives the smallest reconstruction error

The result of PCA is a **set of linear combinations of the original variables that captures the largest variance in the data**.

- The resulting projected data are essentially linear combinations of the original data capturing most of the variance in the data (Jolliffe 2002).

![](https://i.imgur.com/XVBx5vG.jpg)

- sklearn
    - sklearn uses SVD to project the data onto the lower-dimensional space

```python=
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

X_new = pca.transform(X)
plt.scatter(X_new[:, 0], X_new[:, 1], marker='o', c=y)
plt.show()
```
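To decide how many components to keep, `n_components` can also be passed as a fraction between 0 and 1, in which case sklearn keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch, assuming `X` is an already-scaled feature matrix like the one above:

```python=
from sklearn.decomposition import PCA

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # should be >= 0.95
```

This is convenient when the goal is a target amount of retained variance rather than a specific dimensionality.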
- PCA with a biplot

```python=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

plt.style.use('ggplot')

# Load the data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Applying a z-score is part of the PCA workflow
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

# The PCA model
pca = PCA(n_components=2)      # estimate only 2 PCs
X_new = pca.fit_transform(X)   # project the original data into the PCA space

# Plot the data before and after PCA
fig, axes = plt.subplots(1, 2)
axes[0].scatter(X[:, 0], X[:, 1], c=y)
axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_title('Before PCA')
axes[1].scatter(X_new[:, 0], X_new[:, 1], c=y)
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].set_title('After PCA')
plt.show()

print(pca.explained_variance_ratio_)

# Importances
print(abs(pca.components_))

# biplot
def biplot(score, coeff, y):
    '''
    Author: Serafeim Loukas, serafeim.loukas@epfl.ch
    Inputs:
       score: the projected data
       coeff: the eigenvectors (PCs)
       y: the class labels
    '''
    xs = score[:, 0]     # projection on PC1
    ys = score[:, 1]     # projection on PC2
    n = coeff.shape[0]   # number of variables
    plt.figure(figsize=(10, 8), dpi=100)
    classes = np.unique(y)
    colors = ['g', 'r', 'y']
    markers = ['o', '^', 'x']
    for s, l in enumerate(classes):
        plt.scatter(xs[y == l], ys[y == l], c=colors[s], marker=markers[s])  # color based on group
    for i in range(n):
        # plot as arrows the variable scores (each variable has a score for PC1 and one for PC2)
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='k', alpha=0.9,
                  linestyle='-', linewidth=1.5, overhang=0.2)
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                 color='k', ha='center', va='center', fontsize=10)

    plt.xlabel("PC{}".format(1), size=14)
    plt.ylabel("PC{}".format(2), size=14)
    limx = int(xs.max()) + 1
    limy = int(ys.max()) + 1
    plt.xlim([-limx, limx])
    plt.ylim([-limy, limy])
    plt.grid()
    plt.tick_params(axis='both', which='both', labelsize=14)

import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)  # reset ggplot style

# Call the biplot function for only the first 2 PCs
biplot(X_new[:, 0:2], np.transpose(pca.components_[0:2, :]), y)
plt.show()
```

- PCA with a heatmap and scatter plots

```python=
import seaborn as sns
import pandas as pd
import numpy as np
from tabulate import tabulate
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1).copy()
y = iris['species'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(f'The training set has {X_train.shape[0]} rows.')
print(f'The test set has {X_test.shape[0]} rows.')

scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)

pca = PCA(random_state=42)
X_train_pca = pca.fit_transform(X_train_scaler)

train_iris = pd.DataFrame(np.concatenate([X_train_pca, np.array(y_train).reshape(-1, 1)], axis=1))
train_iris.rename(columns={0: 'PC1', 1: 'PC2', 2: 'PC3', 3: 'PC4', 4: 'species'}, inplace=True)
train_iris[['PC1', 'PC2', 'PC3', 'PC4']] = train_iris[['PC1', 'PC2', 'PC3', 'PC4']].astype(float)
print(tabulate(train_iris.corr(), headers='keys', tablefmt='psql'))

# Feature grouping
fig, ax = plt.subplots(figsize=(12, 12))
plt.imshow(pca.components_.T, cmap='Spectral', vmin=-1, vmax=1)
plt.yticks(range(len(X_train.columns)), X_train.columns, fontsize=12)
plt.xticks(range(4), range(1, 5), fontsize=12)
plt.xlabel('Principal Components', fontsize=15)
plt.ylabel('Features', fontsize=15)
plt.title('Distribution of Features by Principal Components', fontsize=20)
plt.colorbar()
plt.show()

# Dimension reduction without significant loss of information
from prettytable import PrettyTable

fig = plt.figure(figsize=(12, 8))
fig.subplots_adjust(wspace=.4, hspace=.4)

ax = fig.add_subplot(2, 1, 1)
ax.bar(range(1, 1 + pca.n_components_), pca.explained_variance_ratio_, color='#FFB13F')
ax.set(xticks=[1, 2, 3, 4])
plt.yticks(np.arange(0, 1.1, 0.1))
plt.title('Explained variance', fontsize=15)
plt.xlabel('Principal components', fontsize=13)
plt.ylabel('% of explained variance', fontsize=13)

ax = fig.add_subplot(2, 1, 2)
ax.bar(range(1, 1 + pca.n_components_), np.cumsum(pca.explained_variance_ratio_), color='#FFB13F')
ax.set(xticks=[1, 2, 3, 4])
plt.yticks(np.arange(0, 1.1, 0.1))
plt.title('Cumulative explained variance', fontsize=15)
plt.xlabel('Principal components', fontsize=13)
plt.ylabel('% of explained variance', fontsize=13)
plt.show()
t = PrettyTable(['Component', 'Explained Variance', 'Cumulative explained variance'])
principal_component = 1
cum_explained_var = 0
for explained_var in pca.explained_variance_ratio_:
    cum_explained_var += explained_var
    t.add_row([principal_component, explained_var, cum_explained_var])
    principal_component += 1
print(t)

# Visualization of multidimensional data
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111)
img = ax.scatter(x=train_iris.loc[train_iris['species'] == 'virginica', 'PC1'],
                 y=train_iris.loc[train_iris['species'] == 'virginica', 'PC2'],
                 c='red', label='virginica', s=50)
img = ax.scatter(x=train_iris.loc[train_iris['species'] == 'setosa', 'PC1'],
                 y=train_iris.loc[train_iris['species'] == 'setosa', 'PC2'],
                 c='green', label='setosa', s=50)
img = ax.scatter(x=train_iris.loc[train_iris['species'] == 'versicolor', 'PC1'],
                 y=train_iris.loc[train_iris['species'] == 'versicolor', 'PC2'],
                 c='blue', label='versicolor', s=50)
ax.set_xlabel(xlabel='PC1', size=15)
ax.set_ylabel(ylabel='PC2', size=15)
ax.set_title('Graph of Principal Components', size=20)
plt.legend(title='Species')
plt.show()

# Part of the supervised learning process
from sklearn.linear_model import LogisticRegression
import datetime

X_test_scaler = scaler.transform(X_test)
X_test_pca = pca.transform(X_test_scaler)

def train_and_check(Xtrain, Xtest, ytrain, ytest):
    classifier = LogisticRegression(max_iter=100000)
    start = datetime.datetime.now()
    classifier.fit(Xtrain, ytrain)
    end = datetime.datetime.now()
    time = (end - start).microseconds
    evaluation = classifier.score(Xtest, ytest)
    return evaluation, time

results = PrettyTable(['Model', 'Accuracy', 'Training time (microseconds)'])

# Training the model on untransformed data
not_scaled_data = train_and_check(X_train, X_test, y_train, y_test)
results.add_row(['not scaled data', not_scaled_data[0], not_scaled_data[1]])

# Training the model on scaled data
scaled_data = train_and_check(X_train_scaler, X_test_scaler, y_train, y_test)
results.add_row(['scaled data', scaled_data[0], scaled_data[1]])

# Training the model on the 4 principal components
PC4_data = train_and_check(X_train_pca, X_test_pca, y_train, y_test)
results.add_row(['4 Principal Components', PC4_data[0], PC4_data[1]])

# Training the model on the 3 principal components
PC3_data = train_and_check(X_train_pca[:, :3], X_test_pca[:, :3], y_train, y_test)
results.add_row(['3 Principal Components', PC3_data[0], PC3_data[1]])

# Training the model on the 2 principal components
PC2_data = train_and_check(X_train_pca[:, :2], X_test_pca[:, :2], y_train, y_test)
results.add_row(['2 Principal Components', PC2_data[0], PC2_data[1]])

# Training the model on the 1 principal component
PC1_data = train_and_check(X_train_pca[:, :1], X_test_pca[:, :1], y_train, y_test)
results.add_row(['1 Principal Component', PC1_data[0], PC1_data[1]])

print(results)
```

- References

[How Where and When we should use PCA](https://towardsdatascience.com/how-where-and-when-we-should-use-pca-ab3dddad5888)
[PCA clearly explained —When, Why, How to use it and feature importance: A guide in Python](https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e)

### Linear Discriminant Analysis (LDA)

Like PCA, LDA looks for the combination of variables that best represents the data, and it can be used both for dimensionality reduction and for classification. LDA only applies to labelled data and is a supervised method: it maximizes the variance **between classes** while minimizing the variance within each class.

**LDA assumes that the input data (X) are normally distributed.**

![](https://i.imgur.com/sOEddDz.png)

```python=
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plt

lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X, y)
X_new = lda.transform(X)

plt.scatter(X_new[:, 0], X_new[:, 1], marker='o', c=y)
plt.show()
```
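Since LDA is supervised, the fitted estimator can also be used directly as a classifier, as noted above. Below is a minimal sketch on the iris data (the split and parameter values are only illustrative); note that `n_components` can be at most `n_classes - 1`.

```python=
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_components can be at most n_classes - 1 (here 3 - 1 = 2)
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X_train, y_train)

print(lda.explained_variance_ratio_)   # variance explained by each discriminant axis
print(lda.score(X_test, y_test))       # accuracy when used directly as a classifier
```

Unlike PCA, the projection axes here are chosen to separate the classes rather than to maximise overall variance.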
- References

[機器學習: 降維(Dimension Reduction)- 線性區別分析( Linear Discriminant Analysis)](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E9%99%8D%E7%B6%AD-dimension-reduction-%E7%B7%9A%E6%80%A7%E5%8D%80%E5%88%A5%E5%88%86%E6%9E%90-linear-discriminant-analysis-d4c40c4cf937)
[Linear Discriminant Analysis – Bit by Bit](https://sebastianraschka.com/Articles/2014_python_lda.html)

### Singular Value Decomposition (SVD)

SVD aims to extract the most important features from the data, and it is better suited to sparse matrices than PCA.

The implementation below uses sklearn's TruncatedSVD, which is more broadly applicable than a plain SVD: the data does not have to be centered, and it computes only the k largest singular values.

Although it is similar to PCA, PCA works from the covariance matrix, whereas TruncatedSVD works directly on the data matrix.

```python=
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
svd.fit(X)

print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
print(svd.singular_values_)
```
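To illustrate the point about sparse matrices, TruncatedSVD can be fitted on a scipy sparse matrix directly, without densifying or centering it. The random matrix below merely stands in for something like a TF-IDF document-term matrix; its shape and density are made up.

```python=
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A random sparse matrix as a stand-in for e.g. a TF-IDF document-term matrix
X_sparse = sparse_random(1000, 500, density=0.01, random_state=42)

svd = TruncatedSVD(n_components=5, random_state=42)
X_reduced = svd.fit_transform(X_sparse)   # works on the sparse input directly

print(X_reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

PCA, by contrast, centres the data and therefore cannot operate on a sparse input directly.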
- References

[Recommender System — singular value decomposition (SVD) & truncated SVD](https://towardsdatascience.com/recommender-system-singular-value-decomposition-svd-truncated-svd-97096338f361)

## 3. Manifold

Manifold methods exploit geometric structure to project the data into a lower-dimensional space while preserving that structure.

### Isomap embedding (Isometric Mapping)

Produces an embedded dataset that preserves the relationships between points (it keeps the geodesic distances between all points).

Steps:

1. Use a nearest-neighbour algorithm to build a neighbourhood network
2. Use a shortest-path algorithm to compute the geodesic distances, i.e. the distances between all data points along the surface
3. Apply an eigenvalue decomposition of the geodesic distance matrix to produce a low-dimensional embedding, which amounts to running MDS (multidimensional scaling) on the geodesic distance matrix

- Isomap

![](https://i.imgur.com/NZXi5lZ.jpg)

- MDS

![](https://i.imgur.com/HbuOSWy.png)

Parameters to tune in practice:

1. n_neighbors : the k of k-NN, default 5
2. n_components : the target dimensionality, default 2
3. path_method : the shortest-path algorithm; one of 'auto', 'FW', 'D', where FW is Floyd-Warshall and D is Dijkstra's

```python=
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, _ = load_digits(return_X_y=True)
print(X.shape)

embedding = Isomap(n_components=2)
X_transformed = embedding.fit_transform(X[:100])
print(X_transformed.shape)
```

However, Isomap becomes slow on large datasets because it searches for a globally optimal solution.

- References

[Manifold learning](https://scikit-learn.org/stable/modules/manifold.html)
[Nonlinear Dimensionality Reduction by Donovan Parks](https://www.cs.ubc.ca/~tmm/courses/533-07/slides/hidim.donovan-4x4.pdf)

### Locally linear embedding (LLE)

LLE is similar to Isomap: it also projects the data into a lower-dimensional space while preserving the relationships between points, but LLE only guarantees a locally optimal solution and reduces the dimensionality on that basis; it is also more tolerant of noise.

LLE first assumes that each data point is locally linear in its neighbourhood: for every point it finds its k nearest neighbours and expresses the point as a weighted linear combination of them, which is essentially running k-NN. For example, if a point $x_1$ in the original space has neighbours $x_2$, $x_3$, $x_4$, the linear combination is $x_1 = w_{12}x_2 + w_{13}x_3 + w_{14}x_4$.

After projecting into the low-dimensional space, we only need to make sure these weights stay unchanged (or change very little) to keep the local relationships largely intact.

Steps:

1. Run k-NN to find each point's nearest neighbours
2. Treat every data point as a combination of its k nearest neighbours, i.e. compute the local reconstruction weight matrix $W$ from each point's neighbours
3. Project the data into the low-dimensional space while changing these weights as little as possible: take the d eigenvectors corresponding to the d smallest eigenvalues of $(I-W)(I-W)^T$, which is equivalent to finding the combination of vectors that minimises the change in the weights

![](https://i.imgur.com/M0xaAbS.jpg)

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

[局部线性嵌入(LLE)原理总结](https://www.cnblogs.com/pinard/p/6266408.html)

One known issue with LLE is the regularisation problem: when the number of neighbours is larger than the input dimensionality (i.e. the points are densely packed), the local neighbourhood matrices become rank-deficient. Its results are also usually not as good as Isomap's; it is mainly faster.

Modified Locally Linear Embedding (MLLE) addresses this by using multiple weight vectors in each neighbourhood instead of a single one. In sklearn this is obtained simply by setting `method='modified'` in LLE.

Parameters to tune in practice:

1. n_neighbors : the k of k-NN, default 5
2. n_components : the target dimensionality, default 2
3. method : one of 'standard', 'hessian', 'modified', 'ltsa'; default is 'standard'
4. reg : float, default 1e-3; regularization constant, multiplies the trace of the local covariance matrix of the distances

```python=
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, _ = load_digits(return_X_y=True)
print(X.shape)

embedding = LocallyLinearEmbedding(n_components=2, method='modified')
X_transformed = embedding.fit_transform(X[:100])
print(X_transformed.shape)
```

### t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE converts the relationships between data points into probabilities. It is best suited to data visualisation and is very sensitive to local structure, but it is computationally expensive; scale the data and check the missing values ratio before using it.

Isomap and LLE are suited to unrolling a **single, continuous manifold**, whereas t-SNE resolves local structure and can pick out subsets of the data for analysis, so it can handle a dataset that contains **multiple manifolds**.

Steps:

1. Compute similarity probabilities between points in the original high-dimensional space, and between the corresponding points in the low-dimensional space. For example, given a Gaussian centred on point A, the similarity of A and B is the conditional probability that A would pick B as its neighbour when neighbours are sampled in proportion to the Gaussian's density.
2. Compute the KL divergence between the high-dimensional and low-dimensional conditional probability distributions
3. Minimise this KL divergence with gradient descent

Besides the high computational cost, its drawbacks are that its randomness makes results hard to reproduce (which can be handled with a random seed) and that it does not preserve global structure (this can be alleviated by initialising with PCA, i.e. `init='pca'` in sklearn). It is also limited to embeddings of one, two, or three dimensions.

Parameters that can be tuned depending on the situation:

1. perplexity : should grow with the dataset size; default 30
2. init : default 'random', can be changed to 'pca'
3. random_state
4. n_iter : number of iterations
5. n_components : the target dimensionality

```python=
import numpy as np
from sklearn.manifold import TSNE

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
# perplexity must be smaller than the number of samples (here only 4)
X_embedded = TSNE(n_components=2, perplexity=3).fit_transform(X)
print(X_embedded.shape)
```

- Reference

[資料降維與視覺化:t-SNE 理論與應用](https://mropengate.blogspot.com/2019/06/t-sne.html)
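A slightly more realistic sketch than the toy array above, applying the parameters discussed in this section (scaling first, PCA initialisation, a fixed `random_state`, and a perplexity of 30) to the digits dataset; the exact values are only illustrative.

```python=
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale the data before t-SNE

tsne = TSNE(n_components=2, init='pca', perplexity=30,
            random_state=42)            # fixed seed for reproducibility
X_embedded = tsne.fit_transform(X)

plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap='tab10', s=5)
plt.title('t-SNE embedding of the digits dataset')
plt.show()
```

Distances between well-separated clusters in the embedding are not globally meaningful, so the plot should be read for local grouping only.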