###### tags: `Linear Algebra` `LA01`

# L08 Improvement on Knn and K-means

---

## Last week

- Unsupervised Learning
- K-means Clustering
- Our hand-made K-means

---

## This week

- Data Re-scaling
- Outliers
- Improved Knn and K-means, and prediction

---

### 1. Data Re-scaling

![](https://drive.google.com/uc?export=view&id=1n0-ShZ8dAgx_1yckdcSYLmgg6ISu9UQe)

----

### 1.1 Simplest Rescaling

<font size = "4">Run the following formula on each column (feature):</font>

$$x' = \frac{x - \min(X)}{\max(X) - \min(X)},\ \text{where}\ x \in X$$

**Little Challenge**: Given $\max(X) > \min(X)$, prove that:

$$0 \le x' \le 1$$

---

### 2. Outliers

Outliers are enemies of Machine Learning, most of the time.

![](https://drive.google.com/uc?export=view&id=14HwNTQjMtDsc2nUIffFuWdyYo1c6BgYp)

----

#### 2.1 Magic to detect Outliers

> The local outlier factor is based on a concept of a local density, where locality is given by k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers. -- Wikipedia

----

#### 2.2 We use sklearn

<font size = "5"> The details of the LOF algorithm are beyond the scope of this module. </font>

<font size = "5"> We will simply use a sklearn magic here: </font>

```python
from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=5)
outliers = clf.fit_predict(X)  # labels each point: -1 for outliers, 1 for inliers
```

<font size = "5"> So we can eliminate outliers when we are building our training set. </font>

---

### 3. Performance evaluation

<font size = "5"> For knn, we know that its performance can be evaluated by Accuracy, i.e. we try different k values, then choose the one that generates the highest Accuracy. But how about K-means?
</font>

<font size = "4"> For K-means, we use a metric called the **Pseudo F Statistic**:</font>

<font size = "4">$pseudo\ F = \frac{R^2 / (c-1)}{(1-R^2) / (n-c)}$, the ratio of the "between-cluster variance" to the "within-cluster variance"</font>

<font size = "3"> where $R^2 = 1 - \frac{\text{sum of variance in each cluster}}{\text{overall variance}}$, $variance = \sum_i ||x_i - mean(X)||^2$

$c$ = number of clusters

$n$ = number of observations
</font>

---

### Python Time :100:

----
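A minimal sketch of the rescaling formula from §1.1, assuming numpy; the function name `min_max_rescale` and the toy matrix are our own, not from the lecture:

```python
import numpy as np

def min_max_rescale(X):
    """Rescale each column of X to [0, 1] via (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)   # per-column minimum
    col_max = X.max(axis=0)   # per-column maximum
    return (X - col_min) / (col_max - col_min)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(min_max_rescale(X))  # each column rescaled to [0, 1]
```

Note that the minimum of each column maps to 0 and the maximum to 1, as the Little Challenge asks you to prove.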
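To show the LOF snippet from §2.2 end to end, here is a hedged sketch of filtering a training set, using hypothetical toy data (a tight grid of inliers plus one obvious outlier); `X_clean` is our own variable name:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 20 tightly packed inliers plus one far-away point (toy data for illustration).
inliers = np.array([[i * 0.1, j * 0.1] for i in range(4) for j in range(5)])
X = np.vstack([inliers, [[10.0, 10.0]]])

clf = LocalOutlierFactor(n_neighbors=5)
labels = clf.fit_predict(X)   # -1 for outliers, 1 for inliers

X_clean = X[labels == 1]      # keep only the inliers for training
```

Dropping the rows labelled `-1` before fitting a model is the "eliminate outliers" step mentioned on the slide.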
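The Pseudo F formula from §3 can be sketched directly in numpy; the function name `pseudo_f` and the example clusters are our own. (This quantity is also known as the Calinski-Harabasz index.)

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo F = (R^2 / (c - 1)) / ((1 - R^2) / (n - c)), where
    R^2 = 1 - (sum of within-cluster variance) / (overall variance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, c = len(X), len(clusters)
    # Overall variance: squared distances to the global mean.
    overall = np.sum((X - X.mean(axis=0)) ** 2)
    # Within-cluster variance: squared distances to each cluster's own mean.
    within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                 for k in clusters)
    r2 = 1.0 - within / overall
    return (r2 / (c - 1)) / ((1.0 - r2) / (n - c))

# Two well-separated clusters give a large pseudo F.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
print(pseudo_f(X, labels=np.array([0, 0, 1, 1])))  # ≈ 400.0
```

As with Accuracy for knn, you can try several values of k for K-means and pick the one with the highest pseudo F.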