###### tags: `Linear Algebra` `LA01`

# L08 Improvement on Knn and K-means

---

## Last week

- Unsupervised Learning
- K-means Clustering
- Our hand-made K-means

---

## This week

- Data Re-scaling
- Outliers
- Improved Knn and K-means, and prediction

---

### 1. Data Re-scaling

![](https://drive.google.com/uc?export=view&id=1n0-ShZ8dAgx_1yckdcSYLmgg6ISu9UQe)

----

### 1.1 Simplest Rescaling

<font size = "4">Run the following formula on each column (feature):</font>

$$x' = \frac{x - \min(X)}{\max(X) - \min(X)},\ \text{where}\ x \in X$$

**Little Challenge**: Given $\max(X) > \min(X)$, prove that:

$$0 \le x' \le 1$$

---

### 2. Outliers

Outliers are enemies of Machine Learning, most of the time.

![](https://drive.google.com/uc?export=view&id=14HwNTQjMtDsc2nUIffFuWdyYo1c6BgYp)

----

#### 2.1 Magic to detect Outliers

> The local outlier factor is based on a concept of a local density, where locality is given by k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers. -- Wikipedia

----

#### 2.2 We use sklearn

<font size = "5"> The details of the LOF algorithm are beyond the scope of this module. </font>

<font size = "5"> We will simply use a sklearn magic here: </font>

```python
from sklearn.neighbors import LocalOutlierFactor

clf = LocalOutlierFactor(n_neighbors=5)
outliers = clf.fit_predict(X)  # labels each point: -1 for outliers, 1 for inliers
```

<font size = "5"> So we can eliminate outliers when we are building our training set. </font>

---

### 3. Performance evaluation

<font size = "5"> For knn, we know that its performance can be evaluated by Accuracy, i.e. we try different k values, then choose the one that generates the highest Accuracy. But how about K-means?
</font>

<font size = "4"> For K-means, we use a metric called the **Pseudo F Statistic**:</font>

<font size = "4">$pseudo\ F = \frac{R^2 / (c-1)}{(1-R^2) / (n-c)}$, the ratio of the "between-cluster variance" to the "within-cluster variance"</font>

<font size = "3"> where $R^2 = 1 - \frac{\text{sum of variance in each cluster}}{\text{overall variance}}$, $variance = \sum_i ||x_i - mean(X)||^2$

$c$ = number of clusters

$n$ = number of observations
</font>

---

### Python Time :100:

----
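A minimal sketch of the rescaling formula from §1.1, assuming numpy; the function name `min_max_rescale` and the toy matrix are our own, not from the lecture:

```python
import numpy as np

def min_max_rescale(X):
    """Rescale each column of X to [0, 1] via (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)   # per-column minimum
    col_max = X.max(axis=0)   # per-column maximum
    return (X - col_min) / (col_max - col_min)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(min_max_rescale(X))  # each column rescaled to [0, 1]
```

Note that the minimum of each column maps to 0 and the maximum to 1, as the Little Challenge asks you to prove.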
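To show the LOF snippet from §2.2 end to end, here is a hedged sketch of filtering a training set, using hypothetical toy data (a tight grid of inliers plus one obvious outlier); `X_clean` is our own variable name:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 20 tightly packed inliers plus one far-away point (toy data for illustration).
inliers = np.array([[i * 0.1, j * 0.1] for i in range(4) for j in range(5)])
X = np.vstack([inliers, [[10.0, 10.0]]])

clf = LocalOutlierFactor(n_neighbors=5)
labels = clf.fit_predict(X)   # -1 for outliers, 1 for inliers

X_clean = X[labels == 1]      # keep only the inliers for training
```

Dropping the rows labelled `-1` before fitting a model is the "eliminate outliers" step mentioned on the slide.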
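The Pseudo F formula from §3 can be sketched directly in numpy; the function name `pseudo_f` and the example clusters are our own. (This quantity is also known as the Calinski-Harabasz index.)

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo F = (R^2 / (c - 1)) / ((1 - R^2) / (n - c)), where
    R^2 = 1 - (sum of within-cluster variance) / (overall variance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, c = len(X), len(clusters)
    # Overall variance: squared distances to the global mean.
    overall = np.sum((X - X.mean(axis=0)) ** 2)
    # Within-cluster variance: squared distances to each cluster's own mean.
    within = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                 for k in clusters)
    r2 = 1.0 - within / overall
    return (r2 / (c - 1)) / ((1.0 - r2) / (n - c))

# Two well-separated clusters give a large pseudo F.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
print(pseudo_f(X, labels=np.array([0, 0, 1, 1])))  # ≈ 400.0
```

As with Accuracy for knn, you can try several values of k for K-means and pick the one with the highest pseudo F.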