# Data Mining (NTNU)
##### [Back to Note Overview](https://hackmd.io/@NTNUCSIE112/NTNUNote)
{%hackmd @sophie8909/pink_theme %}
###### tags: `Data Mining` `111-1` `NTNU`
<!-- Tag order: [course] [semester] [department, required/elective] or [program, program name (without the word "program", e.g. 大師創業)] [school] -->
<!-- Remember to add this note to the Note Overview -->

## Score
- Midterm 25%
- Final 25%
- Homework 30%
- Term Project 20%

Both exams are closed book.

## 0916
![](https://i.imgur.com/p7nKGdv.png)
- Two types of predictive modeling tasks
    - Classification
        - used for discrete target variables (predicts a categorical value)
        - E.g., predicting whether a user will shop online (Yes/No)
    - Regression
        - used for continuous target variables (predicts a real value)
        - E.g., predicting a stock's closing price (a value)
- Classification: Definition
    - Given a collection of records (training set), i.e., labeled samples used as training data
    - Each record contains a set of attributes; one of the attributes is the class.
- Regression
    - Predict a value of a given continuous-valued variable
- Association Rule
- Anomaly Detection

## Classification: Basic Concepts, Decision Trees, and Model Evaluation
### Decision Tree Induction
#### Algorithms
- Hunt's Algorithm
- CART
- ID3
- C4.5

#### Performance Metrics
- Accuracy
- Error rate

#### Specifying the Test Condition
- attribute types
    - Binary
    - Nominal
    - Ordinal
    - Continuous
- number of ways to split
    - 2-way
    - Multi-way

#### Measures of Node Impurity
- GINI index: $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
![](https://i.imgur.com/9RyVngo.png)
![](https://i.imgur.com/JR53aWk.png)
- Entropy: $Entropy(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$
- Classification error: $Error(t) = 1 - \max_j p(j \mid t)$

## Classifier
### Rule-based Classifier
### Nearest-Neighbor Classifiers
### Bayesian Classifiers
- A probabilistic framework for solving classification problems
- Conditional probability
    - $P(X|Y) = \frac{P(X, Y)}{P(Y)}$
    - $P(Y|X) = \frac{P(X, Y)}{P(X)}$
- Bayes theorem
    - $P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$

#### ROC (Receiver Operating Characteristic)
- The ROC curve plots
    - TPR (true positive rate) against
    - FPR (false positive rate)
- The performance of each classifier is represented as a point on the ROC curve
    - changing the algorithm's threshold changes the location of the point

### Artificial Neural Network
### Support Vector Machines (SVM)

<!-- after the midterm -->

### Ensemble Methods
- Construct a set of classifiers from the training data
- Predict the class label of a previously unseen record by aggregating the predictions of multiple classifiers
![](https://i.imgur.com/fImoK4Q.png)
- Probability that the ensemble classifier makes a wrong prediction (assuming the classifiers are independent):
    - $n$: number of classifiers
    - $\epsilon$: error rate of each classifier

$$\sum_{i=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{i} \epsilon^i (1-\epsilon)^{n-i}$$

#### Boosting
- An iterative procedure to adaptively change the distribution of the training data
    - focusing more on previously misclassified records
- Initially, all $N$ records are assigned equal weights
- Weights may change at the end of each boosting round

## Ch.08 Cluster Analysis: Basic Concepts and Algorithms
### K-means
### Agglomerative Hierarchical Clustering
### DBSCAN
- DBSCAN is a density-based algorithm.
    - Density = number of points within a specified radius (Eps)
- A point is a **core point** if it has more than a specified number of points (MinPts) within Eps
- A **border point** has fewer than MinPts within Eps, but is in the neighborhood of a core point
- A **noise point** is any point that is not a core point or a border point.
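
As a quick illustration of these three point types, here is a minimal sketch using scikit-learn's `DBSCAN` (not from the lecture; the `eps` and `min_samples` values are arbitrary assumptions standing in for Eps and MinPts):

```python
# Minimal DBSCAN illustration with scikit-learn. eps and min_samples
# correspond to Eps and MinPts above; the values are illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that centroid-based methods handle poorly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_                   # cluster id per point; -1 marks noise
core_idx = db.core_sample_indices_    # indices of the core points

print("clusters found:", set(labels) - {-1})
print("core points:", len(core_idx), "| noise points:", int(np.sum(labels == -1)))
```

Points labeled `-1` are the noise points; every other point is either a core point or a border point of some density-connected cluster.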
- ![](https://i.imgur.com/CpEzk47.png)
- ![](https://i.imgur.com/QNKMF7k.png)
- ![](https://i.imgur.com/egoyaiT.png)
- Features
    - Resistant to noise
    - Can handle clusters of different shapes and sizes
- Determining Eps and MinPts
    - The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
    - Noise points have their kth nearest neighbor at a farther distance
    - So, plot the sorted distance of every point to its kth nearest neighbor

### Cluster Evaluation
#### Measures of Cluster Validity
#### Unsupervised Cluster Evaluation: Cohesion
#### Determining the Correct Number of Clusters
#### Unsupervised Cluster Evaluation: Silhouette Coefficient
#### External Measures of Cluster Validity: Entropy
#### External Measures of Cluster Validity: Purity
#### External Measures of Cluster Validity: NMI
![](https://i.imgur.com/ZTmt6o6.png)

$$I(C, T) = \sum^r_{i=1}\sum^k_{j=1} p_{ij}\log\left(\frac{p_{ij}}{p_{C_i}\cdot p_{T_j}}\right)$$

### Fuzzy c-means
#### Hard (Crisp) vs Soft (Fuzzy) Clustering
- Soft clustering
    - allows point $i$ to belong to more than one cluster, with membership weights $w_{ij} \in [0, 1]$
- Hard clustering
    - $w_{ij} \in \{0, 1\}$

#### Fuzzy c-means

### Mixture Models (the EM Algorithm)
#### Maximum Likelihood Estimation (MLE)
#### Mixture Model
- A value $x_i$ is generated according to the following process:
    - First select the nationality
        - With probability $\pi_G$ select Greek, with probability $\pi_C$ select Chinese ($\pi_G + \pi_C = 1$)
    - Given the nationality, generate the point from the corresponding Gaussian

### Clustering as Graph Cuts
#### The Graph Laplacian
![](https://i.imgur.com/9sFWBzH.png)

#### The Weight of Cut Edges
- Assume that $C_1 = \{1, 2, 3, 4\}$ and $C_2 = \{5, 6, 7\}$
![](https://i.imgur.com/kQCJNC6.png)
- The cluster indicator vectors
    - $c_1 = (1, 1, 1, 1, 0, 0, 0)^T$
    - $c_2 = (0, 0, 0, 0, 1, 1, 1)^T$
    - These matter because $c_1^T L c_1 = \sum_{(i,j)\in E} w_{ij}(c_{1,i} - c_{1,j})^2$, which is exactly the total weight of the edges cut between $C_1$ and $C_2$
- Laplacian matrix
    - ![](https://i.imgur.com/P0utof9.png)

### Spectral Clustering Algorithm
#### Eigenvectors of the Laplacian Matrix
$$L u_i = \lambda_i u_i$$

#### Spectral Clustering Algorithm
![](https://i.imgur.com/QKqP33t.png)

Cleaned-up pseudocode of the algorithm shown above:

```=
spectral_clustering(D, k):
    compute the similarity matrix A in R^{n x n}
    if ratio cut then B <- L
    else if normalized cut then B <- L^s or L^a
    solve B u_i = lambda_i u_i and keep the eigenvectors u_1, ..., u_k
        of the k smallest eigenvalues
    U <- (u_1, u_2, ..., u_k)
    Y <- U with each row normalized to unit length
    apply K-means to the rows of Y to obtain clusters C_1, ..., C_k
```
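
To make the pseudocode concrete, below is a hedged, self-contained NumPy sketch of the ratio-cut variant; the Gaussian similarity graph, the `sigma` value, and the sample data are illustrative assumptions, not part of the course material:

```python
# Ratio-cut spectral clustering sketch: build a similarity graph, form
# the unnormalized Laplacian L = D - A, embed points using the
# eigenvectors of the k smallest eigenvalues, then run K-means.
# The Gaussian-similarity graph and sigma are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) similarity matrix A
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)             # no self-loops

    L = np.diag(A.sum(axis=1)) - A       # unnormalized graph Laplacian

    # eigh returns eigenvalues in ascending order; keep the k smallest
    _, U = np.linalg.eigh(L)
    Y = U[:, :k]                         # spectral embedding of the points

    # Normalize each row to unit length, as in the pseudocode above
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Y = Y / np.where(norms == 0, 1.0, norms)

    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)

# Example usage on two synthetic Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(spectral_clustering(X, k=2))
```

For a normalized cut, one would instead form the symmetric or asymmetric normalized Laplacian ($L^s$ or $L^a$) before the eigendecomposition; the remaining steps are unchanged.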