# Data Mining
NTNU Data Mining
##### [Back to Note Overview](https://hackmd.io/@NTNUCSIE112/NTNUNote)
{%hackmd @sophie8909/pink_theme %}
###### tags: `Data Mining` `111-1` `NTNU`
<!-- tag order: [course] [semester] [department required/elective] or [program, program name (e.g. 大師創業)] [school] -->
<!-- Remember to add this note to the Note Overview -->
## Score
- Midterm 25%
- Final 25%
- Homework 30%
- Term Project 20%
Exams are closed-book.
## 0916

- Two types of predictive modeling tasks
- Classification
- used for discrete target variables (predicting categorical values)
- E.g., predicting whether a user will shop online (Yes/No)
- Regression
- used for continuous target variables (predicting real values)
- E.g., predicting a stock's closing price (a real value)
- Classification: Definition
- Given a collection of records (the training set), i.e., samples with known answers
- Each record contains a set of attributes, one of the attributes is the class.
- Regression
- Predict a value of a given continuous valued variable
- Association Rule
- Anomaly Detection
## Classification: Basic Concepts, Decision Trees, and Model Evaluation
### Decision Tree Induction
#### Algorithms
- Hunt's Algorithm
- CART
- ID3
- C4.5
#### Performance Metric
- Accuracy
- Error Rate
#### Specify Test Condition
- attribute types
- Binary
- Nominal
- Ordinal
- Continuous
- number of ways to split
- 2-way
- Multi-way
#### Measures of Node Impurity
- GINI index
- Entropy
- Classification error
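
The standard definitions of these impurity measures at a node $t$, where $p(j \mid t)$ is the relative frequency of class $j$ at node $t$:
- $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
- $Entropy(t) = -\sum_j p(j \mid t)\log_2 p(j \mid t)$
- $Error(t) = 1 - \max_j p(j \mid t)$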
## Classifier
### Rule-based Classifier
### Nearest-Neighbor classifiers
### Bayesian Classifiers
- A probabilistic framework for solving classification problems
- Conditional Probability
- $P(X|Y) = \frac{P(X, Y)}{P(Y)}$
- $P(Y|X) = \frac{P(X, Y)}{P(X)}$
- Bayes theorem
- $P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$
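
A minimal worked example of Bayes theorem in Python; the probabilities below are hypothetical numbers for illustration only:

```python
# Hypothetical numbers: P(Y) prior, P(X|Y) class-conditional, P(X) evidence
p_y = 0.2          # e.g. P(user is a buyer)
p_x_given_y = 0.7  # e.g. P(visits site nightly | buyer)
p_x = 0.3          # e.g. P(visits site nightly)

# Bayes theorem: P(Y|X) = P(X|Y) P(Y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)  # ~0.467
```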
#### ROC (Receiver Operating Characteristic)
- ROC curve plots
- TPR (true positive rate) against
- FPR (false positive rate)
- The performance of each classifier is represented as a point on the ROC curve
- changing the threshold of algorithm
- changes the location of the point
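
A minimal sketch of computing an ROC curve with scikit-learn; the labels and scores below are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                    # hypothetical true labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])  # hypothetical classifier scores

# One (FPR, TPR) point per threshold; sweeping the threshold traces out the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve
```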
### Artificial Neural Network
### Support Vector Machines (SVM)
<!-- after the midterm -->
### Ensemble Methods
- Construct a set of classifiers from the training data
- Predict the class label of previously unseen records by aggregating the predictions of multiple classifiers

- Probability that the ensemble classifier makes a wrong prediction (assuming the classifiers are independent)
- $n$: number of classifiers
- $\epsilon$: error rate of each classifier

$e_{ensemble} = \sum^{n}_{i=\lfloor n/2 \rfloor + 1}\begin{pmatrix} n\\i \end{pmatrix}\epsilon^i(1-\epsilon)^{n-i}$
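
A quick check of this formula in Python; the classic textbook setting of $n = 25$ classifiers with $\epsilon = 0.35$ gives an ensemble error of about 0.06:

```python
from math import comb

def ensemble_error(n, eps):
    """Probability that a majority vote of n independent classifiers,
    each with error rate eps, is wrong (more than half of them err)."""
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range(n // 2 + 1, n + 1))

print(ensemble_error(25, 0.35))  # ~0.06, far below the individual error rate
```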
#### Bagging
- Build each classifier on a bootstrap sample of the training data (sampling with replacement), then combine their predictions
#### Boosting
- An iterative procedure to adaptively change the distribution of the training data
- focusing more on previously misclassified records
- Initially, all N records are assigned equal weights
- Weights may change at the end of each boosting round
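
A minimal sketch of one boosting round's weight update (AdaBoost-style; the record count and the misclassification mask are hypothetical, not the course's exact algorithm):

```python
import numpy as np

N = 8
w = np.full(N, 1 / N)                                     # all N records start with equal weight
misclassified = np.array([0, 1, 0, 0, 1, 0, 0, 0], bool)  # hypothetical mistakes this round

eps = w[misclassified].sum()                         # weighted error of this round's classifier
alpha = 0.5 * np.log((1 - eps) / eps)                # importance of this classifier
w *= np.exp(np.where(misclassified, alpha, -alpha))  # up-weight the mistakes
w /= w.sum()                                         # renormalize to a distribution
print(w)                                             # misclassified records now carry more weight
```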
## Ch.08 Cluster Analysis: Basic Concepts and Algorithms
### K-means
### Agglomerative Hierarchical Clustering
### DBSCAN
- DBSCAN is a density-based algorithm.
- Density = number of points within a specified radius (Eps)
- A point is a **core point** if it has more than a specified number of points (MinPts) within Eps
- A **border point** has fewer than MinPts within Eps, but is in the neighborhood of a core point
- A **noise point** is any point that is not a core point or a border point.
- Features
- Resistant to noise
- Can handle clusters of different shapes and sizes
- Determining Eps and MinPts
- Idea: for points in a cluster, their kth nearest neighbors are at roughly the same distance
- Noise points have their kth nearest neighbor at a farther distance
- So, plot the sorted distance of every point to its kth nearest neighbor and look for a knee (see the sketch below)
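
A minimal sketch of that k-distance plot, assuming scikit-learn is available; the data here is random and just for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(200, 2)   # hypothetical 2-D data
k = 4                        # a common heuristic is k = MinPts

# k+1 neighbors because each point counts itself as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
k_dist = np.sort(dist[:, -1])  # each point's distance to its kth nearest neighbor, sorted

# Plot k_dist and choose Eps at the "knee" where the curve bends sharply;
# points beyond the knee are candidate noise points.
```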
### Cluster Evaluation
#### Measures of Cluster Validity
#### Unsupervised Cluster Evaluation: Cohesion
#### Determining The Correct Number of Clusters
#### Unsupervised Cluster Evaluation: Silhouette Coefficient
#### External Measures of Cluster Validity: Entropy
#### External Measures of Cluster Validity: Purity
#### External Measures of Cluster Validity: NMI

$I(C, T) = \sum^r_{i=1}\sum^k_{j=1}p_{ij}\log\left(\frac{p_{ij}}{p_{C_i}\cdot p_{T_j}}\right)$
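
One common normalization of this mutual information uses the geometric mean of the two entropies:

$NMI(C, T) = \frac{I(C, T)}{\sqrt{H(C)\,H(T)}}, \quad H(C) = -\sum^r_{i=1} p_{C_i}\log p_{C_i}, \quad H(T) = -\sum^k_{j=1} p_{T_j}\log p_{T_j}$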
### Fuzzy c-means
#### Hard (Crisp) vs Soft (Fuzzy) Clustering
- Soft clustering
- allows point $i$ to belong to more than one cluster $j$, with membership weights $w_{ij} \in [0, 1]$
- Hard clustering
- $w_{ij} \in \{0, 1\}$
#### Fuzzy c-means
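Fuzzy c-means minimizes a weighted SSE; the standard objective, with fuzzifier $p > 1$ controlling how soft the memberships are:

$SSE = \sum^{k}_{j=1}\sum^{n}_{i=1} w_{ij}^{p}\, dist(x_i, c_j)^2, \quad \text{subject to } \sum^{k}_{j=1} w_{ij} = 1 \text{ for each point } i$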
### Mixture Models (the EM Algorithm)
#### Maximum Likelihood Estimation (MLE)
#### Mixture model
- A value $x_i$ is generated according to the following process
- First select the nationality
- With probability $\pi_G$ select Greek, with probability $\pi_C$ select Chinese ($\pi_G + \pi_C = 1$)
- Given the nationality, generate the point from the corresponding Gaussian
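
A minimal sketch of this generative process in Python; the mixing weights and Gaussian parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_G, pi_C = 0.6, 0.4          # hypothetical mixing weights, pi_G + pi_C = 1
mu_G, sigma_G = 175.0, 7.0     # hypothetical Gaussian for the Greek component
mu_C, sigma_C = 168.0, 6.0     # hypothetical Gaussian for the Chinese component

def sample():
    # First select the nationality, then draw x_i from that component's Gaussian
    if rng.random() < pi_G:
        return rng.normal(mu_G, sigma_G)
    return rng.normal(mu_C, sigma_C)

x = [sample() for _ in range(5)]
print(x)
```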
### Clustering as Graph Cuts
#### The graph Laplacian
- $L = D - A$, where $A$ is the (weighted) adjacency matrix and $D$ is the diagonal degree matrix with $d_{ii} = \sum_j a_{ij}$
#### The Weight of Cut Edges
- Assume that $C_1 = \{1, 2, 3, 4\}$ and $C_2 = \{5, 6, 7\}$

- The cluster indicator vectors
- $c_1 = (1, 1, 1, 1, 0, 0, 0)^T$
- $c_2 = (0, 0, 0, 0, 1, 1, 1)^T$
- Laplacian matrix (see the identity below)
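
For any vector $c$, the Laplacian satisfies $c^T L c = \sum_{(u,v)\in E} a_{uv}(c_u - c_v)^2$, so when $c$ is a cluster indicator vector, every within-cluster edge contributes 0 and every cut edge contributes its weight. In this two-cluster example the weight of the cut edges is therefore

$W(C_1, C_2) = c_1^T L\, c_1 = c_2^T L\, c_2$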
### Spectral Clustering Algorithm
#### Eigenvectors of Laplacian matrix
$Lu_i = \lambda_i u_i$
#### Spectral Clustering Algorithm

```
spectral_clustering(D, k):
    compute the similarity matrix A ∈ R^(n×n)
    if ratio cut then B ← L
    else if normalized cut then B ← L^s or L^a
    solve Bu_i = λ_i u_i for the k smallest eigenvalues, U ← (u_1, ..., u_k)
    Y ← normalize the rows of U to unit length
    C ← {C_1, ..., C_k} via K-means on the rows of Y
```
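
A runnable sketch of the ratio-cut variant in Python, assuming a Gaussian similarity matrix and scikit-learn's KMeans for the final step; the kernel choice and bandwidth `sigma` are assumptions, not fixed by the course notes:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Gaussian (RBF) similarity matrix -- an assumed choice of kernel
    A = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    np.fill_diagonal(A, 0)
    D = np.diag(A.sum(axis=1))
    L = D - A                                  # unnormalized Laplacian (ratio cut)
    # eigh returns eigenvalues in ascending order: take the k smallest
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)  # normalize rows to unit length
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)  # cluster the rows of Y
```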