# EDM Midterm
## Question 1
### a. five-number summary
* min = 18
* Q1 = 44
* Q2 (median) = 61
* Q3 = 72
* max = 92
### b. min-max normalization
:::info
min-max: $x' = \frac{x - \min}{\max - \min}$ (a quick sketch follows the list below)
:::
* 0.0
* 0.081
* 0.27
* 0.432
* 0.459
* 0.567
* 0.594
* 0.648
* 0.675
* 0.783
* 0.918
* 1.0
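A quick Python sketch of the same normalization (not part of the exam answer); note that the list above truncates the third decimal, while rounding gives e.g. 0.568 instead of 0.567.

```python
# min-max normalization of the 12 values from the question
values = [18, 24, 38, 50, 52, 60, 62, 66, 68, 76, 86, 92]

lo, hi = min(values), max(values)
normalized = [round((x - lo) / (hi - lo), 3) for x in values]
print(normalized)
# [0.0, 0.081, 0.27, 0.432, 0.459, 0.568, 0.595, 0.649, 0.676, 0.784, 0.919, 1.0]
```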
### c. partition
#### by depth
:::info
depth = 12/3 = 4 (values per bin)
:::
* 18, 24, 38, 50
* 52, 60, 62, 66
* 68, 76, 86, 92
#### by width
:::info
width = 74/3 ≈ 24.67 (see the sketch below)
:::
* 18, 24, 38
* 50, 52, 60, 62, 66
* 68, 76, 86, 92
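A small sketch of both partitioning schemes (my own illustration, assuming 3 bins as in the answers above):

```python
# equal-depth vs. equal-width binning of the sorted data
data = [18, 24, 38, 50, 52, 60, 62, 66, 68, 76, 86, 92]
n_bins = 3

# equal-depth (equal-frequency): the same number of values per bin
depth = len(data) // n_bins                      # 12 / 3 = 4 values per bin
depth_bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]

# equal-width: each bin covers an equal value range of width (max - min) / n_bins
width = (max(data) - min(data)) / n_bins         # 74 / 3 ≈ 24.67
width_bins = [[] for _ in range(n_bins)]
for x in data:
    idx = min(int((x - min(data)) // width), n_bins - 1)
    width_bins[idx].append(x)

print(depth_bins)  # [[18, 24, 38, 50], [52, 60, 62, 66], [68, 76, 86, 92]]
print(width_bins)  # [[18, 24, 38], [50, 52, 60, 62, 66], [68, 76, 86, 92]]
```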
### d. smoothing
#### width & mean
:::warning
values rounded to the nearest integer (round half up)
:::
* 27, 27, 27
* 58, 58, 58, 58, 58
* 81, 81, 81, 81
#### width & boundaries
:::warning
values rounded to the nearest integer
:::
* 18, 18, 38
* 50, 50, 66, 66, 66
* 68, 68, 92, 92
#### depth & mean
:::warning
values rounded to the nearest integer (round half up)
:::
* 33, 33, 33, 33
* 60, 60, 60, 60
* 81, 81, 81, 81
#### depth & boundaries
:::warning
values rounded to the nearest integer
:::
* 18, 18, 50, 50
* 52, 66, 66, 66
* 68, 68, 92, 92
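A compact sketch of both smoothing methods, applied here to the equal-width bins from part (c); the equal-depth bins work the same way:

```python
bins = [[18, 24, 38], [50, 52, 60, 62, 66], [68, 76, 86, 92]]

# smoothing by bin means: every value becomes its bin's mean, rounded half up
by_means = [[int(sum(b) / len(b) + 0.5)] * len(b) for b in bins]

# smoothing by bin boundaries: every value snaps to the closer of its bin's min/max
by_boundaries = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
                 for b in bins]

print(by_means)       # [[27, 27, 27], [58, 58, 58, 58, 58], [81, 81, 81, 81]]
print(by_boundaries)  # [[18, 18, 38], [50, 50, 66, 66, 66], [68, 68, 92, 92]]
```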
## Question 2
### a. what "Naive" means
It describes the assumption that all attributes are conditionally independent of one another given the class.
### b. F-Measure
* used to evaluate the balance between recall and precision
> * recall = $\frac{TP}{TP+FN}$: how many of the actually **True** items are recalled
> * precision = $\frac{TP}{TP+FP}$: how many of the items predicted **True** are actually true
* F-measure = $\frac{TP}{TP + \frac{1}{2}(FP+FN)}$ (see the sketch below)
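A tiny sketch of these three measures; the confusion-matrix counts are made-up placeholders, not from the exam:

```python
tp, fp, fn = 8, 2, 4               # hypothetical counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = tp / (tp + 0.5 * (fp + fn))   # equivalent to 2PR / (P + R)
print(precision, recall, f1)       # ≈ 0.80, 0.67, 0.73
```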
## Question 3
### a. Nominal Attributes
* Jaccard Index = $\frac{|A \cap B|}{|A \cup B|}$
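A one-line sketch with made-up nominal value sets:

```python
A = {"red", "green", "blue"}
B = {"green", "blue", "yellow"}
print(len(A & B) / len(A | B))  # |A ∩ B| / |A ∪ B| = 2 / 4 = 0.5
```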
### b. Term-Frequency Vectors
* Cosine Similarity = $\frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
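A short sketch with two made-up term-frequency vectors:

```python
import math

a = [3, 0, 1, 2]
b = [1, 1, 0, 2]

dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
print(dot / (norm_a * norm_b))  # cosine similarity
```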
## Question 4
### 5 Vs
1. Volume
    * a large volume of data
2. Variety
    * data comes in a variety of structures
3. Value
    * data carries value, even when it seems unrelated to anything else
4. Velocity
    * real-time processing
5. Veracity
    * data collected from the real world must be accurate and trustworthy
## Question 5
### a.
T
:::danger
WRONG
:::
### b.
F; the criteria for selecting patterns of interest should be defined by the situation and the user.
### c.
F; different methods may consider attributes in different priority orders, which results in different decision trees.
### d.
F; no algorithm is best for every dataset. The best one should be chosen based on the characteristics of the dataset, the situation, and the user.
### e.
F; a model whose assumptions better match the actual data distribution is more likely to perform well.
### f.
F; many metrics, such as accuracy, precision, recall, F1 score, and area under the ROC curve, should be considered in order to measure the performance of a model.
### g.
F; it is possible.
## Question 6
### a. Contingency Table
| | b | !b | sum |
|-|-|-|-|
|**c** |3|2|5|
|**!c** |4|1|5|
|**sum**|7|3|10|

| | a | !a | sum |
|-|-|-|-|
|**d** |4|5|9|
|**!d** |1|0|1|
|**sum**|5|5|10|

| | b | !b | sum |
|-|-|-|-|
|**d** |6|3|9|
|**!d** |1|0|1|
|**sum**|7|3|10|

| | c | !c | sum |
|-|-|-|-|
|**e** |2|4|6|
|**!e** |2|2|4|
|**sum**|4|6|10|

| | a | !a | sum |
|-|-|-|-|
|**c** |2|3|5|
|**!c** |3|2|5|
|**sum**|5|5|10|
### b. rank in descending order
#### support
> bc= 3
> ad= 4
> bd= 6
> ec= 2
> ca= 2
#### confidence
> {b} $\rightarrow$ {c} = 3/7
> {a} $\rightarrow$ {d} = 4/5
> {b} $\rightarrow$ {d} = 6/7
> {e} $\rightarrow$ {c} = 2/6
> {c} $\rightarrow$ {a} = 2/5
#### lift
:::info
Lift = $\frac{P(A|B)}{P(A)} = \frac{P(A \cap B)}{P(A) \cdot P(B)}$
:::
> {b} $\rightarrow$ {c} = $\frac{0.3}{0.5 \times 0.7}$
> {a} $\rightarrow$ {d} = $\frac{0.4}{0.9 \times 0.5}$
> {b} $\rightarrow$ {d} = $\frac{0.6}{0.9 \times 0.7}$
> {e} $\rightarrow$ {c} = $\frac{0.2}{0.4 \times 0.6}$
> {c} $\rightarrow$ {a} = $\frac{0.2}{0.5 \times 0.5}$
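A minimal sketch of the three rule measures from co-occurrence counts, using the {b} $\rightarrow$ {c} numbers from the tables above (support count 3, b in 7 of 10 transactions, c in 5):

```python
def rule_metrics(n_ab, n_a, n_b, n_total):
    """Support, confidence, and lift for the rule A -> B."""
    support = n_ab / n_total                               # P(A and B)
    confidence = n_ab / n_a                                # P(B | A)
    lift = support / ((n_a / n_total) * (n_b / n_total))   # P(A and B) / (P(A) P(B))
    return support, confidence, lift

print(rule_metrics(n_ab=3, n_a=7, n_b=5, n_total=10))      # rule {b} -> {c}
```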
### c. Apriori Algorithm
#### one element
> d= 9
> b= 6
> e= 6
> a= 5
> c= 4
#### two elements
> db= 6
> de= 6
> da= 4
> dc= 4
> be= 4
> ~~ba= 3~~
> ~~bc= 3~~
> ea= 4
> ~~ec= 2~~
> ~~ac= 2~~
#### three elements
> dbe= 4
> ~~dba= 2~~
> dea= 4
> ~~bea= 2~~
> ...
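A rough sketch of the level-wise Apriori idea used above: count candidates, keep only those meeting min_support, then join survivors into the next level. The transaction list and the threshold of 2 are made-up placeholders, not the exam data:

```python
from itertools import combinations

transactions = [{"a", "d"}, {"b", "d", "e"}, {"a", "b", "d", "e"}, {"c", "d"}]
min_support = 2  # absolute count

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # count each candidate's support in this level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # join step: build (k+1)-itemsets, pruning any with an infrequent subset
        level = []
        for a, b in combinations(list(survivors), 2):
            cand = a | b
            if len(cand) == k + 1 and all(frozenset(s) in survivors
                                          for s in combinations(cand, k)):
                level.append(cand)
        level = list(set(level))
        k += 1
    return frequent

for itemset, count in sorted(apriori(transactions, min_support).items(),
                             key=lambda kv: (-kv[1], len(kv[0]))):
    print(set(itemset), count)
```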
## Question 7
### a. Comparison
* Decision Tree
    * Pros:
        * Easy to understand & explain
        * Can be visualized
    * Cons:
        * Prone to overfitting
        * Sensitive to noise
        * Not well suited to complicated data
* Naive Bayes
    * Pros:
        * Easy to understand & explain
        * Simple math
    * Cons:
        * Assumes all attributes are independent of each other, which is generally not true
* Neural Network
    * Pros:
        * Performs well on both simple and complicated data
    * Cons:
        * Difficult to interpret
        * Demands significant computing resources
### b. K-means clustering algorithm
* Pros:
    * Easy to implement
    * Scales to large datasets
* Cons:
    * Sensitive to noise and outliers
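A minimal NumPy sketch of the k-means loop (assignment + update steps) on made-up 2-D points, with k assumed to be 2:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # toy data

def kmeans(X, k=2, n_iter=100):
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(X, k=2)
print(centers)
```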
### c. Class imbalance
* Definition:
    * the distribution of classes in a dataset is highly uneven
* Possible methods to fix it:
    * undersampling: randomly remove samples from the majority classes
    * oversampling: randomly duplicate samples from the minority classes
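A small sketch of both resampling strategies on made-up labels (90 negatives vs. 10 positives); only indices are resampled here:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)          # imbalanced toy labels

neg_idx = np.where(y == 0)[0]
pos_idx = np.where(y == 1)[0]

# undersampling: randomly drop majority-class samples down to the minority size
under_idx = np.concatenate([rng.choice(neg_idx, size=len(pos_idx), replace=False), pos_idx])

# oversampling: randomly duplicate minority-class samples up to the majority size
over_idx = np.concatenate([neg_idx, rng.choice(pos_idx, size=len(neg_idx), replace=True)])

print(len(under_idx), len(over_idx))       # 20 and 180, both balanced
```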
## Question 8
### a. Cross Validation
* The process of splitting the dataset into k folds, training the model on k-1 of them, and validating on the remaining fold. After repeating this k times 🔁, the average validation error indicates how accurate the model is (a minimal sketch follows).
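A bare-bones k-fold sketch (k assumed to be 5) on made-up data, with an ordinary least-squares fit standing in for the model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
folds = np.array_split(rng.permutation(len(X)), k)

errors = []
for i in range(k):
    val_idx = folds[i]                                        # held-out fold
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)  # "train"
    pred = X[val_idx] @ w                                     # "validate"
    errors.append(np.mean((pred - y[val_idx]) ** 2))

print("average validation error:", np.mean(errors))
```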
### b. False Negatives
* The number of actually positive cases that were incorrectly classified as negative.
## Question 9
### a. Information Gain
:::info
* entropy = $-p \cdot log_2(p) - q \cdot log_2(q)$
:::
* entropy(all) = $- \frac{4}{10} log_2(\frac{4}{10}) - \frac{6}{10} log_2(\frac{6}{10})$
* AT(4+, 3-)
* entropy(AT)= $- \frac{4}{7} log_2(\frac{4}{7}) - \frac{3}{7} log_2(\frac{3}{7})$
* AF(0+, 3-)
* entropy(AF)= $- \frac{0}{3} log_2(\frac{0}{3}) - \frac{3}{3} log_2(\frac{3}{3}) = 0$
* gain(A) = $entropy(all) - \frac{7}{10} \cdot entropy(AT) - \frac{3}{10} \cdot entropy(AF)$
* BT
* BF
* gain(B)
* compare gain(A) & gain(B) and choose the ==larger one== (see the sketch below)
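A short sketch of the gain computation for attribute A, reusing the class counts above (4+/6- overall, 4+/3- on the True branch, 0+/3- on the False branch):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                    # 0 * log2(0) is treated as 0
            result -= p * math.log2(p)
    return result

e_all = entropy(4, 6)
gain_a = e_all - (7 / 10) * entropy(4, 3) - (3 / 10) * entropy(0, 3)
print(round(gain_a, 3))              # ≈ 0.281
```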
### b. GINI Index
:::info
Gini index = 1 - $\sum (P_i)^2$
:::
* Gini(all) = $1 - (\frac{4}{10})^2 - (\frac{6}{10})^2$
* AT(4+, 3-)
* Gini(AT) = $1 - (\frac{4}{7})^2 - (\frac{3}{7})^2$
* AF(0+, 3-)
* Gini(AF) = $1 - 0 - 1 = 0$
* Gini(A) = $\frac{7}{10} \cdot Gini(AT) + \frac{3}{10} \cdot Gini(AF)$
* BT
* BF
* Gini(B)
* Compare Gini(A) & Gini(B) and choose the ==smaller one== (see the sketch below)
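The same idea for the Gini index, again with attribute A's counts; the attribute with the smaller weighted Gini (equivalently, the larger reduction) is chosen:

```python
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

g_all = gini(4, 6)
g_a = (7 / 10) * gini(4, 3) + (3 / 10) * gini(0, 3)   # weighted Gini of the split on A
print(round(g_a, 3), round(g_all - g_a, 3))           # ≈ 0.343 and 0.137
```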