# EDM Midterm
## Question 1
### a. five-number summary
* min = 18
* Q1 = 44
* Q2 (median) = 61
* Q3 = 72
* max = 92
### b. min-max normalization
:::info
min-max: $x' = \frac{x - \min}{\max - \min}$ (a quick sketch follows the list below)
:::
* 0.0
* 0.081
* 0.27
* 0.432
* 0.459
* 0.567
* 0.594
* 0.648
* 0.675
* 0.783
* 0.918
* 1.0
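A quick Python sketch of the same normalization (not part of the exam answer); note that the list above truncates the third decimal, while rounding gives e.g. 0.568 instead of 0.567.

```python
# min-max normalization of the 12 values from the question
values = [18, 24, 38, 50, 52, 60, 62, 66, 68, 76, 86, 92]

lo, hi = min(values), max(values)
normalized = [round((x - lo) / (hi - lo), 3) for x in values]
print(normalized)
# [0.0, 0.081, 0.27, 0.432, 0.459, 0.568, 0.595, 0.649, 0.676, 0.784, 0.919, 1.0]
```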
### c. partition
#### by depth
:::info
depth = 12/3 = 4 (values per bin)
:::
* 18, 24, 38, 50
* 52, 60, 62, 66
* 68, 76, 86, 92
#### by width
:::info
width = 74/3 ≈ 24.67 (see the sketch below)
:::
* 18, 24, 38
* 50, 52, 60, 62, 66
* 68, 76, 86, 92
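A small sketch of both partitioning schemes (my own illustration, assuming 3 bins as in the answers above):

```python
# equal-depth vs. equal-width binning of the sorted data
data = [18, 24, 38, 50, 52, 60, 62, 66, 68, 76, 86, 92]
n_bins = 3

# equal-depth (equal-frequency): the same number of values per bin
depth = len(data) // n_bins                      # 12 / 3 = 4 values per bin
depth_bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]

# equal-width: each bin covers an equal value range of width (max - min) / n_bins
width = (max(data) - min(data)) / n_bins         # 74 / 3 ≈ 24.67
width_bins = [[] for _ in range(n_bins)]
for x in data:
    idx = min(int((x - min(data)) // width), n_bins - 1)
    width_bins[idx].append(x)

print(depth_bins)  # [[18, 24, 38, 50], [52, 60, 62, 66], [68, 76, 86, 92]]
print(width_bins)  # [[18, 24, 38], [50, 52, 60, 62, 66], [68, 76, 86, 92]]
```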
### d. smoothing
#### width & mean
:::warning
values rounded to the nearest integer (round half up)
:::
* 27, 27, 27
* 58, 58, 58, 58, 58
* 81, 81, 81, 81
#### width & boundaries
:::warning
values rounded to the nearest integer
:::
* 18, 18, 38
* 50, 50, 66, 66, 66
* 68, 68, 92, 92
#### depth & mean
:::warning
values rounded to the nearest integer (round half up)
:::
* 33, 33, 33, 33
* 60, 60, 60, 60
* 81, 81, 81, 81
#### depth & boundaries
:::warning
values rounded to the nearest integer
:::
* 18, 18, 50, 50
* 52, 66, 66, 66
* 68, 68, 92, 92
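A compact sketch of both smoothing methods, applied here to the equal-width bins from part (c); the equal-depth bins work the same way:

```python
bins = [[18, 24, 38], [50, 52, 60, 62, 66], [68, 76, 86, 92]]

# smoothing by bin means: every value becomes its bin's mean, rounded half up
by_means = [[int(sum(b) / len(b) + 0.5)] * len(b) for b in bins]

# smoothing by bin boundaries: every value snaps to the closer of its bin's min/max
by_boundaries = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
                 for b in bins]

print(by_means)       # [[27, 27, 27], [58, 58, 58, 58, 58], [81, 81, 81, 81]]
print(by_boundaries)  # [[18, 18, 38], [50, 50, 66, 66, 66], [68, 68, 92, 92]]
```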
## Question 2
### a. what "Naive" means
It describes the assumption that all attributes are conditionally independent of one another given the class.
### b. F-Measure
* used to evaluate the balance between recall and precision
> * recall = $\frac{TP}{TP+FN}$: how many of the actually **True** items are recalled
> * precision = $\frac{TP}{TP+FP}$: how many of the items predicted **True** are actually true
* F-measure = $\frac{TP}{TP + \frac{1}{2}(FP+FN)}$ (see the sketch below)
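A tiny sketch of these three measures; the confusion-matrix counts are made-up placeholders, not from the exam:

```python
tp, fp, fn = 8, 2, 4               # hypothetical counts

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = tp / (tp + 0.5 * (fp + fn))   # equivalent to 2PR / (P + R)
print(precision, recall, f1)       # ≈ 0.80, 0.67, 0.73
```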
## Question 3
### a. Nominal Attributes
* Jaccard Index = $\frac{|A \cap B|}{|A \cup B|}$
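A one-line sketch with made-up nominal value sets:

```python
A = {"red", "green", "blue"}
B = {"green", "blue", "yellow"}
print(len(A & B) / len(A | B))  # |A ∩ B| / |A ∪ B| = 2 / 4 = 0.5
```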
### b. Term-Frequency Vectors
* Cosine Similarity = $\frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
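A short sketch with two made-up term-frequency vectors:

```python
import math

a = [3, 0, 1, 2]
b = [1, 1, 0, 2]

dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
print(dot / (norm_a * norm_b))  # cosine similarity
```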
## Question 4
### 5 Vs
1. Volume
    * a large volume of data
2. Variety
    * data comes in a variety of structures
3. Value
    * data carries value, even when it seems unrelated to anything else
4. Velocity
    * real-time processing
5. Veracity
    * data collected from the real world must be accurate and trustworthy
## Question 5
### a.
T
:::danger
WRONG
:::
### b.
F; the criteria for selecting patterns of interest should be defined by the situation and the user.
### c.
F; different methods may consider attributes in different priority orders, which results in different decision trees.
### d.
F; no algorithm is best for every dataset. The best one should be chosen based on the characteristics of the dataset, the situation, and the user.
### e.
F; a model whose assumptions better match the actual data distribution is more likely to perform well.
### f.
F; many metrics, such as accuracy, precision, recall, F1 score, and area under the ROC curve, should be considered in order to measure the performance of a model.
### g.
F; it is possible.
## Question 6
### a. Contingency Table
| | b | !b | sum |
|-|-|-|-|
|**c** |3|2|5|
|**!c** |4|1|5|
|**sum**|7|3|10|

| | a | !a | sum |
|-|-|-|-|
|**d** |4|5|9|
|**!d** |1|0|1|
|**sum**|5|5|10|

| | b | !b | sum |
|-|-|-|-|
|**d** |6|3|9|
|**!d** |1|0|1|
|**sum**|7|3|10|

| | c | !c | sum |
|-|-|-|-|
|**e** |2|4|6|
|**!e** |2|2|4|
|**sum**|4|6|10|

| | a | !a | sum |
|-|-|-|-|
|**c** |2|3|5|
|**!c** |3|2|5|
|**sum**|5|5|10|
### b. rank in descending order
#### support
> bc= 3
> ad= 4
> bd= 6
> ec= 2
> ca= 2
#### confidence
> {b} $\rightarrow$ {c} = 3/7
> {a} $\rightarrow$ {d} = 4/5
> {b} $\rightarrow$ {d} = 6/7
> {e} $\rightarrow$ {c} = 2/6
> {c} $\rightarrow$ {a} = 2/5
#### lift
:::info
Lift = $\frac{P(A|B)}{P(A)} = \frac{P(A \cap B)}{P(A) \cdot P(B)}$
:::
> {b} $\rightarrow$ {c} = $\frac{0.3}{0.5 \times 0.7}$
> {a} $\rightarrow$ {d} = $\frac{0.4}{0.9 \times 0.5}$
> {b} $\rightarrow$ {d} = $\frac{0.6}{0.9 \times 0.7}$
> {e} $\rightarrow$ {c} = $\frac{0.2}{0.4 \times 0.6}$
> {c} $\rightarrow$ {a} = $\frac{0.2}{0.5 \times 0.5}$
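A minimal sketch of the three rule measures from co-occurrence counts, using the {b} $\rightarrow$ {c} numbers from the tables above (support count 3, b in 7 of 10 transactions, c in 5):

```python
def rule_metrics(n_ab, n_a, n_b, n_total):
    """Support, confidence, and lift for the rule A -> B."""
    support = n_ab / n_total                               # P(A and B)
    confidence = n_ab / n_a                                # P(B | A)
    lift = support / ((n_a / n_total) * (n_b / n_total))   # P(A and B) / (P(A) P(B))
    return support, confidence, lift

print(rule_metrics(n_ab=3, n_a=7, n_b=5, n_total=10))      # rule {b} -> {c}
```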
### c. Apriori Algorithm
#### one element
> d= 9
> b= 6
> e= 6
> a= 5
> c= 4
#### two elements
> db= 6
> de= 6
> da= 4
> dc= 4
> be= 4
> ~~ba= 3~~
> ~~bc= 3~~
> ea= 4
> ~~ec= 2~~
> ~~ac= 2~~
#### three elements
> dbe= 4
> ~~dba= 2~~
> dea= 4
> ~~bea= 2~~
> ...
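A rough sketch of the level-wise Apriori idea used above: count candidates, keep only those meeting min_support, then join survivors into the next level. The transaction list and the threshold of 2 are made-up placeholders, not the exam data:

```python
from itertools import combinations

transactions = [{"a", "d"}, {"b", "d", "e"}, {"a", "b", "d", "e"}, {"c", "d"}]
min_support = 2  # absolute count

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # count each candidate's support in this level
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # join step: build (k+1)-itemsets, pruning any with an infrequent subset
        level = []
        for a, b in combinations(list(survivors), 2):
            cand = a | b
            if len(cand) == k + 1 and all(frozenset(s) in survivors
                                          for s in combinations(cand, k)):
                level.append(cand)
        level = list(set(level))
        k += 1
    return frequent

for itemset, count in sorted(apriori(transactions, min_support).items(),
                             key=lambda kv: (-kv[1], len(kv[0]))):
    print(set(itemset), count)
```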
## Question 7
### a. Comparison
* Decision Tree
    * Pros:
        * Easy to understand & explain
        * Can be visualized
    * Cons:
        * Prone to overfitting
        * Sensitive to noise
        * Not well suited to complicated data
* Naive Bayes
    * Pros:
        * Easy to understand & explain
        * Simple math
    * Cons:
        * Assumes all attributes are independent of each other, which is generally not true
* Neural Network
    * Pros:
        * Performs well on both simple and complicated data
    * Cons:
        * Difficult to interpret
        * Demands significant computing resources
### b. K-means clustering algorithm
* Pros:
    * Easy to implement
    * Scales to large datasets
* Cons:
    * Sensitive to noise and outliers
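A minimal NumPy sketch of the k-means loop (assignment + update steps) on made-up 2-D points, with k assumed to be 2:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # toy data

def kmeans(X, k=2, n_iter=100):
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(X, k=2)
print(centers)
```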
### c. Class imbalance
* Definition:
    * the distribution of classes in a dataset is highly uneven
* Possible methods to fix it:
    * undersampling: randomly remove samples from the majority classes
    * oversampling: randomly duplicate samples from the minority classes
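A small sketch of both resampling strategies on made-up labels (90 negatives vs. 10 positives); only indices are resampled here:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)          # imbalanced toy labels

neg_idx = np.where(y == 0)[0]
pos_idx = np.where(y == 1)[0]

# undersampling: randomly drop majority-class samples down to the minority size
under_idx = np.concatenate([rng.choice(neg_idx, size=len(pos_idx), replace=False), pos_idx])

# oversampling: randomly duplicate minority-class samples up to the majority size
over_idx = np.concatenate([neg_idx, rng.choice(pos_idx, size=len(neg_idx), replace=True)])

print(len(under_idx), len(over_idx))       # 20 and 180, both balanced
```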
## Question 8
### a. Cross Validation
* The process of splitting the dataset into k folds, training the model on k-1 of them, and validating on the remaining fold. After repeating this k times 🔁, the average validation error indicates how accurate the model is (a minimal sketch follows).
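A bare-bones k-fold sketch (k assumed to be 5) on made-up data, with an ordinary least-squares fit standing in for the model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
folds = np.array_split(rng.permutation(len(X)), k)

errors = []
for i in range(k):
    val_idx = folds[i]                                        # held-out fold
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)  # "train"
    pred = X[val_idx] @ w                                     # "validate"
    errors.append(np.mean((pred - y[val_idx]) ** 2))

print("average validation error:", np.mean(errors))
```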
### b. False Negatives
* The number of actually positive cases that were incorrectly classified as negative.
## Question 9
### a. Information Gain
:::info
* entropy = $-p \cdot log_2(p) - q \cdot log_2(q)$
:::
* entropy(all) = $- \frac{4}{10} log_2(\frac{4}{10}) - \frac{6}{10} log_2(\frac{6}{10})$
* AT(4+, 3-)
* entropy(AT)= $- \frac{4}{7} log_2(\frac{4}{7}) - \frac{3}{7} log_2(\frac{3}{7})$
* AF(0+, 3-)
* entropy(AF)= $- \frac{0}{3} log_2(\frac{0}{3}) - \frac{3}{3} log_2(\frac{3}{3}) = 0$
* gain(A) = $entropy(all) - \frac{7}{10} \cdot entropy(AT) - \frac{3}{10} \cdot entropy(AF)$
* BT
* BF
* gain(B)
* compare gain(A) & gain(B) and choose the ==larger one== (see the sketch below)
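A short sketch of the gain computation for attribute A, reusing the class counts above (4+/6- overall, 4+/3- on the True branch, 0+/3- on the False branch):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                    # 0 * log2(0) is treated as 0
            result -= p * math.log2(p)
    return result

e_all = entropy(4, 6)
gain_a = e_all - (7 / 10) * entropy(4, 3) - (3 / 10) * entropy(0, 3)
print(round(gain_a, 3))              # ≈ 0.281
```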
### b. GINI Index
:::info
Gini index = 1 - $\sum (P_i)^2$
:::
* Gini(all) = $1 - (\frac{4}{10})^2 - (\frac{6}{10})^2$
* AT(4+, 3-)
* Gini(AT) = $1 - (\frac{4}{7})^2 - (\frac{3}{7})^2$
* AF(0+, 3-)
* Gini(AF) = $1 - 0 - 1 = 0$
* Gini(A) = $\frac{7}{10} \cdot Gini(AT) + \frac{3}{10} \cdot Gini(AF)$
* BT
* BF
* Gini(B)
* Compare Gini(A) & Gini(B) and choose the ==smaller one== (see the sketch below)
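The same idea for the Gini index, again with attribute A's counts; the attribute with the smaller weighted Gini (equivalently, the larger reduction) is chosen:

```python
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

g_all = gini(4, 6)
g_a = (7 / 10) * gini(4, 3) + (3 / 10) * gini(0, 3)   # weighted Gini of the split on A
print(round(g_a, 3), round(g_all - g_a, 3))           # ≈ 0.343 and 0.137
```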