# Biomedical Data Mining
## 1. Introduction to data mining
### What is data mining
> Definitions vary, because researchers come from very different fields.
* knowledge extraction
* The process of identifying valid, novel
### Process
### Tasks
* Prediction
* classification: predict a class label
* regression: predict a numeric value
* probability estimation
* Description
* clustering
* dependency modeling
* trend detection
#### Inductive learning - prediction
* Given: examples of a function, (x, f(x))
* Predict: $\hat f (x)$ of a new x
* Discrete f(x) - classification
* Continuous f(x) - regression
* f(x) = Prob(x) - probability estimation
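
As a minimal sketch of the setup above, the snippet below predicts $\hat f(x)$ for a new x from a list of (x, f(x)) examples. The 1-nearest-neighbour rule, the function name `nearest_neighbor_predict`, and the toy data are assumptions chosen for illustration; the notes do not prescribe a particular learner here.

```python
def nearest_neighbor_predict(examples, x_new):
    """Return f(x) of the training example whose x is closest to x_new.

    `examples` is a list of (x, f(x)) pairs with a single numeric x.
    """
    closest_x, closest_fx = min(examples, key=lambda pair: abs(pair[0] - x_new))
    return closest_fx


# Discrete f(x): classification (hypothetical toy data)
labelled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
class_prediction = nearest_neighbor_predict(labelled, 7.5)

# Continuous f(x): regression (hypothetical toy data)
measured = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
value_prediction = nearest_neighbor_predict(measured, 2.2)
```

The same function serves both tasks; only the type of f(x) in the examples changes.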
#### Commonly used approaches
* **Decision Tree** (the focus of this course)
* Rule induction
* Neural network
* Bayesian learning
* Evolution-based learning
* Instance-based learning
#### Cluster Analysis - description
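## 2. Intro to Biology (Molecular Biology)
* Living organisms (life) express different traits because of proteins
* protein structure matters
* primary through quaternary structure
* Nucleic acid
* double-stranded DNA / single-stranded RNA: the recipe of life
* Central dogma of molecular biology
* Codon
* nucleic acid segments that carry a hereditary function
* This is quite important; it comes up again in the motif finding problem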
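
The central dogma and codons can be illustrated with a short sketch: transcribe a DNA coding strand into mRNA, split it into 3-base codons, and look each codon up. The function names, the input sequence, and `PARTIAL_CODON_TABLE` are hypothetical; the table holds only four entries of the standard genetic code, so this is not a complete translator.

```python
def transcribe(dna_coding_strand):
    """Transcribe a DNA coding strand into mRNA by replacing T with U."""
    return dna_coding_strand.upper().replace("T", "U")


def split_codons(mrna):
    """Split mRNA into consecutive 3-base codons, dropping an incomplete tail."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


# Only four entries of the standard genetic code, for illustration.
PARTIAL_CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "Stop"}


def translate(codons):
    """Map codons to amino acids until a stop codon; unknown codons become '?'."""
    peptide = []
    for codon in codons:
        amino_acid = PARTIAL_CODON_TABLE.get(codon, "?")
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide


mrna = transcribe("ATGTTTGGCTAA")
codons = split_codons(mrna)
peptide = translate(codons)
```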
## 3. Motif Finding Problem (in primary structure)
## 4. Bayesian Learning
## 5. Decision Tree
> Advantage: comprehensibility (interpretability)
> Disadvantage (?): unstable and sensitive
* Gini Index/Information Gain
* Missing Value
* Avoid overfitting
* Pruning
* Limitation
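
The two split criteria listed above can be sketched in a few lines. This is a minimal illustration that assumes class labels are given as plain Python lists; the function names and the toy splits are hypothetical.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves each child as mixed as the parent gains nothing.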
## 6. Ensemble learning - Random Forest
> Ensemble learning should use sensitive learners, e.g. DT, NN
* Bagging
* Boosting
* sensitive to noise
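
Bagging can be sketched as follows: train each model on a bootstrap resample (drawn with replacement) and combine the models by majority vote. The base learner here is a one-threshold decision stump, which is an assumption made to keep the example short; the names `train_stump` and `bagging` and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-threshold classifier on (x, label) pairs with labels 0/1."""
    best = None
    for threshold, _ in samples:
        for left_label in (0, 1):
            # Count training errors for "predict left_label when x <= threshold".
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in samples
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


def bagging(samples, n_models, rng):
    """Train n_models stumps on bootstrap resamples; predict by majority vote."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]  # with replacement
        models.append(train_stump(bootstrap))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


rng = random.Random(0)
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
ensemble = bagging(data, n_models=25, rng=rng)
```

Each stump can change noticeably between resamples, which is the kind of sensitivity the vote is meant to average out.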
## 7. Advanced Motif finding
> Goal: find motifs in secondary structure, then try to extend the approach to tertiary structure and beyond
* RNA secondary structure
* Genetic Programming for RNA motifs
* For GP, see Evolutionary Computation
## 8. Microarray & Gene Expression Analysis
> This is a tool that developed independently within bioinformatics
## 9. Evaluating Hypothesis
* Bias & Variance of a model
* Statistical hypothesis testing
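
For reference, the usual way to make "bias and variance of a model" concrete is the decomposition of expected squared error. The form below assumes squared-error loss and data generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$; these assumptions and the symbols $\varepsilon$, $\sigma^2$ are not stated in the notes.

$$
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

The expectation is taken over training sets and noise; the noise term cannot be reduced by any choice of model.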
## 10. Unbalanced Data
> Most prediction algorithms assume they are applied to balanced data
* Preprocessing (make data balanced)
* Under-sampling
* Over-sampling
* Add synthesized data (e.g. SMOTE)
* Evaluation
* stratified sampling
* Leave-One-Out Cross Validation (LOOCV)
* Bootstrap (sampling with replacement)
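
The two simplest preprocessing options above, random under-sampling and random over-sampling, might look like the sketch below. It assumes samples are (features, label) pairs; the function names and the toy data are hypothetical. Note that over-sampling here only duplicates existing minority samples; it does not synthesize new points the way SMOTE does.

```python
import random


def split_by_class(samples):
    """Group (features, label) pairs by their label."""
    groups = {}
    for features, label in samples:
        groups.setdefault(label, []).append((features, label))
    return groups


def under_sample(samples, rng):
    """Shrink every class to the size of the smallest class."""
    groups = split_by_class(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))  # without replacement
    return balanced


def over_sample(samples, rng):
    """Grow every class to the size of the largest class by duplication."""
    groups = split_by_class(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


rng = random.Random(42)
data = [([i], "healthy") for i in range(8)] + [([i], "disease") for i in range(2)]
under = under_sample(data, rng)
over = over_sample(data, rng)
```

Under-sampling discards majority-class data, while over-sampling by duplication repeats minority samples exactly, which is one reason synthesized data is listed as a separate option.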
## 11. Meta Learning
> Can be viewed as a form of ensemble learning
> Arguably a self-adapted learner (see Evolutionary Computation)
* Stacking (Stacked Generalization)
* Cascade generalization
* Meta Decision Tree (MDT)
### Comparison of Ensemble Learning
* Voting
* Variance reduction
* Bagging
* Boosting
* Bias reduction
* stacking
* cascade generalization
* MDT
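## 12. Case Study: Patient Controlled Analgesia
> A patient-controlled analgesia (PCA) device.
> After surgery, a patient in pain can press a button to self-administer analgesic. How large a dose should the machine deliver?
* Collect the opinions of many doctors as expert knowledge
* In the end, the predictions were close to the answers of doctors who responded casually
* Doctors who carefully considered many constraints ended up as outliers instead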