# Biomedical Data Mining

## 1. Introduction to data mining

### What is data mining

> Definitions vary, because researchers come from very different fields.

* knowledge extraction
* The process of identifying valid, novel

### Process

### Tasks

* Prediction
    * classification: predict a category
    * regression: predict a numeric value
    * probability estimation
* Description
    * clustering
    * dependency modeling
    * trend detection

#### Inductive learning - prediction

* Given: examples of a function $(x, f(x))$
* Predict: $\hat f (x)$ of a new $x$
* Discrete $f(x)$ - classification
* Continuous $f(x)$ - regression
* $f(x) = \mathrm{Prob}(x)$ - probability estimation

#### Commonly used approaches

* **Decision Tree** (the focus of this course)
* Rule induction
* Neural network
* Bayesian learning
* Evolution-based learning
* Instance-based learning

#### Cluster Analysis - description

## 2. Intro to Biology (molecular biology)

* Living organisms (life) express different traits because of proteins
    * protein structure matters
    * primary through quaternary structure
* Nucleic acid
    * double-stranded DNA / single-stranded RNA - recipe of life
    * central dogma of molecular biology
    * codon
    * nucleic acid segments that carry a hereditary function
        * This is quite important; it comes up again in the motif finding problem

## 3. Motif Finding Problem (in primary structure)

## 4. Bayesian Learning

## 5. Decision Tree

> Advantage: comprehensibility (interpretability)
>
> Disadvantage (?): unstable & sensitive

* Gini Index/Information Gain
* Missing Value
* Avoid overfitting
    * Pruning
* Limitation

## 6. Ensemble learning - Random Forest

> Ensemble learning should use a sensitive learner, e.g. DT, NN

* Bagging
* Boosting
    * sensitive to noise

## 7. Advanced Motif finding

> The goal is to find motifs from secondary structure and try to extend the approach to tertiary structure and above

* RNA secondary structure
* Genetic Programming for RNA motifs
    * For GP, see Evolutionary Computation

## 8. Microarray & Gene Expression Analysis

> This is a tool that developed independently within bioinformatics

## 9. Evaluating Hypothesis

* Bias & Variance of a model
* Statistical hypothesis testing

## 10. Unbalanced Data

> Most prediction algorithms assume they are applied to balanced data

* Preprocessing (make data balanced)
    * Under-sampling
    * Over-sampling
    * Add synthesized data (e.g. SMOTE)
* Evaluation
    * stratified sampling
    * Leave-One-Out Cross Validation (LOOCV)
    * Bootstrap (sampling with replacement)

## 11. Meta Learning

> Can be viewed as a kind of ensemble learning
>
> Something like a self-adapted learner (see Evolutionary Computation)

* Stacking (Stacked Generalization)
* Cascade generalization
* Meta Decision Tree (MDT)

### Comparison of Ensemble Learning

* Voting
* Variance reduction
    * Bagging
    * Boosting
* Bias reduction
    * stacking
    * cascade generalization
    * MDT

## 12. Case study: Patient Controlled Analgesia

> Patient-controlled analgesia (PCA) pump.
>
> After surgery, a patient in pain can press a button to self-administer analgesic. How large a dose should the machine deliver at that point?

* The opinions of many doctors were collected as expert knowledge
* In the end, the predictions were close to the answers of doctors who responded casually
* The doctors who carefully considered many constraints ended up as outliers
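
Returning to the Voting entry in the ensemble comparison of section 11, the following is a minimal sketch of plurality voting over base classifiers. The labels and the three classifiers (`clf_a`, `clf_b`, `clf_c`) are hypothetical toy data made up for illustration, not results from the course or from the PCA study. They are chosen so that each classifier errs on a different third of the samples.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list by plurality vote."""
    # zip(*predictions) regroups the labels sample by sample.
    # With three voters and binary labels there can be no ties.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(predicted, truth):
    """Fraction of positions where the prediction matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical ground-truth labels (illustration only).
truth = [1, 0, 1, 1, 0, 1, 0, 0, 1]

# Three hypothetical base classifiers, each wrong on a different third.
clf_a = [0, 1, 0, 1, 0, 1, 0, 0, 1]  # wrong on samples 0-2
clf_b = [1, 0, 1, 0, 1, 0, 0, 0, 1]  # wrong on samples 3-5
clf_c = [1, 0, 1, 1, 0, 1, 1, 1, 0]  # wrong on samples 6-8

ensemble = majority_vote([clf_a, clf_b, clf_c])

for name, preds in [("a", clf_a), ("b", clf_b), ("c", clf_c), ("vote", ensemble)]:
    print(name, round(accuracy(preds, truth), 3))
```

In this toy setup every sample has at most one wrong vote, so the ensemble recovers every label even though each base classifier is right on only 6 of 9 samples. The effect depends on the errors not overlapping: if two of the three classifiers were wrong on the same sample, the vote on that sample would be wrong as well.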