# Statistics and Data Analysis: Lecture 3
###### tags: `20200711` `statistics`

吳漢銘, Associate Professor, Department of Statistics, National Taipei University

## Outline: Data Processing Methods
Smoothing techniques / missing-value handling / data transformation / resampling methods

Topic 1 [Advanced optional reading]
* Moving Average
* Fitting Curves: lowess
* Kernel Density Estimation
* Cubic Spline Interpolation

Topic 2
* Missing Data
* Missingness Mechanism
    * Missing by Design
    * Missing Completely at Random (MCAR)
    * Missing at Random (MAR)
    * Missing Not at Random (MNAR)

Topic 3
* R Packages for Dealing With Missing Values: VIM, MICE
* Visualizing the Pattern of Missing Data
* Traditional Approaches to Handling Missing Data
* Imputation Methods: KNN
* Which Imputation Method?

Topic 4
* Why transform data?
* Common data transformations
    * Log Transformation
    * Box-Cox Transformation
    * Standardization
* Which data transformation should you use?

Topic 5
* Training data and Testing data
* Resampling methods
    * Jackknife (leave-one-out)
    * Bootstrapping
* Ensemble Learning
    * bagging
    * boosting

Topic 6
* Imbalanced Data Problem
    * under-sampling
    * over-sampling

---

## Missing Data
## Handling Missing Values
## Missingness Mechanism
## Missing by Design / Missing Completely at Random
## Missing at Random (MAR) / Missing Not at Random (MNAR)
## Some Notes
## Missing Values in R
## NA in Summary Functions
## NA in Modeling Functions
## Other Special Values in R
## R Packages for Dealing With Missing Values
## R Package: MICE
## Generates Multivariate Imputations by Chained Equations (MICE)
## Exploring Missing Data
## Visualizing the Pattern of Missing Data
## Matrix Plot
## Number of Observations Per Patterns for All Pairs of Variables
## Marginplot
## List-wise Deletion
## Pairwise Deletion
## Mean Substitution
## K-Nearest Neighbour Imputation
## ***kNN {VIM}:*** k-Nearest Neighbour Imputation
## ***matrixplot***: Custom Mean Function
## Which Imputation Method?
## Classical (Numerical) Data Table
## Why Transform Data?
## Common Data Transformations
## Example: Software Inspection Data
## Log Transformation
## Log Transformation: How to Handle Negative Data Values?
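
The slide content for handling negative values in a log transformation is not reproduced in these notes, so the following is only a minimal sketch of two commonly used workarounds, not the lecturer's method. The lecture itself works in R; Python is used here purely for illustration. The function names `shifted_log` and `signed_log1p` and the parameter `floor` are hypothetical. The first approach adds a constant so that every value is positive before taking the log; the second keeps the sign and compresses the magnitude.

```python
import math


def shifted_log(values, floor=1.0):
    """Log-transform after shifting so the smallest value is at least `floor`.

    `floor` is a hypothetical tuning parameter (must be > 0). The shift that
    was applied is returned so the transform can be inverted later.
    """
    if floor <= 0:
        raise ValueError("floor must be positive")
    # No shift is applied when the data are already >= floor.
    shift = max(0.0, floor - min(values))
    return [math.log(v + shift) for v in values], shift


def signed_log1p(x):
    """Sign-preserving alternative: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)
```

Note that the shift depends on the observed minimum, so the same shift should be reused when transforming new data; the signed variant avoids that dependence at the cost of no longer being a plain logarithm.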
## Box-Cox Transformations
## Modified Box-Cox Transformations
## Standardization
## Example: Standardization
## Example: Microarray Data of Yeast Cell Cycle
## Standardization in Time Series Microarray Gene Expression Experiments
## Example: Crab Data
## Example: cDNA Microarray Gene Expression Data
## Which Data Transformation Should You Use?
## Classification: k-fold Cross-Validation Error Rates
## Split Data into Test and Train Set According to Group Labels
## Jackknife Resampling: Leave-one-out
## Bootstrap Methods
## Bootstrapping
## ***bootstrap*** Package
## Bagging: Bootstrap Aggregating
## Boosting
## Example: Apply ***rpart*** to Vehicle Data
## ***adabag***: An R Package for Classification with Boosting and Bagging
## Example: 10-fold CV adaboost.M1
## The Imbalanced Data Problem
## ***unbalanced***: Racing for Unbalanced Methods
## The Balancing Technique
## Ionosphere dataset ***ubIonosphere {unbalanced}***
## Compare the Performances using SVM
## ***ubRacing {unbalanced}***: Racing for Strategy Selection
## Racing for Strategy Selection
## Useful R Packages
## Advanced Optional Reading
### Simple Moving Average
### Moving Average Acting as Resistance - Potential Sell Signal
### Smoothing in R
### Fitting Curves
### ***lowess {stats}***: locally-weighted polynomial regression
### Density Plots (Smoothed Histograms)
### Kernel Density Estimation
### Spline approximate to the top profile of the ruddy duck
### ***smooth.spline {stats}***: Fit a Smoothing Spline
### Cubic Spline Interpolation
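
To make cubic spline interpolation concrete, here is a minimal sketch of a natural cubic spline (second derivative set to zero at both end points), assuming strictly increasing knots. It is an illustration only, not the lecturer's code: the lecture works in R, Python is used here for a self-contained example, and the function name `natural_cubic_spline` is hypothetical. The second derivatives at the knots come from a tridiagonal linear system, solved here with the Thomas algorithm.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function S(x) that interpolates (xs, ys) with a natural cubic spline."""
    n = len(xs) - 1  # number of intervals
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points and matching lengths")
    if any(xs[i] >= xs[i + 1] for i in range(n)):
        raise ValueError("xs must be strictly increasing")
    h = [xs[i + 1] - xs[i] for i in range(n)]  # interval widths

    # Second derivatives M[0..n]; natural boundary condition: M[0] = M[n] = 0.
    M = [0.0] * (n + 1)
    if n > 1:
        # Row k of the tridiagonal system corresponds to interior knot i = k + 1:
        # h[i-1]*M[i-1] + 2*(h[i-1]+h[i])*M[i] + h[i]*M[i+1] = rhs
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        # Thomas algorithm: forward elimination ...
        for k in range(1, n - 1):
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        # ... then back-substitution.
        M[n - 1] = rhs[n - 2] / diag[n - 2]
        for k in range(n - 3, -1, -1):
            M[k + 1] = (rhs[k] - h[k + 1] * M[k + 2]) / diag[k]

    def spline(x):
        # Find the interval containing x; points outside the knots reuse the end interval.
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((M[i] * a ** 3 + M[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - M[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - M[i + 1] * h[i] / 6.0) * b)

    return spline
```

For example, `s = natural_cubic_spline([0, 1, 2, 3], [0, 1, 0, 1])` returns a callable that passes through all four points. Unlike a smoothing spline, an interpolating spline reproduces the data exactly, so it is not a good choice for noisy observations.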