# Statistics and Data Analysis Lecture 4

###### tags: `20200711` `statistics`

吳漢銘, Associate Professor, Department of Statistics, National Taipei University

> ## Plotting matters

## Outline: Exploratory Data Analysis

Statistical charts

## Main References

EDA with R: Course Content

* Making exploratory **graphs**
* Principles of analytic **graphics**
* Plotting systems and **graphics** devices in R
* The base, lattice, and ggplot2 **plotting** systems in R
* Clustering methods
* Dimension reduction techniques

---

## John Tukey (1915~2000): The Picasso of Statistics

> Ask the right question.

## "Statistics should be a science, not mathematics!"

## What is EDA

## What Do They Say About EDA?

## Data Analysis Procedures

## Example 2: Who Wrote Trump's Tweets?

The like counts differ.

### Any doubts?

### Comparing posting times

The posting times on iOS and Android are very different.

### Comparing tweet text

### Sentiment analysis

Trump's wording is stronger.

Via a Poisson test.

Using log2 as the base separates the data by direction, which is a good approach.

### Summary: Who Wrote Trump's Tweets?

## Infovis and Statistical Graphics: Different Goals, Different Looks

## Why Data Visualization?

There are four sets of x-y data whose summary statistics are exactly the same, yet the results differ greatly.

> Without looking at the plot, you will misjudge the data.

## Anscombe's Quartet

## The Datasaurus Dozen

***install.packages("datasauRus")***

If the goal is classification or clustering but the wrong method is used, accuracy tops out at about 50%.

You have to see the data before you can take the next step.

## The Datasaurus Dozen

More examples

Even when a positive relationship is visible, other variables may also be correlated.

## Graphical Perception

## Index Plot

## Histogram

## Misuse of Charts

## Example: rgl, explore a comet

## Complex Heatmap

Turns numbers into colors.

## Applications: Array Image

The signals are all random, yet the image pattern is skewed to one side.

This cannot be judged by looking at the columns alone.

## Applications: Mobile Data

Longitude and latitude are converted to administrative districts, which are then represented by color.

There is no data for the mornings of day 2 and day 4.

1. Plots of the data matter: you need to see what the data look like before you know what the next step should be.
2. When reading a plot, check the scale first.
3. A heat map represents numbers as colors, which makes some patterns observable.
4. Data cleaning and transformation matter.

## Applications: Eye-tracking, mouse clicking

### Reading external image files

### Map of Taiwan

### Marking points on the map

## Big Data: The Era of 9 Vs

## The Challenge of Visualizing Big Data

## How Can We Visualize and Interact with Billion+ Record Databases in Real-time?

## ***hexbin*** Package: Hexagonal Binning Routines

## ***tabplot***: Tableplot, a Visualization of Large Datasets

#### tableplot(iris, nBins=150, sortCol=5)

#### tableplot(iris, nBins=50, sortCol=4)

#### tableplot(diamonds)

## Symbolic Data Analysis (Billard and Diday, JASA 2003)

## Recommended Reading

> 1. **AGGREGATION** From Tables and Means to Least Squares
> 2. **INFORMATION** Its Measurement and Rate of Change
> 3. **LIKELIHOOD** Calibration on a Probability Scale
> 4. **INTERCOMPARISON** Within-Sample Variation as a Standard
> 5. **REGRESSION** Multivariate Analysis, Bayesian Inference, and Causal Inference
> 6. **DESIGN** Experimental Planning and the Role of Randomization
> 7. **RESIDUAL** Scientific Logic, Model Comparison, and Diagnostic Display

## Data Without Borders!

## Future Directions?

## Advanced Optional Reading

### Example 1: The Doubs Fish Data

#### River Doubs Map

#### The Doubs Fish Data: Files

#### The Doubs Fish Data: Preprocessing

#### Data Extraction: Read Data

#### Species Data: First Contact

Basic functions

#### Overall Distribution of Abundances (Dominance Codes)

#### Species Data: A Closer Look

Map of the Locations of the Sites

#### Note: Reconstruction

#### Maps of Some Fish Species

#### Compare Species: Number of Occurrences

#### Compare Sites: Species Richness

#### Compute Alpha Diversity Indices of the Fish Communities

#### Transformation and Standardization of the Species Data

#### Scale Abundances by Dividing Them by the Site Totals

#### Compute Relative Frequencies by Rows (Site Profiles)

#### Standardization by Both Columns and Rows

#### Boxplots of Transformed Abundances of a Common Species (Stone Loach)

#### Plot Profiles Along the Upstream-Downstream Gradient

#### Bubble Maps of Some Environmental Variables

#### Examine the Variation of Some Descriptors Along the Stream: Line Plots

#### Scatter Plots for All Pairs of Environmental Variables

#### Simple Transformation of An Environmental Variable

#### Standardization of All Environmental Variables

### Summary & Food for Thought

### ***anscombe {datasets}*** Anscombe's Quartet of ‘Identical’ Simple Linear Regressions

### Extensions of Scatterplots

### MA plot Scatterplot for Gene Expression Data

### ***heatmap {stats}***

### Visualizing Categorical Data: vcd
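
The notes only name the vcd package here, so the following is a minimal sketch of what a categorical-data plot might look like, not the lecture's own example. It assumes the built-in `Titanic` table as the dataset (not mentioned in the notes) and draws a mosaic plot with `vcd::mosaic` when vcd is installed, falling back to base R's `mosaicplot` otherwise. The names `class_by_survival`, `survival_rate`, and `out_file` are illustrative.

```r
# Sketch: visualize a two-way categorical table (Class x Survived).
# Assumption: the built-in Titanic table stands in for the lecture's data.

# Collapse the 4-way Titanic table to the two factors of interest.
class_by_survival <- margin.table(Titanic, margin = c(1, 4))

# Row proportions: share of each class that did or did not survive.
survival_rate <- prop.table(class_by_survival, margin = 1)
print(round(survival_rate, 3))

# Draw into a temporary PDF so the script also runs non-interactively.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
if (requireNamespace("vcd", quietly = TRUE)) {
  # vcd's mosaic display; shade = TRUE colors tiles by residuals.
  vcd::mosaic(class_by_survival, shade = TRUE)
} else {
  # Fallback from base graphics when vcd is not installed.
  mosaicplot(class_by_survival, shade = TRUE,
             main = "Titanic: Class by Survived")
}
dev.off()
```

Dropping the `pdf()` and `dev.off()` lines draws the plot on the active screen device instead. In the spirit of the lecture's advice, the tile areas show the cell counts at a glance, while the printed proportions give the same information as numbers.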